AI & DATA
Clean article content
ready for LLMs
Strip navigation, ads, footers, and scripts. Extract a clean title, body, author, and date from any URL. Powered by Mozilla’s Readability, the same algorithm Firefox Reader Mode uses.
~5KB
Avg Clean Article Size
~90%
Size Reduction vs Raw HTML
$0.005
Per Article
JSON
LLM-Ready Output
USE CASES
What you can do with this data
RAG Pipeline Ingestion
Feed clean article text into vector databases for RAG. 90% less noise than raw HTML means cleaner embeddings and better retrieval.
Content Summarization at Scale
Bulk-process thousands of URLs into clean text, then summarize with GPT-4 or Claude. Input token cost drops ~90% vs raw HTML.
News Monitoring
Extract clean text from news URLs for brand monitoring, competitive intelligence, and sentiment analysis.
Dataset Building for ML
Build training datasets from web articles. Clean text + metadata (author, date, site) for fine-tuning or classification.
Newsletter Curation
Pull clean body text from shared links. Auto-summarize, auto-tag, auto-publish to your newsletter.
Research Assistant Apps
Build apps that read articles for users. Integrate with Claude, GPT, or Gemini to extract an article, then answer questions about its content.
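
For the RAG use case above, clean body text is usually split into overlapping chunks before embedding. A minimal sketch, assuming word-based chunks; the function name `chunk_text` and the default sizes are illustrative and not part of this actor.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-based chunks that overlap by `overlap` words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks
```

Each chunk can then be sent to whichever embedding model and vector DB you use. The overlap keeps sentences that straddle a boundary retrievable from both sides.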
OUTPUT FIELDS
Fields returned per article
Cleaned title
Clean body text (no ads/nav)
Byline / author
Published date
Main image URL
Language detected
Word count
Excerpt (first 200 chars)
Site name
Canonical URL
All inline images
Reading time estimate
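
To make the field list concrete, here is a sketch of how one article record might be modeled downstream. The field names are hypothetical (check the actor's actual output schema), and the 200 words-per-minute reading speed is an assumption, not the actor's documented formula.

```python
import math
from dataclasses import dataclass, field
from typing import Optional

WORDS_PER_MINUTE = 200  # assumed reading speed, for illustration only


@dataclass
class Article:
    # Hypothetical field names mirroring the list above.
    title: str
    text: str
    url: str
    byline: Optional[str] = None
    published: Optional[str] = None
    site_name: Optional[str] = None
    lang: Optional[str] = None
    lead_image: Optional[str] = None
    images: list[str] = field(default_factory=list)

    @property
    def word_count(self) -> int:
        return len(self.text.split())

    @property
    def excerpt(self) -> str:
        return self.text[:200]  # first 200 characters

    @property
    def reading_minutes(self) -> int:
        # Round up, and never report less than one minute.
        return max(1, math.ceil(self.word_count / WORDS_PER_MINUTE))
```

With this shape, a 450-word article reports a 3-minute read under the assumed speed.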
HOW IT WORKS
Three steps to structured data
Pass URL(s)
A single URL or a list of thousands. Also works on RSS feed entries.
Extraction runs
Fetches page, applies Mozilla Readability, strips nav/ads/scripts, returns structured text.
Use in your stack
JSON output is ready for OpenAI embeddings, Claude, Gemini, or any vector DB (Pinecone, Weaviate, Qdrant).
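
To illustrate what "strips nav/ads/scripts" means in the extraction step, here is a deliberately crude sketch using only Python's standard library. It simply drops the contents of a fixed set of tags; it is not Mozilla Readability, which scores page elements to locate the main article. The names `TextExtractor` and `strip_boilerplate` are illustrative.

```python
from html.parser import HTMLParser

# Tags whose contents are discarded. A crude stand-in for illustration,
# not the heuristics Mozilla Readability actually applies.
SKIP_TAGS = {"script", "style", "nav", "footer", "aside", "noscript"}


class TextExtractor(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0  # > 0 while inside any skipped tag
        self._parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when outside every skipped tag.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self) -> str:
        return " ".join(self._parts)


def strip_boilerplate(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Real pages rarely mark up ads and navigation this cleanly, which is why a scoring-based extractor is used in practice.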
COMPARISON
Why this actor vs alternatives
| Feature | This Actor | Firecrawl | Diffbot |
|---|---|---|---|
| Price per article | $0.005 | ~$0.006–$0.010 | ~$0.010 |
| Extraction algorithm | Mozilla Readability | Proprietary + LLM | Proprietary ML |
| LLM-readable output | Clean markdown + JSON | Markdown | JSON |
| Metadata (author/date) | Yes | Yes | Yes |
| JavaScript-heavy sites | Yes (optional headless browser) | Yes | Yes |
| Free tier | Apify $5 trial credit | 500/mo free | Limited |
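
To see what the price row means at volume, the sketch below multiplies the per-article prices from the table by a monthly article count. The competitor prices are the approximate figures listed above, so treat the output as a rough estimate; `monthly_cost` is an illustrative name.

```python
# Per-article prices in USD, copied from the comparison table above.
# Each entry is (low, high); single prices repeat the same value.
PRICES = {
    "This Actor": (0.005, 0.005),
    "Firecrawl": (0.006, 0.010),  # approximate range
    "Diffbot": (0.010, 0.010),    # approximate
}


def monthly_cost(articles: int) -> dict[str, tuple[float, float]]:
    """Return the (low, high) monthly cost in USD for each option."""
    return {
        name: (round(low * articles, 2), round(high * articles, 2))
        for name, (low, high) in PRICES.items()
    }
```

At 10,000 articles a month this works out to $50 for this actor, roughly $60 to $100 for Firecrawl, and roughly $100 for Diffbot.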
FAQ
Frequently asked questions
What is Mozilla Readability?
The open-source algorithm Firefox Reader Mode uses to extract clean article content. An industry standard that works on ~90% of editorial sites.
Does it handle paywalled articles?
It extracts whatever the page returns. For soft paywalls (teaser visible), you get the teaser. For hard paywalls, you get the login prompt.
What about JS-heavy sites?
Optional headless browser mode renders full JavaScript before extraction. It is slightly slower and adds ~$0.002 per article.
Why use this vs Firecrawl?
Pricing. At the per-article prices in the comparison table ($0.005 vs ~$0.006–$0.010), this actor is roughly 17–50% cheaper. For LLM/RAG pipelines the cost difference compounds fast at 10k+ URLs/month.
Can I batch 10,000 URLs?
Yes, there is no limit. Apify runs the batch in parallel with automatic retries. Typical throughput is 500–1000 articles/min per actor instance.
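
As a rough planning aid for a batch, this sketch combines the figures quoted on this page: $0.005 per article, about $0.002 extra in headless mode, and 500–1000 articles/min per actor instance. The function name and the `headless_share` parameter are illustrative, and because headless mode is slower, the duration estimate is optimistic for headless runs.

```python
import math


def batch_estimate(urls: int, headless_share: float = 0.0) -> dict[str, float]:
    """Estimate cost and run time for one actor instance."""
    base, headless_extra = 0.005, 0.002  # USD per article, from this page
    slow, fast = 500, 1000               # articles per minute, from this page
    cost = urls * (base + headless_extra * headless_share)
    return {
        "cost_usd": round(cost, 2),
        "min_minutes": math.ceil(urls / fast),
        "max_minutes": math.ceil(urls / slow),
    }
```

For 10,000 URLs without headless mode, this gives about $50 and 10 to 20 minutes on a single instance.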
START NOW
Clean article text, ready for your LLM
Cut LLM token costs by ~90% by feeding clean content, not raw HTML.
Extract Articles Free →