AI & DATA

Clean article content
ready for LLMs

Strip navigation, ads, footers, and scripts. Extract a clean title, body, author, and date from any URL. Powered by Mozilla’s Readability, the same algorithm Firefox Reader Mode uses.

~5KB

Avg Clean Article Size

~90%

Size Reduction vs Raw HTML

$0.005

Per Article

JSON

LLM-Ready Output

USE CASES

What you can do with this data

🧬

RAG Pipeline Ingestion

Feed clean article text into vector databases for RAG. Roughly 90% less markup than raw HTML means cleaner embeddings and better retrieval.

📈

Content Summarization at Scale

Bulk-process thousands of URLs into clean text, then summarize with GPT-4 or Claude. Token cost drops ~90% vs raw HTML.

📊

News Monitoring

Extract clean text from news URLs for brand monitoring, competitive intelligence, and sentiment analysis.

📦

Dataset Building for ML

Build training datasets from web articles. Clean text + metadata (author, date, site) for fine-tuning or classification.

📦

Newsletter Curation

Pull clean body text from shared links. Auto-summarize, auto-tag, auto-publish to your newsletter.

📞

Research Assistant Apps

Build apps that read articles for users. Integrate with Claude, GPT, or Gemini to extract articles, then answer questions about their content.

OUTPUT FIELDS

Fields returned per article

Cleaned title

Clean body text (no ads/nav)

Byline / author

Published date

Main image URL

Language detected

Word count

Excerpt (first 200 chars)

Site name

Canonical URL

All inline images

Reading time estimate
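
As a rough illustration, the fields listed above could be modeled as a record like the one below. The key names (`title`, `text`, `byline`, and so on) are assumptions made for this sketch; the actor's actual JSON keys are not specified on this page and may differ.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Article:
    """One extracted article. All key names here are hypothetical."""
    title: str                             # cleaned title
    text: str                              # clean body text (no ads/nav)
    byline: Optional[str] = None           # byline / author
    published_date: Optional[str] = None   # published date, kept as a string
    main_image: Optional[str] = None       # main image URL
    language: Optional[str] = None         # detected language
    word_count: int = 0
    excerpt: str = ""                      # first 200 characters
    site_name: Optional[str] = None
    canonical_url: Optional[str] = None
    images: List[str] = field(default_factory=list)  # all inline images
    reading_time_min: Optional[float] = None


def parse_article(raw: str) -> Article:
    """Parse one JSON object into an Article, ignoring unknown keys."""
    data = json.loads(raw)
    known = {k: v for k, v in data.items() if k in Article.__dataclass_fields__}
    return Article(**known)
```

Ignoring unknown keys keeps the parser tolerant if the real output carries extra fields.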

HOW IT WORKS

Three steps to structured data

01

Pass URL(s)

A single URL or a list of thousands. Also works on RSS feed entries.

02

Extraction runs

Fetches page, applies Mozilla Readability, strips nav/ads/scripts, returns structured text.

03

Use in your stack

JSON output is ready for OpenAI embeddings, Claude, Gemini, or any vector DB (Pinecone, Weaviate, Qdrant).
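
Before clean text goes to an embedding model or vector DB, it is commonly split into overlapping chunks. The sketch below shows one simple way to do that with the standard library only. The function name and the default sizes are assumptions for illustration, not part of the actor or of any vendor API.

```python
from typing import List


def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into word-based chunks that overlap by `overlap` words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this chunk has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk would then be embedded and stored alongside the article's metadata (for example the canonical URL), so retrieved passages can be traced back to their source.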

COMPARISON

Why this actor vs alternatives

Feature | This Actor | Firecrawl | Diffbot
Price per article | $0.005 | ~$0.006–$0.010 | ~$0.010
Extraction algorithm | Mozilla Readability | Proprietary + LLM | Proprietary ML
LLM-readable output | Clean markdown + JSON | Markdown | JSON
Metadata (author/date) | Yes | Yes | Yes
JavaScript-heavy sites | Yes (headless browser) | Yes | Yes
Free tier | Apify $5 trial credit | 500/mo free | Limited

FAQ

Frequently asked questions

What is Mozilla Readability?

An open-source algorithm that Firefox Reader Mode uses to extract clean article content. It is an industry standard and works on roughly 90% of editorial sites.

Does it handle paywalled articles?

It extracts whatever the page returns. For soft paywalls (teaser visible), you get the teaser. For hard paywalls, you get the login prompt.

What about JS-heavy sites?

An optional headless browser mode renders full JavaScript before extraction. It is slightly slower and adds about $0.002 per article.

Why use this vs Firecrawl?

Pricing: ~25% cheaper at scale. For LLM/RAG pipelines the cost difference compounds fast at 10k+ URLs/month.
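
To see how the difference adds up, the arithmetic can be sketched with the approximate per-article prices from the comparison table. The Firecrawl figures are the rough range quoted on this page, not verified rates, and the function names are hypothetical.

```python
from decimal import Decimal

# Approximate per-article prices taken from the comparison table above.
THIS_ACTOR = Decimal("0.005")
FIRECRAWL_LOW = Decimal("0.006")
FIRECRAWL_HIGH = Decimal("0.010")


def monthly_cost(urls: int, price_per_article: Decimal) -> Decimal:
    """Total cost in dollars for `urls` articles at a flat per-article price."""
    return urls * price_per_article


def savings_pct(ours: Decimal, theirs: Decimal) -> Decimal:
    """Percentage saved relative to the more expensive price, to 0.1%."""
    return ((theirs - ours) / theirs * 100).quantize(Decimal("0.1"))


urls = 10_000
ours = monthly_cost(urls, THIS_ACTOR)            # $50
theirs_low = monthly_cost(urls, FIRECRAWL_LOW)   # $60
theirs_high = monthly_cost(urls, FIRECRAWL_HIGH) # $100
```

At 10,000 URLs per month this works out to $50 versus $60 to $100, a saving of about 16.7% to 50% depending on which end of the quoted range applies. The ~25% figure falls inside that range.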

Can I batch 10,000 URLs?

Yes, there is no limit. Apify runs it in parallel with automatic retries. Typical throughput is 500–1000 articles/min per actor instance.
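
For planning, the quoted throughput translates into a rough run time. This is a back-of-the-envelope sketch that assumes throughput scales linearly with the number of instances, which may not hold in practice; the function name is hypothetical.

```python
import math


def minutes_to_process(urls: int, articles_per_min: int, instances: int = 1) -> int:
    """Estimate whole minutes needed, assuming steady linear throughput."""
    if urls < 0 or articles_per_min <= 0 or instances <= 0:
        raise ValueError("urls must be >= 0; rate and instances must be > 0")
    return math.ceil(urls / (articles_per_min * instances))


best_case = minutes_to_process(10_000, 1000)   # 10 minutes
worst_case = minutes_to_process(10_000, 500)   # 20 minutes
```

So a 10,000-URL batch on a single instance would take roughly 10 to 20 minutes at the typical rates quoted above.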

START NOW

Clean article text, ready for your LLM

Cut LLM token costs by about 90% by feeding clean content, not raw HTML.

Extract Articles Free