Article Extractor: Clean Article Text From Any URL (No Paywalls, No Boilerplate)
Direct Answer: What Does Article Extractor Do?
Article Extractor is an Apify actor that takes any article URL, strips away every piece of surrounding noise — navigation, ads, cookie banners, footers, related posts, social widgets — and returns only the clean main content along with structured metadata. The output is ready to pipe directly into an AI model, a content pipeline, or a research database without any additional cleaning step.
The actor is available at https://apify.com/tugelbay/article-extractor and runs on Apify’s Pay Per Event pricing at $1.50 per 1,000 extractions.
What Article Extractor Actually Does
Every news site, blog, and media outlet wraps its articles in layers of markup that have nothing to do with the content. A typical article page might be 80% navigation, ads, recommended reads, social buttons, footer links, and tracking scripts — with the actual article buried somewhere in the middle. If you feed that raw HTML into an AI model or try to store it in a database, you are storing garbage along with the content you actually want.
Article Extractor addresses exactly this problem. You give it a URL. It fetches the page, identifies the main content block using a readability algorithm, and returns the article reduced to its essential parts: headline, author, publication date, and the text of the article itself.
The output comes in two forms simultaneously — plain text for simple processing and clean markdown for cases where you need preserved structure (headers, bold, links, lists) without the surrounding HTML noise. This dual output means you do not have to make a choice upfront about how you will use the content.
How It Works: The Readability Algorithm
Article Extractor uses a readability algorithm modeled on the same approach Mozilla built into Firefox Reader Mode. If you have ever clicked the reader icon in Firefox and seen a cluttered news page transform into clean, readable text, you have seen this logic in action.
The algorithm scores each content block by density — how much text exists relative to links, how deep in the document tree it sits, and how it compares to other blocks on the page. High text density with few outbound links signals main content. Many links with repetitive patterns signals navigation or boilerplate.
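To make the density idea concrete, here is a minimal sketch of link-density scoring. It is not the actor's actual implementation: the `Block` type and the `link_density`, `score`, and `pick_main_block` functions are hypothetical names chosen for illustration. The sketch also ignores tree depth and uses only text length and the share of text inside links.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A candidate content block (hypothetical type for this sketch)."""
    text: str       # all visible text in the block
    link_text: str  # the part of that text that sits inside <a> elements


def link_density(block: Block) -> float:
    """Fraction of the block's characters that belong to links."""
    if not block.text:
        return 1.0  # treat empty blocks as pure boilerplate
    return len(block.link_text) / len(block.text)


def score(block: Block) -> float:
    """Toy score: reward long text, penalize link-heavy blocks."""
    return len(block.text) * (1.0 - link_density(block))


def pick_main_block(blocks: list[Block]) -> Block:
    """Return the block with the highest score."""
    return max(blocks, key=score)


nav = Block(text="Home News Sports Contact",
            link_text="Home News Sports Contact")
article = Block(
    text="The council voted on Tuesday to approve the new budget "
         "after a long debate.",
    link_text="new budget",
)
main = pick_main_block([nav, article])
```

In this toy input the navigation block consists entirely of link text, so its score is zero and the article paragraph wins. A production readability implementation weighs many more signals than this.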
Once the main content block is identified, the actor extracts raw text, converts it to clean markdown preserving headings and lists, pulls metadata from <head> (Open Graph, author schema, publication date signals), detects the article language, and calculates a word count. The result is a structured JSON object ready to use programmatically.
Output Fields
Every extraction returns a consistent set of fields:
| Field | Description |
|---|---|
| title | Article headline, pulled from the page title and validated against the content |
| author | Byline name, extracted from schema markup, meta tags, or common byline patterns |
| publishDate | Publication date in ISO 8601 format when detectable |
| text | Plain text body of the article, stripped of all markup |
| markdown | Article body converted to clean markdown with preserved structure |
| url | Canonical URL of the extracted page |
| siteName | Publisher name from Open Graph or schema markup |
| language | Two-letter language code detected from the article content |
| wordCount | Word count of the extracted text body |
The combination of text and markdown outputs covers the two most common downstream needs. Plain text works for embedding models and simple LLM prompts. Markdown works for display in interfaces that render it, or for preserving document structure when chunking long articles for RAG pipelines.
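As an illustration of the chunking use mentioned above, the following sketch splits the markdown field into heading-delimited chunks. The `chunk_markdown` function is a hypothetical helper, not part of the actor, and the sample item is invented for the example. It is deliberately naive: it would also split on lines starting with `#` inside a fenced code block.

```python
import re


def chunk_markdown(markdown: str) -> list[str]:
    """Split markdown into chunks, starting a new chunk at each ATX heading."""
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        # A line such as "## Section" closes the chunk collected so far.
        if re.match(r"#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that ended up empty after stripping whitespace.
    return [chunk for chunk in chunks if chunk]


# Invented sample shaped like one extraction result.
item = {"markdown": "# Title\n\nIntro.\n\n## Part one\n\nBody."}
chunks = chunk_markdown(item["markdown"])
```

Each chunk keeps its own heading, which gives a retrieval system a short label for the passage it returns.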
Use Cases
1. AI and LLM Pipelines
The most immediate application is feeding articles into language models. When you want Claude, GPT-4, or any other model to reason about a specific article, you need the text — not a URL the model cannot visit. Article Extractor gives you clean, structured text that fits within context windows without wasting tokens on navigation menus and cookie consent text.
This pairs naturally with RAG Web Browser for search-then-read pipelines: one actor finds relevant URLs, Article Extractor pulls the clean content, and the LLM generates a response grounded in actual current information.
2. Content Aggregation for Newsletters and Digests
Newsletter creators and content curators who monitor dozens of sources can automate the ingestion step — feed a list of URLs each morning, get back structured article objects ready to pass to a summarization model or template engine. The publishDate and author fields allow filtering for recency and correct attribution without parsing the original pages yourself.
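A recency filter over the publishDate field might look like the sketch below. The helper names `parse_publish_date` and `recent_articles` are hypothetical, and the sample items are invented. The sketch assumes the field holds an ISO 8601 string or is missing, and it treats timestamps without a timezone as UTC.

```python
from datetime import datetime, timedelta, timezone


def parse_publish_date(value):
    """Parse an ISO 8601 string; return None if missing or unparseable."""
    if not isinstance(value, str) or not value:
        return None
    try:
        # Older Python versions do not accept a trailing "Z", so map it to UTC.
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None
    if parsed.tzinfo is None:
        # Assumption of this sketch: naive timestamps are UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def recent_articles(items, now, max_age_days=1):
    """Keep items published within the last max_age_days before now."""
    cutoff = now - timedelta(days=max_age_days)
    kept = []
    for item in items:
        published = parse_publish_date(item.get("publishDate"))
        if published is not None and published >= cutoff:
            kept.append(item)
    return kept


now = datetime(2024, 5, 2, 12, 0, tzinfo=timezone.utc)
items = [
    {"title": "a", "publishDate": "2024-05-02T08:00:00Z"},
    {"title": "b", "publishDate": "2024-04-20T08:00:00Z"},
    {"title": "c", "publishDate": None},
]
fresh = recent_articles(items, now)
```

Items without a detectable date are dropped here. A digest that prefers completeness over strict recency could keep them and flag them for manual review instead.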
3. Academic and Market Research
Researchers analyzing large bodies of online text — tracking policy changes, monitoring media coverage of a topic, building citation corpora — face the same cleaning problem at scale. Article Extractor handles thousands of URLs in batch runs, returning a clean corpus that can be indexed, searched, or analyzed without a custom preprocessing pipeline.
4. Competitive Content Monitoring
Tracking what competitors publish is a standard marketing and strategy task, but doing it at scale requires automation. Article Extractor can run on a schedule against competitor blog URLs — surfacing new articles, their topics, word counts, and publication dates in a structured format that feeds directly into a content gap analysis or editorial calendar tool. The Apify platform makes scheduling these runs straightforward with no infrastructure to manage.
5. Training Data Collection for Machine Learning
Building text classifiers, summarization models, or fine-tuning datasets requires clean, labeled text at volume. Article Extractor provides exactly that: consistent structured output across thousands of sources with language detection already applied, making it practical to build large multilingual training sets without writing custom scrapers for each source.
Pricing
Article Extractor runs on Apify’s Pay Per Event model: $1.50 per 1,000 extractions.
At that rate:
- 1,000 articles: $1.50
- 10,000 articles: $15.00
- 100,000 articles: $150.00
There is no monthly minimum and no subscription required to start. You pay only for what you run. Apify offers a free tier with enough credits to test at small scale before committing to volume usage. For high-volume use cases — building training datasets, running daily aggregation pipelines — the cost per extraction is low enough to treat it as a commodity utility rather than a significant infrastructure cost.
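The arithmetic behind these figures can be sketched as a small helper. It assumes billing is strictly linear at the stated rate and that one URL counts as one extraction. The `extraction_cost` function is a hypothetical name, and the monthly volume is an invented example.

```python
from decimal import Decimal

PRICE_PER_1000 = Decimal("1.50")  # USD per 1,000 extractions, as stated above


def extraction_cost(n_extractions: int) -> Decimal:
    """Estimated cost in USD, rounded to cents (assumes linear pricing)."""
    return (PRICE_PER_1000 * n_extractions / 1000).quantize(Decimal("0.01"))


# Invented example: a daily pipeline of 500 articles over a 30-day month.
monthly_cost = extraction_cost(500 * 30)
```

`Decimal` is used instead of floats so that currency amounts round predictably.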
Article Extractor vs. Alternatives
Several tools solve similar problems. Here is how they compare:
| Tool | Pricing | Hosting | Notes |
|---|---|---|---|
| Article Extractor (Apify) | $1.50 / 1,000 | Managed cloud | Structured JSON output, batch runs, no infrastructure |
| Diffbot | $0.01–$0.05 / call | Managed cloud | More sophisticated ML extraction, much higher cost |
| Mercury Parser | Free | Self-hosted | Open source, no cloud option, requires your own infrastructure |
| Jina AI Reader | ~$0.02 / call | Managed cloud | Markdown-focused output, optimized for LLM use |
Article Extractor covers the sweet spot: managed cloud, structured output, and predictable pricing. Diffbot offers more sophisticated ML-based extraction but, at $0.01 to $0.05 per call against $0.0015 per extraction, costs roughly 7 to 33 times as much. Mercury Parser is free but self-hosted only. Jina AI Reader is optimized for LLM markdown output but costs roughly 13 times more per call.
For AI pipelines where cost efficiency matters at scale, Article Extractor is the practical default.
Limitations
Article Extractor works best on publicly accessible editorial content. There are three categories of pages where it will not return useful results:
Paywalled content. If a site requires a subscription login to read an article, Article Extractor will extract whatever the site shows to unauthenticated visitors — typically a truncated preview or a paywall prompt. It has no mechanism to authenticate to subscriber-only content.
Heavy anti-scraping protection. Some publishers actively block automated access using CAPTCHAs, fingerprinting, or JavaScript rendering requirements that go beyond standard page loading. Pages that detect and block headless browsers will return error states or gate pages rather than article content.
PDF articles. The readability algorithm operates on HTML DOM structure. Articles published as PDFs — common in academic publishing and some government sources — cannot be processed. For PDF content, a separate PDF extraction tool is required.
For most commercial and media publishing — news sites, blogs, trade publications, corporate content hubs — none of these limitations apply, and extraction works reliably at scale.
Getting Started
Open https://apify.com/tugelbay/article-extractor, click “Try for free,” and enter one or more article URLs. No credit card is required for initial testing. The actor returns structured JSON you can inspect immediately and then integrate via the Apify API or SDK clients for Python and Node.js.
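For API integration, a minimal sketch using only the Python standard library and Apify's synchronous run endpoint could look like this. The input field name `urls` is an assumption made for illustration; the actor's input schema on its Apify page is the authority on the real field names. `build_request` and `extract_articles` are hypothetical helper names, and `YOUR_APIFY_TOKEN` is a placeholder.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
# API paths use the "username~actor-name" form of the actor identifier.
ACTOR_ID = "tugelbay~article-extractor"


def build_request(urls, token):
    """Build (but do not send) a run-sync request for the actor."""
    endpoint = f"{API_BASE}/acts/{ACTOR_ID}/run-sync-get-dataset-items"
    query = urllib.parse.urlencode({"token": token})
    # ASSUMPTION: "urls" is a hypothetical input field name; check the
    # actor's input schema before relying on it.
    body = json.dumps({"urls": list(urls)}).encode("utf-8")
    return urllib.request.Request(
        f"{endpoint}?{query}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_articles(urls, token, timeout=300):
    """Run the actor and return the dataset items as parsed JSON."""
    request = build_request(urls, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the request needs no network access; sending it does.
request = build_request(["https://example.com/a"], "YOUR_APIFY_TOKEN")
```

Calling `extract_articles` with a real token sends the request and waits for the run to finish. For long batches, the asynchronous run endpoints or the official SDK clients are likely a better fit than a single blocking call.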
For scheduled runs — daily news aggregation, weekly competitor monitoring — Apify’s built-in scheduler handles the cron configuration without any external infrastructure. The Apify platform overview covers scheduling and API integration in detail.
At $1.50 per thousand extractions, Article Extractor replaces custom parser maintenance for every new source you add — and works on sources you have never seen before without any configuration.