AI & DATA
Crawl any site,
feed the clean text to your LLM
Point it at any URL. The crawler discovers every internal page, fetches each one, strips HTML noise, and returns clean content. Perfect for building RAG knowledge bases from entire sites.
$0.003
Per Page Crawled
Any Depth
Full Site or Subsection
Markdown
Clean LLM-Ready Output
JS
Rendered By Default
USE CASES
What you can do with this data
Build RAG from Docs Sites
Point at docs.yourproduct.com — get every doc page as clean markdown. Index in vector DB. Your chatbot now knows the product.
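A minimal sketch of the ingestion side: splitting crawled pages into overlapping chunks before embedding. The field names (`url`, `markdown`) and the chunk/overlap sizes are illustrative assumptions, not the actor's confirmed output schema.

```python
# Sketch: split crawled pages into overlapping chunks for embedding.
# Field names ("url", "markdown") are assumed output keys, not confirmed.

def chunk_page(page: dict, size: int = 800, overlap: int = 100) -> list[dict]:
    """Split one crawled page into chunks, keeping the source URL as metadata."""
    text = page["markdown"]
    chunks = []
    start = 0
    while start < len(text):
        piece = text[start:start + size]
        chunks.append({"source": page["url"], "text": piece})
        if start + size >= len(text):
            break
        start += size - overlap  # slide window, keeping `overlap` chars of context
    return chunks

pages = [{"url": "https://docs.example.com/intro", "markdown": "x" * 2000}]
all_chunks = [c for p in pages for c in chunk_page(p)]
```

Each chunk carries its source URL, so the chatbot can cite the doc page it answered from.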
Content Migration
Migrating your CMS? Crawl the old site, extract clean content and structure, then re-import into the new system with metadata preserved.
Competitor Content Audit
Crawl a competitor's site and extract every article and page. Analyze what they cover, find the gaps, and model your own content structure on what works.
Internal KB Ingestion
For AI agents: crawl your wiki, SharePoint, or Notion, turn it into flat markdown, and feed it to a vector DB for semantic search.
Site Quality Audit
Crawl your own site and score each page (word count, internal links, image count, canonical set). Find thin-content pages for pruning.
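The audit above can be sketched as a simple scoring pass over crawled pages. The field names and thresholds here are illustrative assumptions, not the actor's schema:

```python
# Sketch: flag pages on simple quality signals to find pruning candidates.
# Field names and thresholds are illustrative assumptions, not the actor's schema.

def quality_flags(page: dict) -> list[str]:
    """Return a list of quality problems for one crawled page."""
    flags = []
    if page.get("word_count", 0) < 300:
        flags.append("thin-content")
    if len(page.get("internal_links", [])) < 3:
        flags.append("poorly-linked")
    if not page.get("canonical_url"):
        flags.append("missing-canonical")
    return flags

page = {"word_count": 120, "internal_links": ["/a"], "canonical_url": None}
print(quality_flags(page))  # ['thin-content', 'poorly-linked', 'missing-canonical']
```

Run it over the bulk JSON export and sort by flag count to get a pruning worklist.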
LLM Agent Memory
For agentic apps: let the agent crawl a website once and remember all its content. Faster answers on follow-up queries about that site.
OUTPUT FIELDS
Data extracted per page
Clean markdown body
Page title
Meta description
Canonical URL
H1–H6 heading hierarchy
Internal links list
External links list
Image URLs + alt text
Open Graph metadata
JSON-LD structured data
Published / modified dates
Word count
HOW IT WORKS
Three steps to structured data
Set start URL + depth
Pass the root URL, a max page count, and optional include/exclude URL patterns.
Crawl
The actor discovers links, fetches pages in parallel, renders JavaScript, and extracts clean content.
Consume
Bulk JSON export for RAG ingestion, or webhook stream for real-time pipelines.
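The three steps map to a single run input. A hypothetical input object (the field names are assumptions for illustration; check the actor's input schema for the real ones):

```json
{
  "startUrl": "https://docs.example.com",
  "maxPages": 5000,
  "includePatterns": ["https://docs.example.com/docs/**"],
  "excludePatterns": ["**/blog/**"],
  "renderJs": true
}
```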
COMPARISON
Why this actor vs alternatives
| Feature | This Actor | Firecrawl | Apify official Web Scraper |
|---|---|---|---|
| Price per page | $0.003 | ~$0.004 | $0.005+ |
| Clean content extraction | Yes (built-in) | Yes | You write parser |
| JavaScript rendering | Yes by default | Yes | Optional |
| Recursive crawl | Yes, any depth | Yes | Yes |
| Include/exclude patterns | Glob + regex | Glob | Custom code |
| LLM output format | Markdown + JSON | Markdown | Raw HTML by default |
FAQ
Frequently asked questions
How deep can it crawl?
No hard limit. Typical runs cover 500–10,000 pages. A full 100k-page site takes a few hours.
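"Depth" here means link hops from the start URL. A minimal sketch of a depth-limited breadth-first crawl, with a stub link graph standing in for real fetches:

```python
# Sketch: depth-limited BFS over a link graph, illustrating how "depth"
# bounds a crawl. The graph is a stub standing in for real page fetches.
from collections import deque

def crawl_bfs(graph: dict, start: str, max_depth: int) -> set:
    """Return all URLs reachable from start within max_depth link hops."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # don't follow links from pages at the depth limit
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen

site = {"/": ["/docs", "/blog"], "/docs": ["/docs/a", "/docs/b"], "/docs/a": ["/docs/deep"]}
print(sorted(crawl_bfs(site, "/", 2)))  # ['/', '/blog', '/docs', '/docs/a', '/docs/b']
```

Page counts can grow fast with depth, which is why a `maxPages` cap matters more than a depth cap on large sites.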
Does it respect robots.txt?
Yes by default — add a config flag to ignore it for sites you own. Always confirm you have the rights to crawl the target.
What about JavaScript SPAs (React, Vue)?
Fully supported — the actor uses a headless browser that renders JS before extraction.
Why use this vs Firecrawl?
About 25% cheaper at scale. For large sites (10k+ pages) the cost difference becomes meaningful. Same LLM-ready output quality.
Can I crawl only a subsection?
Yes — pass URL patterns (e.g. include only /docs/*, exclude /blog/*). Both glob and regex are supported.
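The include/exclude behavior can be sketched with stdlib glob and regex matching. The parameter names are assumptions for illustration, not the actor's actual option names:

```python
# Sketch: URL filtering with glob and regex patterns, mirroring the
# include/exclude behavior described above. Parameter names are assumptions.
import fnmatch
import re

def should_crawl(url: str, include_globs=None, exclude_globs=None, exclude_regex=None) -> bool:
    """Return True if the URL passes the include list and no exclude rule."""
    if include_globs and not any(fnmatch.fnmatch(url, g) for g in include_globs):
        return False
    if exclude_globs and any(fnmatch.fnmatch(url, g) for g in exclude_globs):
        return False
    if exclude_regex and re.search(exclude_regex, url):
        return False
    return True

print(should_crawl("https://site.com/docs/api", include_globs=["*/docs/*"]))   # True
print(should_crawl("https://site.com/blog/post", include_globs=["*/docs/*"]))  # False
```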
START NOW
Turn any website into an LLM knowledge base
Crawl, clean, index — one command. Build your RAG knowledge from any public site.
Crawl a Site Free →