AI & DATA
Crawl any site,
feed the clean text to your LLM
Point it at any URL. The crawler discovers every internal page, fetches each one, strips HTML noise, and returns clean content. Perfect for building RAG knowledge bases from entire sites.
$0.003
Per Page Crawled
Any Depth
Full Site or Subsection
Markdown
Clean LLM-Ready Output
JS
Rendered By Default
USE CASES
What you can do with this data
Build RAG from Docs Sites
Point at docs.yourproduct.com — get every doc page as clean markdown. Index in vector DB. Your chatbot now knows the product.
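A minimal sketch of the ingestion side: splitting crawled pages into overlapping chunks before embedding. The field names (`url`, `markdown`) and the chunk/overlap sizes are illustrative assumptions, not the actor's confirmed output schema.

```python
# Sketch: split crawled pages into overlapping chunks for embedding.
# Field names ("url", "markdown") are assumed output keys, not confirmed.

def chunk_page(page: dict, size: int = 800, overlap: int = 100) -> list[dict]:
    """Split one crawled page into chunks, keeping the source URL as metadata."""
    text = page["markdown"]
    chunks = []
    start = 0
    while start < len(text):
        piece = text[start:start + size]
        chunks.append({"source": page["url"], "text": piece})
        if start + size >= len(text):
            break
        start += size - overlap  # slide window, keeping `overlap` chars of context
    return chunks

pages = [{"url": "https://docs.example.com/intro", "markdown": "x" * 2000}]
all_chunks = [c for p in pages for c in chunk_page(p)]
```

Each chunk carries its source URL, so the chatbot can cite the doc page it answered from.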
Content Migration
Migrating your CMS? Crawl the old site, extract clean content and structure, then re-import into the new system with metadata preserved.
Competitor Content Audit
Crawl a competitor's site and extract every article and page. Analyze what they cover, find the gaps, and model your own content structure on what works.
Internal KB Ingestion
For AI agents: crawl your wiki, SharePoint, or Notion, turn it into flat markdown, and feed it to a vector DB for semantic search.
Site Quality Audit
Crawl your own site and score each page (word count, internal links, image count, canonical set). Find thin-content pages for pruning.
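The audit above can be sketched as a simple scoring pass over crawled pages. The field names and thresholds here are illustrative assumptions, not the actor's schema:

```python
# Sketch: flag pages on simple quality signals to find pruning candidates.
# Field names and thresholds are illustrative assumptions, not the actor's schema.

def quality_flags(page: dict) -> list[str]:
    """Return a list of quality problems for one crawled page."""
    flags = []
    if page.get("word_count", 0) < 300:
        flags.append("thin-content")
    if len(page.get("internal_links", [])) < 3:
        flags.append("poorly-linked")
    if not page.get("canonical_url"):
        flags.append("missing-canonical")
    return flags

page = {"word_count": 120, "internal_links": ["/a"], "canonical_url": None}
print(quality_flags(page))  # ['thin-content', 'poorly-linked', 'missing-canonical']
```

Run it over the bulk JSON export and sort by flag count to get a pruning worklist.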
LLM Agent Memory
For agentic apps: let the agent crawl a website once and remember all its content. Faster answers on follow-up queries about that site.
OUTPUT FIELDS
Data extracted per page
Clean markdown body
Page title
Meta description
Canonical URL
H1–H6 heading hierarchy
Internal links list
External links list
Image URLs + alt text
Open Graph metadata
JSON-LD structured data
Published / modified dates
Word count
HOW IT WORKS
Three steps to structured data
Set start URL + depth
Pass the root URL, a max page count, and optional include/exclude URL patterns.
Crawl
The actor discovers links, fetches pages in parallel, renders JavaScript, and extracts clean content.
Consume
Bulk JSON export for RAG ingestion, or webhook stream for real-time pipelines.
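The three steps map to a single run input. A hypothetical input object (the field names are assumptions for illustration; check the actor's input schema for the real ones):

```json
{
  "startUrl": "https://docs.example.com",
  "maxPages": 5000,
  "includePatterns": ["https://docs.example.com/docs/**"],
  "excludePatterns": ["**/blog/**"],
  "renderJs": true
}
```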
COMPARISON
Why this actor vs alternatives
| Feature | This Actor | Firecrawl | Apify official Web Scraper |
|---|---|---|---|
| Price per page | $0.003 | ~$0.004 | $0.005+ |
| Clean content extraction | Yes (built-in) | Yes | You write parser |
| JavaScript rendering | Yes by default | Yes | Optional |
| Recursive crawl | Yes, any depth | Yes | Yes |
| Include/exclude patterns | Glob + regex | Glob | Custom code |
| LLM output format | Markdown + JSON | Markdown | Raw HTML by default |
FAQ
Frequently asked questions
How deep can it crawl?
No hard limit. Typical runs cover 500–10,000 pages. A full 100k-page site takes a few hours.
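"Depth" here means link hops from the start URL. A minimal sketch of a depth-limited breadth-first crawl, with a stub link graph standing in for real fetches:

```python
# Sketch: depth-limited BFS over a link graph, illustrating how "depth"
# bounds a crawl. The graph is a stub standing in for real page fetches.
from collections import deque

def crawl_bfs(graph: dict, start: str, max_depth: int) -> set:
    """Return all URLs reachable from start within max_depth link hops."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # don't follow links from pages at the depth limit
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen

site = {"/": ["/docs", "/blog"], "/docs": ["/docs/a", "/docs/b"], "/docs/a": ["/docs/deep"]}
print(sorted(crawl_bfs(site, "/", 2)))  # ['/', '/blog', '/docs', '/docs/a', '/docs/b']
```

Page counts can grow fast with depth, which is why a `maxPages` cap matters more than a depth cap on large sites.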
Does it respect robots.txt?
Yes by default — add a config flag to ignore it for sites you own. Always confirm you have the rights to crawl the target.
What about JavaScript SPAs (React, Vue)?
Fully supported — the actor uses a headless browser that renders JS before extraction.
Why use this vs Firecrawl?
About 25% cheaper at scale. For large sites (10k+ pages) the cost difference becomes meaningful. Same LLM-ready output quality.
Can I crawl only a subsection?
Yes — pass URL patterns (e.g. include only /docs/*, exclude /blog/*). Both glob and regex are supported.
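The include/exclude behavior can be sketched with stdlib glob and regex matching. The parameter names are assumptions for illustration, not the actor's actual option names:

```python
# Sketch: URL filtering with glob and regex patterns, mirroring the
# include/exclude behavior described above. Parameter names are assumptions.
import fnmatch
import re

def should_crawl(url: str, include_globs=None, exclude_globs=None, exclude_regex=None) -> bool:
    """Return True if the URL passes the include list and no exclude rule."""
    if include_globs and not any(fnmatch.fnmatch(url, g) for g in include_globs):
        return False
    if exclude_globs and any(fnmatch.fnmatch(url, g) for g in exclude_globs):
        return False
    if exclude_regex and re.search(exclude_regex, url):
        return False
    return True

print(should_crawl("https://site.com/docs/api", include_globs=["*/docs/*"]))   # True
print(should_crawl("https://site.com/blog/post", include_globs=["*/docs/*"]))  # False
```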
START NOW
Turn any website into an LLM knowledge base
Crawl, clean, index — one command. Build your RAG knowledge from any public site.
Crawl a Site Free →