Huginn
Self-hosted web scraping, crawling, and extraction API. Stealth-first, open source, no cloud tier,
no per-page tax. Named for Odin's raven, the one that flies out and brings back what it finds.
343 tests passing
8 core operations
$0 per page, forever
REST + CLI + streaming
Why I built it
Huginn started life as BlackCrawl, a stripped-down Blackreach focused on structured extraction. I needed
to scrape mythology texts for my corpus work, and Firecrawl wanted $0.005 per page. A 10,000-page crawl
is $50. Per crawl. I do this weekly. My electricity costs less than that.
The rename was a positioning call, not just a new label. Blackreach is the agent you delegate to: "go find me
state space model papers from 2024." Huginn is the API you call: "scrape this product page, give me
structured JSON with price, availability, specs." Same Playwright stealth backend underneath. Two
different jobs.
How a request flows
client ── REST call, CLI, or NDJSON/SSE stream
│
▼
api layer ── fastapi. scrape / crawl / map / extract /
│ research / watch / batch / stream
▼
stealth browser pool ── shared with blackreach.
│ navigator.webdriver hidden, real input timing
├─ robots.txt respected ── async wrapper + cache over
│ the stdlib parser (which is garbage, see readme)
▼
extractors
├─ markdown / html / links / screenshots / metadata
├─ llm-guided templates ── 10 built-in schemas → json
└─ research mode ── multi-hop with chromadb memory
▼
output ── json, ndjson stream, or webhook on change
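
The robots.txt step in the diagram is described as an async wrapper plus cache over the stdlib parser. A minimal sketch of that idea follows, not Huginn's actual code: `RobotsCache`, `_fetch_robots`, the one-hour TTL, and the "fetch failure means allow all" policy are all assumptions made for illustration. The only library pieces relied on are `urllib.robotparser.RobotFileParser` and `asyncio.to_thread`.

```python
import asyncio
import time
from urllib.parse import urlsplit
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser


def _fetch_robots(url: str) -> str:
    """Blocking fetch of a robots.txt body.

    Assumption: any network or HTTP failure is treated as an empty file,
    which the stdlib parser interprets as "everything allowed".
    """
    try:
        with urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        return ""


class RobotsCache:
    """Hypothetical async wrapper: one parsed robots.txt per origin, with a TTL."""

    def __init__(self, fetch=_fetch_robots, ttl=3600.0, user_agent="*"):
        self._fetch = fetch          # injectable so tests need no network
        self._ttl = ttl
        self._user_agent = user_agent
        self._entries = {}           # origin -> (expires_at, parser)
        self._locks = {}             # origin -> asyncio.Lock

    async def allowed(self, url: str) -> bool:
        parts = urlsplit(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        # One lock per origin so concurrent callers trigger a single fetch.
        lock = self._locks.setdefault(origin, asyncio.Lock())
        async with lock:
            entry = self._entries.get(origin)
            if entry is None or entry[0] <= time.monotonic():
                # The stdlib fetch is blocking, so run it off the event loop.
                body = await asyncio.to_thread(self._fetch, origin + "/robots.txt")
                parser = RobotFileParser()
                parser.parse(body.splitlines())
                entry = (time.monotonic() + self._ttl, parser)
                self._entries[origin] = entry
        return entry[1].can_fetch(self._user_agent, url)
```

Keying the cache by origin rather than by full URL means a crawl of one site pays for a single robots.txt fetch per TTL window, however many pages it visits.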
What it does
scrape
Any URL to Markdown, HTML, links, screenshots, metadata.
crawl
Whole sites recursively. Depth limits, dedup, robots.txt respect.
map
Site structure as a graph built by breadth-first search: nodes, edges, and sitemap-style URL lists.
extract
Structured data through LLM-guided templates. 10 built-in schemas.
watch
Page change detection with webhook notifications.
batch + stream
Hundreds of URLs concurrently, NDJSON or SSE in real time.
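
The crawl and map entries above both come down to a breadth-first walk with a depth limit and URL dedup. A rough sketch of that walk, under stated assumptions: `map_site`, `normalize`, and the `get_links` callback are hypothetical names, not Huginn's API, and the page fetching is stubbed out so the traversal logic can be read on its own. Staying on the starting host and dropping `#fragment` parts are also illustrative choices.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlsplit


def normalize(base: str, href: str) -> str:
    """Resolve href against base and drop the #fragment so duplicates collapse."""
    absolute, _fragment = urldefrag(urljoin(base, href))
    return absolute


def map_site(start: str, get_links, max_depth: int = 2):
    """Breadth-first walk from `start`.

    `get_links(url)` is a hypothetical callback returning the hrefs found on
    a page. Returns (nodes, edges): nodes in discovery order, edges as
    (source, target) pairs. Pages at `max_depth` are recorded but not expanded.
    """
    start = normalize(start, "")
    host = urlsplit(start).netloc
    nodes = [start]
    seen = {start}                      # dedup set
    edges = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:          # depth limit: do not expand further
            continue
        for href in get_links(url):
            target = normalize(url, href)
            if urlsplit(target).netloc != host:
                continue                # assumption: stay on the starting host
            edges.append((url, target))
            if target not in seen:
                seen.add(target)
                nodes.append(target)
                queue.append((target, depth + 1))
    return nodes, edges
```

In this sketch the `nodes` list doubles as the sitemap-style URL list, and `edges` is the graph. A real crawler would plug a fetch-and-parse step into `get_links` and check robots.txt before each fetch.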
Huginn is part of my browser automation cluster, one of my two real moats; the other is memory and
persistence. Blackreach is the agent, Huginn is the infrastructure you would actually deploy.
Open source on GitHub. The README keeps a running list of current pain points, because pretending a v1.2 has none would be weird.