
How Blackreach works and why I built it differently


This is the writeup on how it works and why it's built the way it is. For stats, the demo video, and install instructions, see the project page →

Every autonomous web agent I tried had the same problem. It worked on the demo site and fell apart on anything real. Cloudflare would catch it in seconds. JavaScript-rendered content was completely invisible to it. Rate limit responses came back as 200 OK with an error page in the body, and the agent would report success, save garbage, and move on. I'd find out hours later when I opened the file.

The specific task that pushed me to build this was downloading the full Linear A inscription corpus from sigla.phis.me. 847 inscriptions, each with photographs, sign annotations, and transliterations. No public API. Data scattered across hundreds of individual pages with JavaScript-rendered pagination. I tried three different frameworks. All of them died on page 2 when the pagination loaded dynamically. One of them saved 847 copies of the same 403 error page and reported 100% success. That's when I stopped looking for an existing solution. Blackreach is what I built instead.

The ReAct loop

At the core is a ReAct loop. The agent gets a task, reasons through what to do next, takes an action, observes what happened, and repeats. The loop isn't the interesting part. The observation is.

Thought: I need the inscription table on this page
Action: navigate("https://sigla.phis.me/")
Observation: Page loaded. Nav: [About, Database, Signs].
  Main: table, 847 rows, columns [ID, Site, Text, Image].
  Interactive: pagination controls, export button.
Thought: extract all rows and handle pagination
Action: extract_table(selector=".inscription-table", paginate=True)
...
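The loop itself is simple enough to sketch in a few lines. This is an illustrative skeleton, not Blackreach's actual API: `reason` and `act` stand in for the model call and the browser tools.

```python
# Minimal ReAct loop sketch. `reason` and `act` are hypothetical stand-ins:
# reason(task, history) -> (thought, action or None); act(action) -> observation.
from dataclasses import dataclass

@dataclass
class Step:
    thought: str
    action: str
    observation: str

def react_loop(task, reason, act, max_steps=10):
    history = []
    for _ in range(max_steps):
        thought, action = reason(task, history)
        if action is None:            # the model decided the task is done
            return history
        observation = act(action)     # execute the action, observe the result
        history.append(Step(thought, action, observation))
    return history
```

The whole transcript above is just iterations of this loop; everything interesting happens inside `act`, which is where the observation gets built.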

Most agents just dump raw HTML into the context. A typical page runs 50k to 500k tokens of noise and the model gets buried in it. Blackreach uses a DOM walker instead. It pulls out the semantic structure: visible text, interactive elements, nav landmarks, ARIA roles. A 200k token page becomes a 2k token observation the LLM can actually work with.

The DOM walker also assigns numeric IDs to every interactive element. The model clicks [15], not a CSS selector it has to guess or a brittle XPath. If the page structure changes between visits, which it often does, the IDs update automatically. The model doesn't need to know CSS. It just needs to know what's on the page right now.

Raw HTML dumps regularly exceed the context window of most models. Even when they fit, the noise-to-signal ratio makes reliable reasoning nearly impossible. The same page as a DOM walker observation fits in a couple thousand tokens. The model sees what a human would see sitting in front of a browser.
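A toy version of the walker fits in stdlib Python. The real thing handles visibility, ARIA roles, and landmarks; this sketch just shows the core move of collapsing a page into numbered interactive elements.

```python
# Toy DOM walker: collapse HTML into a list of numbered interactive elements.
# Illustrative only; the real walker also tracks visibility, landmarks, etc.
from html.parser import HTMLParser

INTERACTIVE = {"a", "button", "input", "select", "textarea"}

class Walker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.elements = []   # (numeric id, tag, label)
        self._stack = []

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)
        if tag in INTERACTIVE:
            label = dict(attrs).get("aria-label") or ""
            self.elements.append((len(self.elements), tag, label))

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        # attach visible text to the interactive element we're inside of
        if self.elements and self._stack and self._stack[-1] in INTERACTIVE:
            idx, tag, label = self.elements[-1]
            self.elements[-1] = (idx, tag, (label + data.strip()).strip())

def observe(html):
    w = Walker()
    w.feed(html)
    return [f"[{i}] <{t}> {lbl}" for i, t, lbl in w.elements]
```

Feeding it `<nav><a href="/about">About</a></nav><button>Export</button>` yields `[0] <a> About` and `[1] <button> Export`, which is exactly the shape the model acts on: click [1], not a selector.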

Stealth Playwright

Standard Playwright gets caught immediately. The tells are well documented: navigator.webdriver = true, missing browser extensions, CDP artifacts, headless viewport signatures, unnatural input timing. Any halfway serious anti-bot system checks for all of these at once.

But it goes deeper than the obvious ones. Real browsers have canvas fingerprints. They have WebGL renderer strings that match their reported GPU. They have plugins installed. Mouse movements follow curves with natural variance, not perfectly straight lines between coordinates. Keystroke timing varies the way human typing varies, not in perfectly consistent intervals.

Blackreach patches these at the browser level before any page loads. Mouse trajectories use bezier curves with jitter. Keystroke timing pulls from a distribution of real typing speeds. Viewport dimensions come from a lookup table of actual common screen sizes. WebGL strings match the reported user agent. JS injection clears the automation fingerprint before the first network request goes out.
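The mouse-trajectory part is the easiest to show in isolation. Here's a sketch of a cubic bezier path with jitter; the control-point ranges and jitter values are invented for the example, not Blackreach's tuned parameters.

```python
# Humanized mouse path sketch: points along a cubic bezier with random jitter,
# instead of a straight line. Parameter ranges here are illustrative.
import random

def bezier_path(start, end, steps=30, jitter=2.0, rng=None):
    rng = rng or random.Random()
    (x0, y0), (x3, y3) = start, end
    # random interior control points bow the curve the way a wrist does
    x1 = x0 + (x3 - x0) * rng.uniform(0.2, 0.4) + rng.uniform(-40, 40)
    y1 = y0 + (y3 - y0) * rng.uniform(0.2, 0.4) + rng.uniform(-40, 40)
    x2 = x0 + (x3 - x0) * rng.uniform(0.6, 0.8) + rng.uniform(-40, 40)
    y2 = y0 + (y3 - y0) * rng.uniform(0.6, 0.8) + rng.uniform(-40, 40)
    points = []
    for i in range(steps + 1):
        t = i / steps
        u = 1 - t
        x = u**3 * x0 + 3 * u**2 * t * x1 + 3 * u * t**2 * x2 + t**3 * x3
        y = u**3 * y0 + 3 * u**2 * t * y1 + 3 * u * t**2 * y2 + t**3 * y3
        points.append((x + rng.uniform(-jitter, jitter),
                       y + rng.uniform(-jitter, jitter)))
    points[0], points[-1] = start, end   # land exactly on the endpoints
    return points
```

Each point would then be fed to the browser's mouse-move with variable delays, so both the path and the timing carry natural variance.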

Not undetectable. Nothing is. But it passes Cloudflare's basic bot detection, most Akamai setups, and the IP-based rate limiters I ran into on academic databases. That's good enough for the research tasks I needed it for.

Session resume and memory

Research tasks take time. A full database download across hundreds of paginated pages can run for hours. Interruptions happen: the internet drops, the laptop sleeps, you need to stop and come back to it.

Blackreach auto-saves task state when interrupted. Every downloaded file, every visited URL, every decision point gets written to SQLite as it happens. When you resume, it picks up exactly where it left off. It knows what it already downloaded so it doesn't fetch the same file twice.
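The resume mechanism reduces to one idea: commit each completed download to SQLite the moment it finishes, and consult the table before fetching anything. A sketch, with table and column names invented for the example:

```python
# Resume-state sketch: record completed downloads immediately, skip them
# on the next run. Schema names are illustrative, not Blackreach's actual schema.
import sqlite3

def open_state(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS downloads (
                    url TEXT PRIMARY KEY, saved_to TEXT)""")
    return db

def already_done(db, url):
    row = db.execute("SELECT 1 FROM downloads WHERE url = ?", (url,)).fetchone()
    return row is not None

def record(db, url, saved_to):
    db.execute("INSERT OR IGNORE INTO downloads VALUES (?, ?)", (url, saved_to))
    db.commit()   # commit per file, so an interruption loses nothing
```

The per-file commit is the important design choice: it's slower than batching, but it means a crash mid-run costs you at most the file currently in flight.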

The cross-session memory goes further. Blackreach remembers what worked per domain. If it found that a particular site needs a 3-second wait before the pagination renders, that gets stored. Next session, it already knows. It gets better at the sites it visits repeatedly.

Why 2,904 tests

Autonomous agents fail silently. That's the thing that kept breaking my trust in them. The agent says it succeeded, the file is there, and you don't find out until you open it and it's a 403 error page saved as HTML.

Every test in the suite came from a real failure. Here's what a few of them actually represent:

- A rate limit returns 200 OK with a JSON body that says "success": true but the data field is empty. A naive agent saves the empty response and marks the task complete.
- The pagination button exists in the DOM but is inside a hidden div that only becomes visible after a 1.5 second JavaScript delay. Navigate too fast and you're clicking nothing.
- The login wall doesn't appear on the first 10 pages. It triggers on page 11 based on session age, and only from non-residential IPs. Running from home it never shows. Running from a VPS it's on every page.
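The first failure mode is worth showing concretely, because it's the one that burned me the most. A "fail loud" check looks something like this sketch; the exception name and field names are illustrative, not Blackreach's internals:

```python
# "Fail loud" response check sketch for the 200-OK-but-empty failure mode.
# Names (SilentFailure, "success", "data") are illustrative.
import json

class SilentFailure(Exception):
    pass

def validate_response(status, body):
    if status != 200:
        raise SilentFailure(f"HTTP {status}")
    try:
        payload = json.loads(body)
    except ValueError:
        # not JSON: check for an error page masquerading as a result
        if "403" in body or "rate limit" in body.lower():
            raise SilentFailure("error page returned with 200 OK")
        return body
    if payload.get("success") and not payload.get("data"):
        raise SilentFailure("success=true but data is empty")
    return payload
```

The point isn't this specific check. It's that every save path goes through a validator that would rather raise at 3am than write garbage to disk.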

These aren't edge cases I invented to pad a test count. They're sites I actually tried to scrape. Every test is something the internet threw at Blackreach and it had to learn to handle. When it's running at 3am collecting data I need it to fail loud. 2,904 tests is what that takes.


Open source at gitlab.com/null.phnix/blackreach. v5.0.0-beta.1. Issues and PRs welcome. Full feature breakdown and demo on the project page.
