An autonomous agent loop engine. Runs Claude Code sessions iteratively against real completion criteria until a task is either done or needs a human. Solves the "always in beta" problem.
AI coding tools are good at generating code. They're bad at knowing when they're done. You give an agent a task, it writes something, reports success, and you come back to find the tests still fail or the feature is half-implemented. The agent didn't lie. It just had no real way to verify its own work.
The standard approach is to keep a human in the loop at every step. That works, but it means you can't step away. You're not using an agent. You're supervising a fast typist.
Orchestrator wraps Claude Code in a loop. Instead of running once and reporting, it runs, checks real completion criteria, and runs again if they're not met. You define the criteria upfront; the agent keeps going until they pass.
```python
task = Task(
    goal="Add authentication to the API",
    criteria=[
        TestCriteria("pytest tests/test_auth.py"),
        FileCriteria("src/auth.py", exists=True),
        GrepCriteria("src/main.py", pattern="require_auth"),
    ],
    max_iterations=12,
    escalate_on_failure=True,
)
```
Each iteration: run Claude Code with the task and current state, observe the result, evaluate all criteria, stop if they pass, escalate if max iterations hit, repeat otherwise. That's the whole loop.
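The loop itself fits in a few lines. This is a minimal sketch, not the actual Orchestrator internals: `run_agent` stands in for a Claude Code session, and the criteria are assumed to expose a `check()` method returning a bool.

```python
from dataclasses import dataclass

@dataclass
class Task:
    goal: str
    criteria: list               # objects with a .check() -> bool method
    max_iterations: int = 10
    escalate_on_failure: bool = True

def run_loop(task, run_agent):
    """run_agent(goal, state) -> new state; stands in for a Claude Code session."""
    state = ""
    failed = list(task.criteria)
    for i in range(task.max_iterations):
        state = run_agent(task.goal, state)   # run the agent with current state
        failed = [c for c in task.criteria if not c.check()]
        if not failed:
            return ("done", i + 1, [])        # all criteria passed
    return ("escalate", task.max_iterations, failed)  # iteration limit hit
```

The terminal tuple carries which criteria are still failing, so an escalation can surface them instead of a bare "it didn't work."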
Three types of criteria cover most real tasks: TestCriteria runs a shell command and passes when it exits cleanly, FileCriteria checks that a file exists (or doesn't), and GrepCriteria checks that a pattern appears in a file.
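Illustrative versions of those three checks might look like this. These are sketches of the idea, not the real Orchestrator classes; the actual implementations may differ.

```python
import os
import re
import subprocess

class TestCriteria:
    """Passes when a shell command exits 0."""
    def __init__(self, command):
        self.command = command
    def check(self):
        return subprocess.run(self.command, shell=True).returncode == 0

class FileCriteria:
    """Passes when a file's existence matches the expectation."""
    def __init__(self, path, exists=True):
        self.path, self.exists = path, exists
    def check(self):
        return os.path.exists(self.path) == self.exists

class GrepCriteria:
    """Passes when a regex pattern appears in the file."""
    def __init__(self, path, pattern):
        self.path, self.pattern = path, pattern
    def check(self):
        try:
            with open(self.path) as f:
                return re.search(self.pattern, f.read()) is not None
        except FileNotFoundError:
            return False
```

All three reduce "done" to something a machine can verify, which is the whole point: the agent never gets to self-report success.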
Every task ends in one of two states. Done means all criteria passed. The output, diff, and a completion log get written to disk and the session closes.
Escalate means the task hit its iteration limit without passing. Orchestrator surfaces the current state, the last Claude output, and which criteria failed. You pick up from there. The agent did what it could. Now it's a human problem.
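A terminal result can be summarized for the human in a small record. This is a hypothetical shape; the field names are illustrative, not Orchestrator's actual output format.

```python
from dataclasses import dataclass

@dataclass
class Result:
    status: str             # "done" or "escalate"
    iterations: int
    failed_criteria: list   # criterion names; empty when status == "done"
    last_output: str        # last Claude output, for human pickup

def summarize(result):
    if result.status == "done":
        return f"done in {result.iterations} iteration(s)"
    names = ", ".join(result.failed_criteria)
    return f"escalated after {result.iterations} iteration(s): failing [{names}]"
```

The useful part of an escalation is the failing-criteria list: it tells the human where to start, not just that the agent gave up.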
Right now Orchestrator handles anything that has a clear test surface: adding features to Blackreach, building out the RAG pipeline for the mythology corpus, refactoring tasks that need to stay green on existing tests. Anything where I can write "done when X" and X is checkable.
The things it can't do yet are tasks with no clear success signal. "Make this code better" doesn't work. "Make this code pass these 40 tests" does. That's the constraint. It's also the discipline.