An autonomous agent loop engine. Runs Claude Code sessions iteratively against real completion criteria until a task is either done or needs a human. Solves the "always in beta" problem.
AI coding tools are good at generating code. They're bad at knowing when they're done. You give an agent a task, it writes something, reports success, and you come back to find the tests still fail or the feature is half-implemented. The agent didn't lie. It just had no real way to verify its own work.
The standard approach is to keep a human in the loop at every step. That works, but it means you can't step away. You're not using an agent. You're supervising a fast typist.
Orchestrator wraps Claude Code in a loop. Instead of running once and reporting, it runs, checks real completion criteria, and runs again if they're not met. You define the criteria upfront; the agent keeps going until they pass.
```python
task = Task(
    goal="Add authentication to the API",
    criteria=[
        TestCriteria("pytest tests/test_auth.py"),
        FileCriteria("src/auth.py", exists=True),
        GrepCriteria("src/main.py", pattern="require_auth"),
    ],
    max_iterations=12,
    escalate_on_failure=True,
)
```
Each iteration: run Claude Code with the task and current state, observe the result, evaluate all criteria, stop if they pass, escalate if max iterations hit, repeat otherwise. That's the whole loop.
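The loop itself fits in a few lines. This is a minimal sketch, not the actual Orchestrator internals: `run_agent` stands in for a Claude Code session, and the criteria are assumed to expose a `check()` method returning a bool.

```python
from dataclasses import dataclass

@dataclass
class Task:
    goal: str
    criteria: list               # objects with a .check() -> bool method
    max_iterations: int = 10
    escalate_on_failure: bool = True

def run_loop(task, run_agent):
    """run_agent(goal, state) -> new state; stands in for a Claude Code session."""
    state = ""
    failed = list(task.criteria)
    for i in range(task.max_iterations):
        state = run_agent(task.goal, state)   # run the agent with current state
        failed = [c for c in task.criteria if not c.check()]
        if not failed:
            return ("done", i + 1, [])        # all criteria passed
    return ("escalate", task.max_iterations, failed)  # iteration limit hit
```

The terminal tuple carries which criteria are still failing, so an escalation can surface them instead of a bare "it didn't work."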
Three types of criteria cover most real tasks: TestCriteria runs a shell command and passes when it exits cleanly, FileCriteria checks that a file exists (or doesn't), and GrepCriteria checks that a pattern appears in a file.
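Illustrative versions of those three checks might look like this. These are sketches of the idea, not the real Orchestrator classes; the actual implementations may differ.

```python
import os
import re
import subprocess

class TestCriteria:
    """Passes when a shell command exits 0."""
    def __init__(self, command):
        self.command = command
    def check(self):
        return subprocess.run(self.command, shell=True).returncode == 0

class FileCriteria:
    """Passes when a file's existence matches the expectation."""
    def __init__(self, path, exists=True):
        self.path, self.exists = path, exists
    def check(self):
        return os.path.exists(self.path) == self.exists

class GrepCriteria:
    """Passes when a regex pattern appears in the file."""
    def __init__(self, path, pattern):
        self.path, self.pattern = path, pattern
    def check(self):
        try:
            with open(self.path) as f:
                return re.search(self.pattern, f.read()) is not None
        except FileNotFoundError:
            return False
```

All three reduce "done" to something a machine can verify, which is the whole point: the agent never gets to self-report success.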
Every task ends in one of two states. Done means all criteria passed. The output, diff, and a completion log get written to disk and the session closes.
Escalate means the task hit its iteration limit without passing. Orchestrator surfaces the current state, the last Claude output, and which criteria failed. You pick up from there. The agent did what it could. Now it's a human problem.
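A terminal result can be summarized for the human in a small record. This is a hypothetical shape; the field names are illustrative, not Orchestrator's actual output format.

```python
from dataclasses import dataclass

@dataclass
class Result:
    status: str             # "done" or "escalate"
    iterations: int
    failed_criteria: list   # criterion names; empty when status == "done"
    last_output: str        # last Claude output, for human pickup

def summarize(result):
    if result.status == "done":
        return f"done in {result.iterations} iteration(s)"
    names = ", ".join(result.failed_criteria)
    return f"escalated after {result.iterations} iteration(s): failing [{names}]"
```

The useful part of an escalation is the failing-criteria list: it tells the human where to start, not just that the agent gave up.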
Right now Orchestrator handles anything that has a clear test surface: adding features to Blackreach, building out the RAG pipeline for the mythology corpus, refactoring tasks that need to stay green on existing tests. Anything where I can write "done when X" and X is checkable.
The things it can't do yet are tasks with no clear success signal. "Make this code better" doesn't work. "Make this code pass these 40 tests" does. That's the constraint. It's also the discipline.