DriftBench · measuring agent degradation

This is the project I care about most right now. A publishable paper, a usable infra tool, and a question nobody else is benchmarking, all in one repo.

The question nobody benchmarks

Agent benchmarks measure whether an agent can complete a task. They almost never measure what happens after step fifty. Anyone who has run an autonomous agent overnight knows the real failure mode is not incompetence. It is drift. The agent slowly forgets its objective. It starts contradicting decisions it made an hour ago. It gets twitchier, or quieter, or starts running riskier commands. By the time you check on it, "protect this file" has quietly become "delete this file."

I have watched this happen on my own machine. That is where the benchmark came from. Nobody was measuring it, and the failure was costing me real data.

How a run works

scenario.yaml ── task, criteria, probe schedule │ ▼ docker container ── clean room per run, no network, no systemd │ ├── setup commands ─── build the world the agent works in │ ├── agent loop ─────── the model under test runs the task │ │ │ ├─ checkpoints fork the context ── probes ask the fork, │ │ not the live agent. measuring it can't change it │ │ │ └─ every command logged, per step │ ├── verify commands ── every criterion checked by a real │ shell command. never agent self-report ▼ analyzer.py ── 4 metrics → Drift Severity 0-10 inversion or safety failure floors the score high

The four metrics

Objective Fidelity. Keyword recall on forked-context checkpoints, plus an inversion detector that catches when "protect X" flips to "remove X." The detector parses the nearest governing verb with negation and protection-verb analysis. 8/8 unit cases pass.
Decision Consistency. Paired probes. Same question at different depths of the run, scored on outcome agreement and command-strategy overlap.
Behavioral Drift. First half versus second half deltas: response length, command rate, genuinely risky commands, no-ops.
Outcome. Did the work actually get done, verified in the container. An inversion or a safety-criterion failure floors the severity high no matter how good everything else looks.

Validated, not aspirational

The harness is on its second rewrite. v2 is step-based and runs setup, probe, and verify commands in Docker through the CLI, no docker-SDK dependency. End to end against real containers:

A competent mock agent scores 0.0/10. Stable, no drift, as it should.
A drifting agent that inverts its objective and deletes the file it was told to protect scores 8.0/10 SEVERE, and the report diagnoses exactly why.

Three scenarios are converted to v2 and oracle-validated, meaning the known-good solution passes every criterion inside a container before any model gets graded against it. The README is deliberately honest. Earlier versions had aspirational claims about run counts and hardware. Those got deleted the moment I could not back them.

Where it goes from here

Next step is a GPU pod and the real multi-model matrix, qwen2.5-coder:32b and friends, to compute an actual leaderboard. Then grow to ten scenarios across more task families, add an LLM judge as a secondary scorer for objective fidelity, and write the paper. It is publishable solo. No lab needed.

// status

Active and under heavy local development. The methodology doc is the source of truth and the backbone of the paper. If you work on agent evaluation or long-horizon reliability, reach out.