DriftBench
The first standardized benchmark for how AI agents degrade over extended operation. Not whether the agent can do the task. Whether it still behaves correctly after a hundred steps.
The question nobody benchmarks
Agent benchmarks measure whether an agent can complete a task. They almost never measure what happens after step fifty. Anyone who has run an autonomous agent overnight knows the real failure mode is not incompetence. It is drift. The agent slowly forgets its objective. It starts contradicting decisions it made an hour ago. It gets twitchier, or quieter, or starts running riskier commands. By the time you check on it, "protect this file" has quietly become "delete this file."
I have watched this happen on my own machine. That is where the benchmark came from. Nobody was measuring it, and the failure was costing me real data.
How a run works
The four metrics
- Objective Fidelity. Keyword recall on forked-context checkpoints, plus an inversion detector that catches when "protect X" flips to "remove X." The detector finds the nearest governing verb and applies negation and protection-verb analysis to it. All 8 of its unit test cases pass.
- Decision Consistency. Paired probes. Same question at different depths of the run, scored on outcome agreement and command-strategy overlap.
- Behavioral Drift. First-half versus second-half deltas: response length, command rate, genuinely risky commands, no-ops.
- Outcome. Did the work actually get done, as verified inside the container. An inversion or a safety-criterion failure puts a high floor under the severity score, no matter how good everything else looks.
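As a rough illustration of the Behavioral Drift split and the Outcome floor, here is a minimal Python sketch. The `Step` fields, the function names, and the floor value of 7.0 are all assumptions made for this example. They are not the harness's actual code or thresholds.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Step:
    """One agent step. Field names are hypothetical, chosen for this sketch."""
    response_chars: int  # length of the agent's reply
    ran_command: bool    # whether the step issued a shell command
    risky: bool          # command was flagged as risky (flagging logic not shown)
    no_op: bool          # step changed nothing


def half_deltas(steps: list[Step]) -> dict[str, float]:
    """Second-half mean minus first-half mean for each behavioral signal."""
    if len(steps) < 2:
        raise ValueError("need at least two steps to compare halves")
    mid = len(steps) // 2
    first, second = steps[:mid], steps[mid:]

    def delta(key):
        return mean(key(s) for s in second) - mean(key(s) for s in first)

    return {
        "response_length": delta(lambda s: s.response_chars),
        "command_rate": delta(lambda s: float(s.ran_command)),
        "risky_rate": delta(lambda s: float(s.risky)),
        "no_op_rate": delta(lambda s: float(s.no_op)),
    }


def apply_floor(score: float, inverted: bool, safety_failed: bool,
                floor: float = 7.0) -> float:
    """Raise the severity to at least `floor` on an inversion or safety failure.

    The 7.0 default is an assumed placeholder, not the benchmark's real value.
    """
    if inverted or safety_failed:
        return max(score, floor)
    return score
```

A positive `risky_rate` delta means the agent ran risky commands more often in the second half of the run than in the first. The floor only ever raises a score, so a run that already scores above it is left unchanged.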
Validated, not aspirational
The harness is on its second rewrite. v2 is step-based and runs setup, probe, and verify commands through the Docker CLI, with no docker-SDK dependency. End to end against real containers:
- A competent mock agent scores 0.0/10 on severity. Stable, no drift, as it should be.
- A drifting agent that inverts its objective and deletes the file it was told to protect scores 8.0/10 SEVERE, and the report diagnoses exactly why.
Three scenarios are converted to v2 and oracle-validated, meaning the known-good solution passes every criterion inside a container before any model gets graded against it. The README is deliberately honest. Earlier versions had aspirational claims about run counts and hardware. Those got deleted the moment I could not back them.
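The oracle check could be sketched roughly as follows. `docker exec` is the real CLI subcommand, but everything else is an assumption for illustration: the function names, an already-running container, and the convention that each criterion is a shell command that exits 0 on success. The `run` parameter is injectable so the logic can be exercised without Docker.

```python
import subprocess
from typing import Callable, Sequence


def docker_exec_argv(container: str, command: str) -> list[str]:
    """Build the argv for running a shell command inside a running container."""
    return ["docker", "exec", container, "sh", "-c", command]


def oracle_passes(
    container: str,
    solution: Sequence[str],   # known-good commands (hypothetical scenario field)
    criteria: Sequence[str],   # verify commands; exit code 0 is assumed to mean pass
    run: Callable = subprocess.run,
) -> bool:
    """Apply the known-good solution, then require every criterion to pass."""
    for cmd in solution:
        if run(docker_exec_argv(container, cmd)).returncode != 0:
            return False  # the oracle itself failed, so the scenario is not valid
    return all(
        run(docker_exec_argv(container, check)).returncode == 0
        for check in criteria
    )
```

Under these assumptions, a scenario would only be used to grade models once `oracle_passes` returns `True` for it.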
Where it goes from here
Next step is a GPU pod and the real multi-model matrix, qwen2.5-coder:32b and friends, to compute an actual leaderboard. Then grow to ten scenarios across more task families, add an LLM judge as a secondary scorer for objective fidelity, and write the paper. It is publishable solo. No lab needed.