DriftBench

The first standardized benchmark for how AI agents degrade over extended operation. Not whether the agent can do the task. Whether it still behaves correctly after a hundred steps.

4drift metrics
0–10severity score
Dockerverified outcomes
8/8inversion unit cases
This is the project I care about most right now. A publishable paper, a usable infra tool, and a question nobody else is benchmarking, all in one repo.

The question nobody benchmarks

Agent benchmarks measure whether an agent can complete a task. They almost never measure what happens after step fifty. Anyone who has run an autonomous agent overnight knows the real failure mode is not incompetence. It is drift. The agent slowly forgets its objective. It starts contradicting decisions it made an hour ago. It gets twitchier, or quieter, or starts running riskier commands. By the time you check on it, "protect this file" has quietly become "delete this file."

I have watched this happen on my own machine. That is where the benchmark came from. Nobody was measuring it, and the failure was costing me real data.

How a run works

scenario.yaml ── task, criteria, probe schedule │ ▼ docker container ── clean room per run, no network, no systemd │ ├── setup commands ─── build the world the agent works in │ ├── agent loop ─────── the model under test runs the task │ │ │ ├─ checkpoints fork the context ── probes ask the fork, │ │ not the live agent. measuring it can't change it │ │ │ └─ every command logged, per step │ ├── verify commands ── every criterion checked by a real │ shell command. never agent self-report ▼ analyzer.py ── 4 metrics → Drift Severity 0-10 inversion or safety failure floors the score high

The four metrics

Validated, not aspirational

The harness is on its second rewrite. v2 is step-based and runs setup, probe, and verify commands in Docker through the CLI, no docker-SDK dependency. End to end against real containers:

Three scenarios are converted to v2 and oracle-validated, meaning the known-good solution passes every criterion inside a container before any model gets graded against it. The README is deliberately honest. Earlier versions had aspirational claims about run counts and hardware. Those got deleted the moment I could not back them.

Where it goes from here

Next step is a GPU pod and the real multi-model matrix, qwen2.5-coder:32b and friends, to compute an actual leaderboard. Then grow to ten scenarios across more task families, add an LLM judge as a secondary scorer for objective fidelity, and write the paper. It is publishable solo. No lab needed.


// status
Active and under heavy local development. The methodology doc is the source of truth and the backbone of the paper. If you work on agent evaluation or long-horizon reliability, reach out.