Lethe · measuring agent degradation

This is the project I care about most right now. A publishable paper, a usable infra tool, and a question nobody else is benchmarking, all in one repo.

The question nobody benchmarks

Agent benchmarks measure whether an agent can complete a task. They almost never measure what happens after step fifty. Anyone who has run an autonomous agent overnight knows the real failure mode is not incompetence. It is drift. The agent slowly forgets its objective. It starts contradicting decisions it made an hour ago. It gets twitchier, or quieter, or starts running riskier commands. By the time you check on it, "protect this file" has quietly become "delete this file."

I have watched this happen on my own machine. That is where the benchmark came from. Nobody was measuring it, and the failure was costing me real data.

How a run works

scenario.yaml ── task, criteria, probe schedule │ ▼ docker container ── clean room per run, no network, no systemd │ ├── setup commands ─── build the world the agent works in │ ├── agent loop ─────── the model under test runs the task │ │ │ ├─ checkpoints fork the context ── probes ask the fork, │ │ not the live agent. measuring it can't change it │ │ │ └─ every command logged, per step │ ├── verify commands ── every criterion checked by a real │ shell command. never agent self-report ▼ analyzer.py ── 4 metrics → Drift Severity 0-10 inversion or safety failure floors the score high

The four metrics

Objective Fidelity. Keyword recall on forked-context checkpoints, plus an inversion detector that catches when "protect X" flips to "remove X." The detector parses the nearest governing verb with negation and protection-verb analysis. 8/8 unit cases pass.
Decision Consistency. Paired probes. Same question at different depths of the run, scored on outcome agreement and command-strategy overlap.
Behavioral Drift. First half versus second half deltas: response length, command rate, genuinely risky commands, no-ops.
Outcome. Did the work actually get done, verified in the container. An inversion or a safety-criterion failure floors the severity high no matter how good everything else looks.

Validated, not aspirational

The harness is on its second rewrite. v2 is step-based and runs setup, probe, and verify commands in Docker through the CLI, no docker-SDK dependency. End to end against real containers:

A competent mock agent scores 0.0/10. Stable, no drift, as it should.
A drifting agent that inverts its objective and deletes the file it was told to protect scores 8.0/10 SEVERE, and the report diagnoses exactly why.

Three scenarios are converted to v2 and oracle-validated, meaning the known-good solution passes every criterion inside a container before any model gets graded against it. The README is deliberately honest. Earlier versions had aspirational claims about run counts and hardware. Those got deleted the moment I could not back them.

Where it goes from here

The matrix ran: 120 runs across six models on held-out seeds, all on free tiers. GLM-5.2, Gemma 4, and MiniMax-M3 hold steady; the coding-marketed models drift hardest: kimi-k2.7-code hit critical failures in 44% of its runs. Seventeen oracle-validated scenarios across five suites, an 11-axis drift taxonomy, and a paper draft, all in the repo.

// status

Open source: github.com/Null-Phnix/lethebench. The harness, all 17 scenarios, the leaderboards, and the paper draft. Formerly DriftBench. If you work on agent evaluation or long-horizon reliability, reach out.