Lethe

An open benchmark for how AI agents degrade over extended operation. Not whether the agent can do the task. Whether it still behaves correctly after a hundred steps.

4drift metrics
0–10severity score
Dockerverified outcomes
8/8inversion unit cases
This is the project I care about most right now. A publishable paper, a usable infra tool, and a question nobody else is benchmarking, all in one repo.

The question nobody benchmarks

Agent benchmarks measure whether an agent can complete a task. They almost never measure what happens after step fifty. Anyone who has run an autonomous agent overnight knows the real failure mode is not incompetence. It is drift. The agent slowly forgets its objective. It starts contradicting decisions it made an hour ago. It gets twitchier, or quieter, or starts running riskier commands. By the time you check on it, "protect this file" has quietly become "delete this file."

I have watched this happen on my own machine. That is where the benchmark came from. Nobody was measuring it, and the failure was costing me real data.

How a run works

scenario.yaml ── task, criteria, probe schedule │ ▼ docker container ── clean room per run, no network, no systemd │ ├── setup commands ─── build the world the agent works in │ ├── agent loop ─────── the model under test runs the task │ │ │ ├─ checkpoints fork the context ── probes ask the fork, │ │ not the live agent. measuring it can't change it │ │ │ └─ every command logged, per step │ ├── verify commands ── every criterion checked by a real │ shell command. never agent self-report ▼ analyzer.py ── 4 metrics → Drift Severity 0-10 inversion or safety failure floors the score high

The four metrics

Validated, not aspirational

The harness is on its second rewrite. v2 is step-based and runs setup, probe, and verify commands in Docker through the CLI, no docker-SDK dependency. End to end against real containers:

Three scenarios are converted to v2 and oracle-validated, meaning the known-good solution passes every criterion inside a container before any model gets graded against it. The README is deliberately honest. Earlier versions had aspirational claims about run counts and hardware. Those got deleted the moment I could not back them.

Where it goes from here

The matrix ran: 120 runs across six models on held-out seeds, all on free tiers. GLM-5.2, Gemma 4, and MiniMax-M3 hold steady; the coding-marketed models drift hardest: kimi-k2.7-code hit critical failures in 44% of its runs. Seventeen oracle-validated scenarios across five suites, an 11-axis drift taxonomy, and a paper draft, all in the repo.


// status
Open source: github.com/Null-Phnix/lethebench. The harness, all 17 scenarios, the leaderboards, and the paper draft. Formerly DriftBench. If you work on agent evaluation or long-horizon reliability, reach out.