Measuring whether your AI is getting better or worse

The tooling here became Rigr, and the instinct behind it grew into Lethe. This is the story of why I care about agent measurement at all.

Here's a question most teams shipping AI agents can't answer: is v2 actually better than v1? Not "does it feel better in the demo." Better, measurably, on the thing it's supposed to do. You changed the model, tweaked a prompt, swapped the retriever. Did anything regress? Usually nobody knows, because nobody froze a baseline to compare against.

The existing eval tools didn't help. They measure chat quality: whether the model sounds helpful, whether the prose is fluent. That's the wrong axis. I didn't care whether my agent sounded good. I cared whether it still calculated the refund correctly after I upgraded the model underneath it. Those are completely different questions, and the second one is the one that gets you paged at 2am.

What "measuring" actually means

The approach that worked was boring in the best way. Define what the agent must output as a structured schema. Write test cases, real inputs with expected outputs, version-controlled so they're reviewable in a PR. Run the agent against them and freeze the known-good results as a baseline. Every future run compares against that frozen baseline: new errors get flagged before deployment, and errors you previously fixed get tracked so they can't silently come back.

It's regression testing, just pointed at a non-deterministic system. The trick is treating the agent's behaviour as something you snapshot and diff, not something you eyeball.
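To make the loop concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the RefundDecision schema, the two test cases, the stand-in agents and the function names are assumptions, not Rigr's actual API. It only shows the shape of the idea: structured output, version-controlled cases, a frozen baseline, and a diff that repeats each case a few times because the system is non-deterministic.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RefundDecision:
    """Hypothetical structured output the agent must produce."""
    approved: bool
    amount_cents: int


# Hypothetical test cases: real-looking inputs, kept in version control.
CASES = {
    "full_refund_within_window": {"order_total_cents": 4999, "days_since_purchase": 3},
    "no_refund_after_window": {"order_total_cents": 4999, "days_since_purchase": 45},
}


def freeze_baseline(agent, cases):
    """Run the known-good agent once per case and record its outputs."""
    return {name: asdict(agent(inputs)) for name, inputs in cases.items()}


def compare_to_baseline(agent, cases, baseline, runs=3):
    """Return the cases whose output differs from the baseline in any run."""
    regressions = {}
    for name, inputs in cases.items():
        for _ in range(runs):  # repeat: one passing run proves little
            got = asdict(agent(inputs))
            if got != baseline[name]:
                regressions[name] = {"expected": baseline[name], "got": got}
                break
    return regressions


def agent_v1(inputs):
    """Stand-in for the known-good agent: 30-day refund window."""
    ok = inputs["days_since_purchase"] <= 30
    return RefundDecision(ok, inputs["order_total_cents"] if ok else 0)


def agent_v2(inputs):
    """Stand-in for a changed agent that silently widened the window."""
    ok = inputs["days_since_purchase"] <= 60
    return RefundDecision(ok, inputs["order_total_cents"] if ok else 0)


baseline = freeze_baseline(agent_v1, CASES)
# The frozen baseline is plain JSON, so it can be committed and reviewed in a PR.
baseline_json = json.dumps(baseline, indent=2, sort_keys=True)

regressions = compare_to_baseline(agent_v2, CASES, baseline)
print(sorted(regressions))
```

In this toy setup the changed agent approves a refund 45 days after purchase, so only that case is reported. A real agent would call a model inside agent_v1 and agent_v2, and exact equality on the structured output might need to give way to a tolerance on some fields.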

The thing measurement taught me

Once I could measure between versions, I noticed a second problem that version-to-version testing couldn't catch: agents don't only regress when you change them. They regress within a single run. A long-horizon agent that's perfectly fine at step five is a different animal at step eighty. By then it may have lost the thread of its objective, started contradicting earlier decisions, or drifted into riskier behaviour.
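One rough way to put a number on this is to score every step of a single run and compare the start of the run with the end. The sketch below assumes you already have a per-step score between 0 and 1 (for example, 1 when a step is consistent with the objective). The scoring itself, the window size and the function name are all assumptions made for illustration, and this is not how Lethe works.

```python
from statistics import mean


def within_run_drift(step_scores, window=10):
    """Mean score of the first `window` steps minus the mean of the last `window`.

    A positive value means the run scored worse at the end than at the start.
    """
    if len(step_scores) < 2 * window:
        raise ValueError("run is too short to compare two separate windows")
    early = mean(step_scores[:window])
    late = mean(step_scores[-window:])
    return early - late


# Hypothetical 80-step run: solid early, shaky in the middle, failing late.
degrading_run = [1.0] * 10 + [0.5] * 60 + [0.0] * 10
steady_run = [1.0] * 80

print(within_run_drift(degrading_run))
print(within_run_drift(steady_run))
```

A single early-versus-late difference is crude. It says that a run degraded, not where or why, but it is enough to show that the question is about one trajectory rather than two versions.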

Rigr catches regressions between versions. But measuring degradation inside one long run is a different benchmark entirely. That's the question that turned into Lethe, the thing I care most about right now. Same instinct, two time scales: most agent failures happen in places nobody is looking, and the fix starts with deciding to look.

The unglamorous truth of agent engineering: the hard part isn't making the agent work once. It's knowing, with evidence, that it still works. After the model swap, after the prompt change, after the hundredth step.

Rigr is open source on GitHub. The drift benchmark lives here.
