Rigr · agent evaluation & regression testing

pip install rigr
rigr init && rigr test

The problem

You have agents in production. Every model update, prompt change, or retrieval tweak can break them silently. The existing eval tools test whether the model sounds good. That is the wrong axis. I do not care if my agent sounds helpful. I care whether it still calculates the refund correctly after the model swap. Different question, and the one that pages you at 2am.

How it works

rigr init ──── rigr.yaml + test_cases/
    │          structured schema for what the agent must output
    ▼
rigr test ──── run real inputs against your agent
    │          expected outputs live in version control,
    │          reviewable in a PR like any other code
    ▼
rigr freeze ── lock the known-good results as a baseline
    ▼
rigr compare ─ every future run diffs against the freeze
    ├─ new errors → flagged before deploy
    ├─ fixed errors → tracked so they can't sneak back
    └─ per-field accuracy + changelog → audit-ready report

It is regression testing pointed at a non-deterministic system. The trick is treating agent behavior as something you snapshot and diff instead of something you eyeball in a demo.
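To make the snapshot-and-diff idea concrete, here is a minimal sketch in Python. It is not Rigr's actual implementation or API: the `compare` function, the `Comparison` type, and the sample refund cases are all hypothetical, and exact equality per field is an assumption made for illustration. It shows how a frozen baseline and a new run can each be checked against the expected outputs to separate new errors from fixed ones and to compute per-field accuracy.

```python
from dataclasses import dataclass, field

# Hypothetical data shape: case id -> {field name: value}.
Results = dict[str, dict[str, object]]

_MISSING = object()  # sentinel so an absent field never equals an expected value


@dataclass
class Comparison:
    new_errors: list[tuple[str, str]] = field(default_factory=list)
    fixed_errors: list[tuple[str, str]] = field(default_factory=list)
    field_accuracy: dict[str, float] = field(default_factory=dict)


def compare(expected: Results, frozen: Results, current: Results) -> Comparison:
    """Diff a current run against a frozen baseline, field by field."""
    report = Comparison()
    hits: dict[str, int] = {}
    totals: dict[str, int] = {}
    for case_id, want in expected.items():
        for name, value in want.items():
            was_ok = frozen.get(case_id, {}).get(name, _MISSING) == value
            is_ok = current.get(case_id, {}).get(name, _MISSING) == value
            totals[name] = totals.get(name, 0) + 1
            hits[name] = hits.get(name, 0) + int(is_ok)
            if was_ok and not is_ok:
                # Correct in the baseline, wrong now: a regression.
                report.new_errors.append((case_id, name))
            elif is_ok and not was_ok:
                # Wrong in the baseline, correct now: record it so it stays fixed.
                report.fixed_errors.append((case_id, name))
    report.field_accuracy = {n: hits[n] / totals[n] for n in totals}
    return report


# Made-up example: a model swap breaks one refund amount and fixes another.
expected = {
    "refund-1": {"amount": 42.50, "currency": "USD"},
    "refund-2": {"amount": 10.00, "currency": "EUR"},
}
frozen = {
    "refund-1": {"amount": 42.50, "currency": "USD"},
    "refund-2": {"amount": 12.00, "currency": "EUR"},
}
current = {
    "refund-1": {"amount": 40.00, "currency": "USD"},
    "refund-2": {"amount": 10.00, "currency": "EUR"},
}

report = compare(expected, frozen, current)
print(report.new_errors)      # [('refund-1', 'amount')]
print(report.fixed_errors)    # [('refund-2', 'amount')]
print(report.field_accuracy)  # {'amount': 0.5, 'currency': 1.0}
```

Comparing structured fields rather than free text is what makes the diff stable against a non-deterministic agent: the wording can vary between runs, but the refund amount either matches or it does not.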

Rigr is the sibling of Lethe. Rigr catches regressions between versions. Lethe measures degradation inside a single long run. Same instinct, two time scales: agents fail where nobody is measuring.

Open source on GitHub. Backstory: measuring whether your AI is getting better or worse →