Agent evaluation framework. Define what your agent should do, freeze a baseline, and catch regressions before they hit production. Open source.
A CLI tool that tests whether your AI agent actually does what it is supposed to do. You write test cases, each an input and an expected output. Rigr runs them against your agent and tells you what passed and what failed. Freeze a baseline, and from that point on every run is compared against it. New failures get flagged. Fixed failures get marked resolved.
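The baseline comparison can be sketched in a few lines of Python. This illustrates the idea only and is not Rigr's actual implementation: the function name `compare_to_baseline`, the case IDs, and the dict-of-booleans result format are all assumptions made for the example.

```python
from typing import Dict, List


def compare_to_baseline(
    baseline: Dict[str, bool], current: Dict[str, bool]
) -> Dict[str, List[str]]:
    """Classify each case in the current run against a frozen baseline.

    Both arguments map a case ID to True (passed) or False (failed).
    This result format is hypothetical, chosen for illustration.
    """
    report: Dict[str, List[str]] = {
        "new_failures": [],
        "resolved": [],
        "still_failing": [],
    }
    for case_id in sorted(current):
        passed_now = current[case_id]
        # A case absent from the baseline is treated as "not failing before".
        failed_before = baseline.get(case_id) is False
        if not passed_now and not failed_before:
            report["new_failures"].append(case_id)
        elif passed_now and failed_before:
            report["resolved"].append(case_id)
        elif not passed_now and failed_before:
            report["still_failing"].append(case_id)
    return report


# Hypothetical results: one regression, one fix, one unchanged pass.
baseline = {"refund_basic": True, "refund_partial": False, "greeting": True}
current = {"refund_basic": False, "refund_partial": True, "greeting": True}
report = compare_to_baseline(baseline, current)
print(report)
```

In this sketch a case that fails now but was passing (or did not exist) when the baseline was frozen counts as a new failure. How Rigr itself treats cases added after the freeze is not stated here.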
I have agents running production tasks. Every model swap, prompt change, or retrieval tweak can silently break them. Existing evaluation tools test chat quality. They tell you if the chatbot sounds natural. They do not tell you if your support agent still calculates refunds correctly after a model update. That is a different problem.
I needed to know whether my agents were getting better or worse. Not vibes. Evidence. So I built a protocol that catches regressions before customers do.
$ pip install rigr
$ rigr init
$ rigr test --agent my_agent.py

═══ Rigr Eval Report ═══
5/5 cases | 23/24 fields | 95.8%
Baseline comparison: 0 new errors
✓ PASS
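The percentage in the report is the ratio of matched fields to total fields (23/24 rounds to 95.8%). A minimal sketch of that arithmetic follows, assuming each expected output is a flat dict and a field counts as matched when the values are equal. Rigr's actual matching rules are not described here, and `score_fields` and the field names are hypothetical.

```python
from typing import Any, Dict, Tuple


def score_fields(expected: Dict[str, Any], actual: Dict[str, Any]) -> Tuple[int, int]:
    """Return (matched, total) over the fields named in `expected`.

    Assumption: exact equality decides a match; a missing field is a miss.
    """
    matched = sum(
        1 for key, value in expected.items()
        if key in actual and actual[key] == value
    )
    return matched, len(expected)


# Hypothetical expected vs. actual output for one refund case.
expected = {"refund_amount": 42.50, "currency": "USD", "approved": True}
actual = {"refund_amount": 42.50, "currency": "USD", "approved": False}

matched, total = score_fields(expected, actual)
summary = f"{matched}/{total} fields | {100 * matched / total:.1f}%"
print(summary)  # 2/3 fields | 66.7%
```

Summing `matched` and `total` across all cases gives a run-level figure like the one in the report.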
Not an LLM eval tool. Not a chatbot quality scorer. Not a dashboard you have to learn. It is a single-purpose tool: it tells you whether your agent got better or worse since last time, with evidence.