Active · v0.1

Rigr

Agent evaluation framework. Define what your agent should do, freeze a baseline, and catch regressions before they hit production. Open source.

0 code changes in your agent
<1s per test run
Apache 2.0 license
CLI, zero dependencies beyond Python

What it is

A CLI tool that tests whether your AI agent actually does what it is supposed to do. You write test cases, each an input paired with its expected output. Rigr runs them against your agent and tells you what passed and what failed. Then you freeze a baseline. From that point on, every run is compared against it. New failures get flagged. Failures that now pass get marked resolved.
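The baseline comparison described above can be sketched in a few lines of Python. This is an illustration of the idea, not Rigr's actual implementation: `compare_to_baseline`, `BaselineDiff`, and the case names are hypothetical, and the sketch assumes each run reduces to one pass/fail flag per test case.

```python
from typing import Dict, List, NamedTuple


class BaselineDiff(NamedTuple):
    new_failures: List[str]  # failing now, passing (or absent) in the baseline
    resolved: List[str]      # failing in the baseline, passing now


def compare_to_baseline(baseline: Dict[str, bool], current: Dict[str, bool]) -> BaselineDiff:
    """Classify each case in the current run against the frozen baseline."""
    # Assumption: a case missing from the baseline counts as previously passing,
    # so a brand-new case that fails is reported as a new failure.
    new_failures = sorted(
        case for case, passed in current.items()
        if not passed and baseline.get(case, True)
    )
    resolved = sorted(
        case for case, passed in current.items()
        if passed and baseline.get(case) is False
    )
    return BaselineDiff(new_failures, resolved)


# Hypothetical results: case name -> did the case pass?
baseline = {"refund_basic": True, "refund_partial": False, "refund_expired": True}
current = {"refund_basic": True, "refund_partial": True, "refund_expired": False}

diff = compare_to_baseline(baseline, current)
print("new failures:", diff.new_failures)  # new failures: ['refund_expired']
print("resolved:", diff.resolved)          # resolved: ['refund_partial']
```

Comparing against a frozen snapshot rather than the previous run means a regression keeps showing up until it is fixed or the baseline is deliberately re-frozen.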

Why I built it

I have agents running production tasks. Every model swap, prompt change, or retrieval tweak can silently break them. Existing evaluation tools test chat quality. They tell you if the chatbot sounds natural. They do not tell you if your support agent still calculates refunds correctly after a model update. That is a different problem.

I needed to know whether my agents were getting better or worse. Not vibes. Evidence. So I built a protocol that catches regressions before customers do.

How it works

Define expectations
A JSON schema for what your agent must output, with field-level constraints. No ambiguous "looks good to me."
Write test cases
Inputs with expected outputs. Version-controlled. Reviewable. The same cases run every time.
Freeze baselines
Lock known-good results. Every future run is compared against them. Regressions are caught by the test run, not discovered in production.
Generate audit reports
Per-field accuracy, plus a changelog of what broke and what was fixed. Compliance-ready evidence.
$ pip install rigr
$ rigr init
$ rigr test --agent my_agent.py

═══ Rigr Eval Report ═══
  5/5 cases | 23/24 fields | 95.8%
  Baseline comparison: 0 new errors
  ✓ PASS
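A field count like the 23/24 in the report above can be derived by comparing each expected field with what the agent returned. The following is a minimal sketch, assuming exact-match comparison on flat dictionaries. `score_fields`, `field_accuracy`, and the sample fields are hypothetical and are not part of Rigr's API.

```python
from typing import Any, Dict, List, Tuple

Case = Tuple[Dict[str, Any], Dict[str, Any]]  # (expected output, actual output)


def score_fields(expected: Dict[str, Any], actual: Dict[str, Any]) -> Tuple[int, int, List[str]]:
    """Return (matching fields, total expected fields, names of mismatched fields)."""
    # A field counts as wrong if it is missing or its value differs.
    mismatched = [
        name for name, value in expected.items()
        if name not in actual or actual[name] != value
    ]
    total = len(expected)
    return total - len(mismatched), total, mismatched


def field_accuracy(cases: List[Case]) -> Tuple[int, int, float]:
    """Aggregate field matches across all cases into (correct, total, percent)."""
    correct = total = 0
    for expected, actual in cases:
        matched, count, _ = score_fields(expected, actual)
        correct += matched
        total += count
    percent = 100.0 * correct / total if total else 0.0
    return correct, total, percent


# Hypothetical support-agent cases: one wrong currency, one fully correct.
cases = [
    ({"refund_amount": 25.0, "currency": "USD", "approved": True},
     {"refund_amount": 25.0, "currency": "EUR", "approved": True}),
    ({"refund_amount": 0.0, "currency": "USD", "approved": False},
     {"refund_amount": 0.0, "currency": "USD", "approved": False}),
]

correct, total, percent = field_accuracy(cases)
print(f"{correct}/{total} fields | {percent:.1f}%")  # 5/6 fields | 83.3%
```

Scoring per field rather than per case shows which part of an output drifted, for example a wrong currency on an otherwise correct refund.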

What it is not

Not an LLM eval tool. Not a chatbot quality scorer. Not a dashboard you have to learn. It is a single-purpose tool: it tells you whether your agent got better or worse since last time, with evidence.