Post 07

We spent months measuring whether our AI was getting better or worse


Rigr is open source: github.com/Null-Phnix/rigr

The thing nobody tells you about agents

I have been building AI agents for a while now. Not the kind that answer trivia questions. The kind that actually do work. Trade crypto. Scrape documentation. Extract structured data from ancient texts.

Here is the thing nobody tells you. Agents break. Silently. You change the prompt by three words. You swap the model. You tweak the retrieval pipeline. And suddenly your agent that was nailing refund calculations last week is now confidently wrong about everything. And you have no way to know until a customer tells you.

Most teams handle this by looking at outputs and going "yeah that looks right." That is not testing. That is vibes.

What we learned from our own agent

We have a model that predicts structured outcomes from text inputs. Very niche. Very structured. The problem was that we kept making changes. New training data. Tweaked loss functions. Different hyperparameters. And sometimes the model got worse on specific fields, but we did not notice because the overall score looked fine. A gain on one field can cancel out a regression on another, so the aggregate hides it.

So we built a protocol. Frozen baselines. Every time we run the model, we compare the results against a known-good snapshot. If a field that was passing before suddenly fails, we catch it immediately. Not next week when we happen to re-read the outputs.
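Here is a minimal sketch of that comparison, assuming each run boils down to a mapping from field name to pass or fail. The function name and the field names are hypothetical, made up for illustration. This is not Rigr's actual code.

```python
from typing import Dict, List


def compare_to_baseline(
    baseline: Dict[str, bool], current: Dict[str, bool]
) -> Dict[str, List[str]]:
    """Classify each baseline field by how its pass/fail status changed."""
    report: Dict[str, List[str]] = {
        "regressed": [],
        "resolved": [],
        "still_failing": [],
    }
    for field, passed_before in baseline.items():
        # Assumption: a field missing from the current run counts as failing.
        passed_now = current.get(field, False)
        if passed_before and not passed_now:
            report["regressed"].append(field)
        elif not passed_before and passed_now:
            report["resolved"].append(field)
        elif not passed_before and not passed_now:
            report["still_failing"].append(field)
    return report


# Hypothetical field names, for illustration only.
frozen = {"amount": True, "currency": True, "date": False}
latest = {"amount": False, "currency": True, "date": True}
result = compare_to_baseline(frozen, latest)
print(result)
```

In this toy run the overall pass count is unchanged at two of three fields, yet one field regressed. That is the case an aggregate score hides and a per-field diff surfaces.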

We also built audit packs. Extra test cases that the model never saw during development. Turns out our best model, which scored 85 percent on validation, dropped to 52 percent on the audit pack. The validation score was lying to us. The protocol caught it.
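The check itself can be small. A rough sketch, assuming each set is a list of pass/fail outcomes: the set sizes of 100 are invented so the scores match the 85 and 52 percent above, and the 10 point threshold and function names are assumptions too.

```python
from typing import Sequence


def accuracy(outcomes: Sequence[bool]) -> float:
    """Fraction of cases that passed."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(outcomes) / len(outcomes)


def audit_gap_exceeded(
    validation: Sequence[bool], audit: Sequence[bool], max_gap: float = 0.10
) -> bool:
    """True when the audit score trails validation by more than max_gap."""
    return accuracy(validation) - accuracy(audit) > max_gap


# Invented set sizes (100 each), chosen only to reproduce 85% and 52%.
validation_outcomes = [True] * 85 + [False] * 15
audit_outcomes = [True] * 52 + [False] * 48

gap = accuracy(validation_outcomes) - accuracy(audit_outcomes)
print(f"gap: {gap:.2f}")
print(audit_gap_exceeded(validation_outcomes, audit_outcomes))
```

The point of the threshold is to turn "the audit looks a bit worse" into a hard failure you cannot scroll past.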

What Rigr actually does

It is dead simple. You write test cases. Input and expected output. Rigr runs them against your agent and tells you what passed and what failed. You freeze a baseline. From that point on, every run compares against it. New failures get flagged. Old failures that got fixed get marked resolved.

$ pip install rigr
$ rigr init
$ rigr test --agent my_agent.py

═══ Rigr Eval Report ═══
  5/5 cases | 23/24 fields | 95.8%
  ✓ PASS
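
To make the field counts in a report like that concrete, here is a toy version of field-level scoring. Everything in it is an assumption for illustration: the case format, the exact-match rule, and the stand-in agent. The post does not show how Rigr represents cases or decides that a case passes, so this sketch only tallies fields.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical case format: {"input": ..., "expected": {field: value}}
Case = Dict[str, Any]


def score_fields(
    agent: Callable[[Any], Dict[str, Any]], cases: List[Case]
) -> Tuple[int, int]:
    """Run every case through the agent and count exact field matches."""
    matched = total = 0
    for case in cases:
        output = agent(case["input"])
        for field, expected in case["expected"].items():
            total += 1
            if output.get(field) == expected:
                matched += 1
    return matched, total


def toy_agent(text: str) -> Dict[str, Any]:
    """Stand-in agent: parses 'refund <amount> <currency>'."""
    _, amount, currency = text.split()
    return {"amount": float(amount), "currency": currency.upper()}


cases = [
    {"input": "refund 12.50 usd", "expected": {"amount": 12.5, "currency": "USD"}},
    {"input": "refund 8 eur", "expected": {"amount": 8.0, "currency": "EUR"}},
    # Deliberately mismatched expectation so one field fails.
    {"input": "refund 3 gbp", "expected": {"amount": 3.0, "currency": "gbp"}},
]
matched, total = score_fields(toy_agent, cases)
print(f"{matched}/{total} fields | {matched / total:.1%}")
```

Scoring per field rather than per case is what lets a single wrong value show up as its own line item instead of vanishing into a pass rate.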

Why open source

The agent infrastructure space is weird right now. Everyone is building the thing that helps you build agents. Nobody is building the thing that helps you know if your agent actually works.

Rigr is not an LLM eval tool. Not a chatbot quality scorer. Just a CLI that tells you whether your agent got better or worse since last time, with evidence.


