Honest results from a small GRPO lab

This is not a DeepSeek-R1 reproduction. It is an inspectable training and eval harness for RLVR and GRPO experiments, built to run cheap local smoke tests and then scale the same workflow to rented GPUs. The full report and sample gallery are on GitHub.

Reinforcement learning from verifiable rewards (RLVR) is the idea behind a lot of the recent reasoning-model progress. On math, you do not need a reward model guessing whether an answer is good. You can check it. The answer is right or it is not. That makes math a clean place to study what the training actually does, because the reward is not a vibe. It is a unit test.
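To make the "unit test" framing concrete, here is a minimal sketch of a verifiable reward for GSM8K-style problems. It is not the harness's actual grader: the function names and the "last number in the completion" rule are assumptions for illustration. The only fact it relies on is that GSM8K reference solutions end with a line of the form "#### <number>".

```python
import re
from decimal import Decimal

# Integer or decimal, optional sign, optional thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def parse_number(text):
    """Return the last number found in text as a Decimal, or None."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return Decimal(matches[-1].replace(",", ""))

def gold_answer(gsm8k_solution):
    """GSM8K reference solutions end with '#### <number>'."""
    return parse_number(gsm8k_solution.split("####")[-1])

def verifiable_reward(completion, gsm8k_solution):
    """Hypothetical reward: 1.0 if the completion's last number equals the reference."""
    predicted = parse_number(completion)
    gold = gold_answer(gsm8k_solution)
    return 1.0 if predicted is not None and predicted == gold else 0.0
```

The reward is binary and deterministic, which is what lets you argue with the number. Note that "last number wins" is a tolerant rule: it gives credit even when the ending is messy, which is why formatting has to be measured separately.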

I did not set out to beat anything. I set out to build a harness I could trust, run small models through it, and read every sample by hand. The interesting findings came out of one specific problem: the answer boundary.

The answer-boundary problem

When you grade a reasoning model on GSM8K, you have to extract its final answer from a wall of reasoning. If the model writes the right number but then keeps talking, or buries it mid-paragraph, a strict grader marks it wrong even though the model knew the answer. That is not a reasoning failure. It is a formatting failure, and the two get tangled together if you are not careful about how you measure.

So I tracked three things separately on every run: exact correctness, whether the final line was clean and parseable, and whether there was trailing text after the answer. Pulling those apart is what made the rest of the experiment readable.
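A minimal sketch of what tracking the three signals separately can look like. The "Final answer: <number>" convention, the Grade fields, and the rule that a single closing period does not count as trailing text are all assumptions for illustration, not the definitions used in the repo.

```python
import re
from dataclasses import dataclass

# Hypothetical answer convention: "Final answer: <number>".
NUM = r"(-?\d[\d,]*(?:\.\d+)?)"
ANSWER_LINE = re.compile(r"^Final answer:\s*" + NUM + r"$")
ANSWER_ANYWHERE = re.compile(r"Final answer:\s*" + NUM)

@dataclass
class Grade:
    exact: bool              # extracted value equals the reference
    strict_final_line: bool  # last non-empty line is exactly the answer line
    trailing_text: bool      # something follows the answer

def to_number(s):
    return float(s.replace(",", ""))

def grade(completion, gold):
    lines = [ln.strip() for ln in completion.splitlines() if ln.strip()]
    strict = bool(lines) and ANSWER_LINE.match(lines[-1]) is not None

    found = list(ANSWER_ANYWHERE.finditer(completion))
    if not found:
        return Grade(exact=False, strict_final_line=strict, trailing_text=False)

    last = found[-1]
    exact = to_number(last.group(1)) == to_number(gold)
    # Assumption: whitespace or one closing period after the answer is not "trailing text".
    tail = completion[last.end():].strip()
    return Grade(exact=exact, strict_final_line=strict, trailing_text=tail not in ("", "."))
```

Under these definitions, a completion ending in "Final answer: 18." is exact, not strict, and has no trailing text, while one that keeps talking after the answer is exact but flagged for trailing text. That is the kind of separation a single accuracy number hides.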

base 3B (strict prompt)
  │
  ├─ rationale SFT ──▶ GRPO variants
  │
  └─ stop-aware eval ──▶ boundary SFT
                           │  teach the model to stop cleanly after the answer
                           ▼
                         v4: source-final-line filter ──▶ promoted 3B branch
                           │
                           ▼
                         transfer the same recipe to 7B
                           ├─ 7B adapter ── rejected (see below)
                           └─ 7B base ── 512 + full GSM8K eval ──▶ failure taxonomy

What worked on 3B

The 3B model needed help with answer boundaries. Left alone, it knew a lot of the answers but lost credit by not stopping cleanly. The promoted fix was boundary self-distillation: a source-final-line SFT pass that taught it to end after the answer. On the 512-example check it scored 429/512 exact (83.8%), 361/512 strict final line (70.5%), and 0/512 trailing text. That last number is the one I was chasing. Zero trailing text means the model stopped where it should on every one of the 512 examples.
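One way to read "source-final-line filter" is sketched below: keep only the model's own sampled completions that are correct and whose last line is a clean answer line, then fine-tune on those. This is my guess at the shape of such a filter, not the repo's implementation. The sample schema (prompt, completion, gold) and the "Final answer:" convention are hypothetical.

```python
import re

# Hypothetical answer convention: the last line must be exactly "Final answer: <number>".
ANSWER_LINE = re.compile(r"^Final answer:\s*(-?\d[\d,]*(?:\.\d+)?)$")

def to_number(s):
    return float(s.replace(",", ""))

def build_boundary_sft_set(samples):
    """Filter self-generated samples down to clean, correct endings.

    samples: iterable of dicts with 'prompt', 'completion', 'gold' (assumed schema).
    """
    kept = []
    for s in samples:
        text = s["completion"].rstrip()
        last_line = text.splitlines()[-1].strip() if text else ""
        match = ANSWER_LINE.match(last_line)
        if match is None:
            continue  # ending is not a clean answer line
        if to_number(match.group(1)) != to_number(s["gold"]):
            continue  # clean ending, wrong answer: do not reinforce it
        kept.append({"prompt": s["prompt"], "completion": text})
    return kept
```

Because the targets come from the model's own outputs, a filter like this changes where the model stops without introducing a new reasoning style.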

What got rejected on 7B

The obvious next move is to take the recipe that helped the 3B and apply it to the 7B. It did not work, and I kept the result instead of quietly dropping it.

The 7B base model did not have the boundary problem in the first place. On the full GSM8K test split it scored 1164/1319 exact (88.2%), 1296/1319 strict final line (98.3%), and 0/1319 trailing text with no boundary SFT at all. It already stops cleanly. So applying the 3B fix to it was solving a problem the model did not have. After a tolerant rescore, the adapter actually came out slightly behind the base model on exact accuracy. A paired bootstrap confirmed that the delta in exact accuracy was negative.
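For readers who have not used one, a paired bootstrap over per-example correctness can be sketched like this. It is a generic percentile bootstrap written for illustration, not the repo's script. The function name, the 95% interval, and the resample count are my assumptions.

```python
import random

def paired_bootstrap(base_correct, adapter_correct, n_resamples=10_000, seed=0):
    """Resample examples (not models) to estimate uncertainty in the accuracy delta.

    base_correct, adapter_correct: per-example booleans on the SAME examples, same order.
    """
    if len(base_correct) != len(adapter_correct):
        raise ValueError("both runs must be graded on the same examples")
    n = len(base_correct)
    # Per-example difference: +1 adapter fixed it, -1 adapter broke it, 0 no change.
    diffs = [int(a) - int(b) for a, b in zip(adapter_correct, base_correct)]
    observed = sum(diffs) / n

    rng = random.Random(seed)
    deltas = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo = deltas[int(0.025 * n_resamples)]
    hi = deltas[int(0.975 * n_resamples) - 1]
    frac_negative = sum(d < 0 for d in deltas) / n_resamples
    return {"delta": observed, "ci95": (lo, hi), "frac_negative": frac_negative}
```

The pairing matters: both models are scored on the same examples, so resampling the per-example differences removes the variance that comes from some questions simply being harder than others.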

So the honest write-up says: the boundary recipe is a real, stable win on 3B, and it is the wrong tool for 7B. Reporting "it worked on 3B" without "it regressed on 7B" would have been the easier story and the false one. The negative result is in the repo with the bootstrap numbers next to it.

Where the 7B errors actually live

If the 7B is not losing points on formatting, where is it losing them? I built a failure taxonomy over the wrong answers to find out. Of the 155 full-test examples it got wrong, 149 still had a clean, parseable final-answer line. The model is not fumbling the format. It is getting the math wrong. Those are genuine reasoning errors, not extraction artifacts, and that distinction changes what you would do next. You do not fix reasoning errors with a formatting pass.
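The arithmetic lines up with the earlier numbers: 1319 - 1164 = 155 wrong answers, and 149 of those 155 (about 96%) still end in a clean answer line. A first-pass taxonomy over wrong answers could be sketched as below. The category names and the "Final answer:" convention are hypothetical, and the real taxonomy in the repo may cut the errors differently.

```python
import re
from collections import Counter

# Hypothetical answer convention: "Final answer: <number>" on a line of its own.
ANSWER_LINE = re.compile(r"^Final answer:\s*-?\d[\d,]*(?:\.\d+)?$")

def classify_failure(completion):
    """Bucket an already-known-wrong completion by how its ending looks."""
    lines = [ln.strip() for ln in completion.splitlines() if ln.strip()]
    if not lines:
        return "empty_output"
    if ANSWER_LINE.match(lines[-1]):
        return "clean_final_line_wrong_value"  # format is fine, the math is not
    if any(ANSWER_LINE.match(ln) for ln in lines):
        return "answer_not_on_final_line"      # boundary problem
    return "no_parseable_answer"               # extraction problem

def failure_taxonomy(wrong_completions):
    return Counter(classify_failure(c) for c in wrong_completions)
```

If nearly everything lands in the first bucket, as it did for the 7B base model, a formatting pass has almost nothing left to fix.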

Why I build it this way

Verifiable rewards keep you honest. The grader is a real check, not a model's opinion, so you can argue with the number.
Separate the failure modes. Exact, final-line, and trailing-text are three different things. Collapsing them hides exactly the signal you need.
Keep the negative results. The rejected 7B adapter is more useful than the promoted 3B branch, because it tells you when the recipe stops applying.
Bootstrap your deltas. A small accuracy difference on 512 examples can be noise. A paired bootstrap tells you whether to trust it.

The point of the lab is not a leaderboard score. It is a workflow I can run cheaply, read end to end, and scale to bigger models without changing how I reason about the results. Evidence over claims, with the evidence committed next to the claim.

The full report, results JSON, and the representative sample gallery are on GitHub. This thread connects to DriftBench, which applies the same evidence-first instinct to agent reliability.