Agent output
trajectory + final state
Code-based graders
Methods
exact_match
unit_tests
state_check
tool_calls
Grading logic
restored_throughput == 4.7
true
✓
tests.run("routing_L7")
pass
✓
hss.subscribers.has(imsi)
true
✓
+
fast, objective, reproducible
−
brittle to valid variations
Model-based graders
Methods
rubric
open_judgment
reference
consensus
Judgment
“Did the agent explain the fix clearly?”
●●●●
◯
4 / 5
“Names a single root-cause link?”
yes
+
nuance, open-ended, freeform
−
non-deterministic, needs labels
Score
0.0 – 1.0