Agent output
trajectory + final state
Code-based graders
Methods
exact_matchunit_tests state_checktool_calls
Grading logic
restored_throughput == 4.7true
tests.run("routing_L7")pass
hss.subscribers.has(imsi)true
+ fast, objective, reproducible
brittle to valid variations
Model-based graders
Methods
rubricopen_judgment referenceconsensus
Judgment
“Did the agent explain the fix clearly?”
●●●● 4 / 5
“Names a single root-cause link?”  yes
+ nuance, open-ended, freeform
non-deterministic, needs labels
Score
0.0 – 1.0