Grading functions

Agent output

trajectory + final state

Code-based graders

Methods

exact_matchunit_tests state_checktool_calls

Grading logic

restored_throughput == 4.7true ✓
tests.run("routing_L7")pass ✓
hss.subscribers.has(imsi)true ✓

+ fast, objective, reproducible
− brittle to valid variations

Model-based graders

Methods

rubricopen_judgment referenceconsensus

Judgment

“Did the agent explain the fix clearly?”

●●●●◯ 4 / 5

“Names a single root-cause link?” yes

+ nuance, open-ended, freeform
− non-deterministic, needs labels

Score

0.0 – 1.0