Single-turn evals

A single-turn evaluation: a prompt and data go into an LLM, which returns a response that grading logic checks.
  • Give an AI an input, then apply grading logic to its output to measure success.
  • For earlier LLMs, single-turn, non-agentic evals were the main evaluation method.
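
The single-turn flow above can be sketched in a few lines of Python. This is a minimal illustration, not a reference harness: `toy_model`, `exact_match`, `run_single_turn_eval`, and the three test cases are all hypothetical names and data made up for the example, and the "model" is a lookup table standing in for a real LLM call.

```python
from typing import Callable

def exact_match(response: str, expected: str) -> bool:
    """Grading logic (assumed): case-insensitive exact match."""
    return response.strip().lower() == expected.strip().lower()

def run_single_turn_eval(
    model: Callable[[str], str],
    cases: list,
    grader: Callable[[str, str], bool] = exact_match,
) -> float:
    """Send each prompt to the model once and return the fraction graded correct."""
    if not cases:
        return 0.0
    passed = sum(grader(model(prompt), expected) for prompt, expected in cases)
    return passed / len(cases)

def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: answers from a fixed table."""
    answers = {"2 + 2 = ?": "4", "Capital of France?": "paris"}
    return answers.get(prompt, "unknown")

# Hypothetical (prompt, expected answer) pairs.
CASES = [
    ("2 + 2 = ?", "4"),
    ("Capital of France?", "Paris"),
    ("Largest planet?", "Jupiter"),
]

score = run_single_turn_eval(toy_model, CASES)
print(f"accuracy: {score:.2f}")
```

Each case is one input and one graded output; nothing carries over between cases, which is what makes this style of eval simple to run and to grade.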

Agentic evals

An agentic evaluation: tools, environment, and task feed an agent that acts in a loop, calling tools and updating the environment, until it finishes; grading logic then runs tests to score the result.
  • An agent uses tools across many turns, modifying state as it goes; mistakes can propagate and compound.
  • Models can find creative solutions that a static eval did not anticipate, so we grade the outcome, not just the steps.
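
The agent loop and outcome-based grading described above might look like the sketch below. Everything here is an assumption made for illustration: the environment is a plain dict acting as a fake file system, `write_file` and `read_file` are hypothetical tools, and the "agent" is a fixed script rather than a model. The point is the shape: the loop runs until the agent stops or hits a turn limit, and the grader inspects only the final environment state.

```python
from typing import Callable, Optional

def write_file(env: dict, path: str, text: str) -> str:
    """Hypothetical tool: mutates the environment and returns an observation."""
    env[path] = text
    return "ok"

def read_file(env: dict, path: str) -> str:
    """Hypothetical tool: reads from the environment without changing it."""
    return env.get(path, "")

TOOLS = {"write_file": write_file, "read_file": read_file}

def run_agent(
    agent: Callable[[str], Optional[tuple]],
    env: dict,
    max_turns: int = 10,
) -> int:
    """Loop: ask the agent for an action, run the tool, feed back the result.

    Returns the number of tool calls made. The agent signals completion by
    returning None; max_turns guards against a loop that never finishes.
    """
    observation = "start"
    for turn in range(1, max_turns + 1):
        action = agent(observation)
        if action is None:
            return turn - 1
        name, args = action
        observation = TOOLS[name](env, *args)
    return max_turns

def make_scripted_agent() -> Callable[[str], Optional[tuple]]:
    """Stand-in for a model: replays a fixed list of tool calls, then stops."""
    script = iter([
        ("write_file", ("notes.txt", "hello")),
        ("read_file", ("notes.txt",)),
        ("write_file", ("result.txt", "HELLO")),
    ])
    def agent(observation: str) -> Optional[tuple]:
        return next(script, None)
    return agent

def grade(env: dict) -> bool:
    """Outcome-based grading: check the final state, not the path taken."""
    return env.get("result.txt") == "HELLO"

env = {}
turns_used = run_agent(make_scripted_agent(), env)
passed = grade(env)
print(f"turns: {turns_used}, passed: {passed}")
```

Because `grade` looks only at the environment, an agent that reaches the same final state by a different sequence of tool calls would still pass.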