Single-turn evals

A single-turn evaluation: a prompt and data go into an LLM, which returns a response that grading logic checks.

Give an AI an input, then apply grading logic to its output to measure success.
For earlier LLMs, single-turn, non-agentic evals were the main evaluation method.
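The single-turn flow described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference harness: the names `Case`, `exact_match`, `run_single_turn_eval`, and `stub_model` are hypothetical, the model is a stand-in function rather than a real LLM call, and exact match is only one of many possible grading rules.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    prompt: str    # instruction given to the model
    data: str      # input the instruction applies to
    expected: str  # reference answer used by the grader


def exact_match(response: str, expected: str) -> bool:
    """Grading logic (assumed): exact match after trimming whitespace."""
    return response.strip() == expected.strip()


def run_single_turn_eval(model: Callable[[str], str], cases: list[Case]) -> float:
    """Call the model once per case and return the fraction graded correct."""
    if not cases:
        return 0.0
    passed = sum(
        exact_match(model(f"{c.prompt}\n\n{c.data}"), c.expected) for c in cases
    )
    return passed / len(cases)


def stub_model(text: str) -> str:
    """Hypothetical stand-in for an LLM: upper-cases the last line of its input."""
    return text.splitlines()[-1].upper()


cases = [
    Case("Uppercase the word.", "cat", "CAT"),
    Case("Uppercase the word.", "dog", "DOG"),
    Case("Reverse the word.", "abc", "cba"),  # the stub cannot do this one
]
score = run_single_turn_eval(stub_model, cases)
```

Each case involves exactly one model call and one grading decision, so the score is simply the pass rate across cases. Swapping `stub_model` for a real model client would leave the rest of the loop unchanged.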

Agentic evals

An agentic evaluation: tools, environment, and task feed an agent that acts in a loop, calling tools and updating the environment, until it finishes; grading logic then runs tests to score the result.

An agent uses tools across many turns, modifying state as it goes; mistakes can propagate and compound.
Models can also find creative solutions that a static eval never anticipated, so we grade the outcome, not just the steps.
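As a rough sketch of the agentic loop and outcome-based grading, assuming a toy environment that is just a dictionary of file contents: the tool names, the `run_agent` loop, the `grade_outcome` check, and the scripted policy below are all hypothetical stand-ins, not any particular framework's API.

```python
from typing import Callable, Optional

Env = dict[str, str]                    # toy environment: path -> file contents
Action = tuple[str, str, str]           # (tool name, path, content)
Policy = Callable[[str, list[str]], Optional[Action]]


def write_file(env: Env, path: str, content: str) -> str:
    env[path] = content                 # tools may modify environment state
    return f"wrote {path}"


def read_file(env: Env, path: str, content: str) -> str:
    return env.get(path, "<missing>")   # content argument is unused for reads


TOOLS = {"write_file": write_file, "read_file": read_file}


def run_agent(policy: Policy, task: str, env: Env, max_turns: int = 10) -> list[str]:
    """Loop: ask the policy for an action, run the tool, record the observation."""
    history: list[str] = []
    for _ in range(max_turns):
        action = policy(task, history)
        if action is None:              # the agent signals it has finished
            break
        name, path, content = action
        history.append(TOOLS[name](env, path, content))
    return history


def grade_outcome(env: Env) -> bool:
    """Outcome check on the final state, regardless of the steps taken."""
    return env.get("answer.txt", "").strip() == "42"


def scripted_policy(task: str, history: list[str]) -> Optional[Action]:
    """Hypothetical stand-in for a model: read the input, write the answer, stop."""
    steps = [("read_file", "input.txt", ""), ("write_file", "answer.txt", "42")]
    return steps[len(history)] if len(history) < len(steps) else None


env = {"input.txt": "6 * 7"}
task = "Compute the product in input.txt and save it to answer.txt"
transcript = run_agent(scripted_policy, task, env)
passed = grade_outcome(env)
```

Because `grade_outcome` inspects only the final environment, an agent that reaches the right end state by a different sequence of tool calls still passes, while one that runs out of turns before writing the answer fails.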