How to design agentic evaluations in telecom

GSMA × China Telecom

01

What is an eval?

02

A philosopher evaluates two automata reasoning about geometric solids.

Essentially an exam for an AI system. It gives us:

Quantitative and qualitative measures of capability
Comparisons across accuracy, price, and speed

03

04

Components of Evaluations for Agents

A suite bundles tasks, graders, and repeated trials into one measurement: the evaluation harness runs each task across trials, then graders score the trajectory and outcome.
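How these pieces fit together can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the harness used in the talk: `Task`, `Trial`, `run_suite`, the toy agent, and both graders are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trial:
    trajectory: List[str]  # steps the agent took (tool calls, messages)
    outcome: str           # final answer or end state


@dataclass
class Task:
    name: str
    prompt: str


# Hypothetical type aliases: an agent maps a prompt to one trial,
# a grader maps (task, trial) to a score between 0 and 1.
Agent = Callable[[str], Trial]
Grader = Callable[[Task, Trial], float]


def run_suite(
    tasks: List[Task],
    agent: Agent,
    graders: Dict[str, Grader],
    n_trials: int = 3,
) -> Dict[str, Dict[str, float]]:
    """Run every task n_trials times and average each grader's score."""
    results: Dict[str, Dict[str, float]] = {}
    for task in tasks:
        trials = [agent(task.prompt) for _ in range(n_trials)]
        results[task.name] = {
            name: sum(grade(task, t) for t in trials) / n_trials
            for name, grade in graders.items()
        }
    return results


def toy_agent(prompt: str) -> Trial:
    # Stand-in for a real agent; always takes the same two steps.
    return Trial(trajectory=["lookup_cell", "answer"], outcome="cell-42")


def outcome_grader(task: Task, trial: Trial) -> float:
    # Scores the end state only.
    return 1.0 if trial.outcome == "cell-42" else 0.0


def trajectory_grader(task: Task, trial: Trial) -> float:
    # Scores how the agent got there: did it call the lookup tool?
    return 1.0 if "lookup_cell" in trial.trajectory else 0.0


report = run_suite(
    [Task("find-cell", "Which cell serves site A?")],
    toy_agent,
    {"outcome": outcome_grader, "trajectory": trajectory_grader},
)
```

Repeating trials matters because agents are not deterministic: a single run can pass or fail by luck, while an average over trials is a steadier measurement.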

05

How can we make an eval?

06

From use case to measurement

07

Diagram mapping the real infrastructure behind a use case onto a simulated environment built from open-source components.

Built from a recipe

The environment rebuilds the infrastructure behind one use case from a recipe: open-source components stitched into the smallest world the task needs. Realistic enough to do the job; stable enough that failures come from the agent, not the sandbox.

Real infrastructure → Environment

08

Two jobs an eval can do

A capability eval curve climbing toward saturation as a model improves.

Capability. How good can it get?

A regression eval catching a sudden drop in a previously passing capability.

Regression. Did a known skill slip?
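The two jobs read the same pass/fail results differently. The sketch below is one possible way to express that, assuming each trial reduces to a boolean; `pass_rate`, `capability_headroom`, `regressed`, and the 5-point tolerance are illustrative choices, not values from the talk.

```python
from typing import List


def pass_rate(results: List[bool]) -> float:
    """Fraction of trials that passed."""
    if not results:
        raise ValueError("need at least one trial result")
    return sum(results) / len(results)


def capability_headroom(results: List[bool]) -> float:
    """Capability view: how far the agent is from saturating the eval."""
    return 1.0 - pass_rate(results)


def regressed(
    baseline: List[bool], current: List[bool], tolerance: float = 0.05
) -> bool:
    """Regression view: did a previously passing skill drop beyond tolerance?"""
    return pass_rate(current) < pass_rate(baseline) - tolerance
```

A capability eval is most useful while headroom is large; once the pass rate saturates, the same tasks can be kept as a regression suite.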

09

How to grade an agent's work

Code grades facts. Models grade judgment.
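A rough sketch of the split, assuming a fact can be checked by string comparison and a judgment call is delegated to a model. `code_grader`, `model_grader`, and `stub_judge` are hypothetical; `stub_judge` is a placeholder for a real model call so that the example runs offline.

```python
from typing import Callable


def code_grader(outcome: str, expected: str) -> bool:
    """Facts: a deterministic check that needs no model."""
    return outcome.strip().lower() == expected.strip().lower()


def model_grader(response: str, rubric: str, judge: Callable[[str], str]) -> bool:
    """Judgment: ask a judge model to apply a rubric and parse its verdict."""
    prompt = (
        f"Rubric:\n{rubric}\n\n"
        f"Response:\n{response}\n\n"
        "Answer PASS or FAIL."
    )
    verdict = judge(prompt)
    return verdict.strip().upper().startswith("PASS")


def stub_judge(prompt: str) -> str:
    # Placeholder for a model call (an assumption for this sketch):
    # passes any response that mentions a rollback.
    response_part = prompt.split("Response:\n", 1)[1]
    return "PASS" if "rollback" in response_part.lower() else "FAIL"
```

Code graders are cheap and repeatable, so they fit anything with a single right answer; a model grader is worth its cost and variance only where the rubric needs interpretation.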

10

StockPilot

One agent, the whole method: use case, capabilities, tools, suite, graders.

11

Now, what are we doing today?

12

Where you'll submit

13

What is on your desk

Rowter rowter.open-telco-hackathons.co.uk

A RAG agent with full access to the GSMA Telco corpus.

Art of Evals art-of-evals.open-telco-hackathons.co.uk

A guide on how to design agentic evals.

The art-of-evals guide page, "Structure of an evaluation": single-turn vs agentic evals, with the comparison diagram.

14

How we score your submission

01 Realism

A real telecom task?

02 Difficulty

Separates the models?

03 Novelty

Beyond training data?

04 Scoring

Graded by code?

05 Practicality

Within compute budget?

15

Let's design evals!

Rowter rowter.open-telco-hackathons.co.uk

The guide art-of-evals.open-telco-hackathons.co.uk

The hackathon agentic-evals.open-telco-hackathons.co.uk/mwc-shanghai-hackathon

MWC Shanghai · GSMA · China Telecom

16

How to design agentic evaluations in telecom