GSMA×China Telecom
Pretty much an exam for an AI system that gives us:
The environment rebuilds the infrastructure behind one use case from a recipe: open-source components stitched into the smallest world the task needs. Realistic enough to do the job; stable enough that failures come from the agent, not the sandbox.
Capability. How good can it get?
Regression. Did a known skill slip?
Code grades facts. Models grade judgment.
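A minimal sketch of that split, not the method used in the talk: a deterministic check for a verifiable fact, and a rubric score from a judge model for a judgment call. Every name here (`GradeResult`, `grade_fact`, `grade_judgment`, `stub_judge`), the 0.7 threshold, and the sample cell ID are hypothetical; `stub_judge` stands in for a real model call and just returns a fixed score.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GradeResult:
    passed: bool
    reason: str


def grade_fact(expected: str, actual: str) -> GradeResult:
    """Code grader: deterministic comparison of a verifiable fact."""
    # Normalize whitespace and case so formatting noise does not fail the check.
    ok = expected.strip().lower() == actual.strip().lower()
    reason = "exact match" if ok else f"expected {expected!r}, got {actual!r}"
    return GradeResult(ok, reason)


def grade_judgment(
    answer: str,
    rubric: str,
    judge: Callable[[str], float],
    threshold: float = 0.7,  # assumed cut-off, not from the source
) -> GradeResult:
    """Model grader: a judge scores the answer against a rubric in [0, 1]."""
    prompt = f"Rubric:\n{rubric}\n\nAnswer:\n{answer}\n\nScore from 0 to 1:"
    score = judge(prompt)
    ok = score >= threshold
    return GradeResult(ok, f"judge score {score:.2f} vs threshold {threshold:.2f}")


def stub_judge(prompt: str) -> float:
    """Hypothetical stand-in for a model call; always returns the same score."""
    return 0.8


if __name__ == "__main__":
    # Fact: did the agent name the right cell? (hypothetical identifier)
    print(grade_fact("CELL-42", " cell-42 "))
    # Judgment: is the explanation adequate for an operator?
    print(grade_judgment("Restart the node after draining traffic.",
                         "Explains the fix and its risk.", stub_judge))
```

Keeping the fact check in plain code makes it repeatable run to run; only the part that cannot be reduced to a comparison goes to the judge.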
One agent, the whole method: use case, capabilities, tools, suite, graders.
A real telecom task?
Separates the models?
Beyond training data?
Graded by code?
Within compute budget?
MWC Shanghai · GSMA · China Telecom