Inventory management agent for a mid-size outdoor-gear retailer.
Operating with 250 SKUs, 3 warehouses, 12 suppliers, & 90 days of sales history
Capabilities

Monitor and flag

Watch stock levels, surface SKUs below reorder point.

Forecast demand

Especially around promos
and seasonality.

Choose suppliers

Price vs lead-time vs reliability.

Place orders

Create POs, update ERP records.

Notify ops

Slack alerts, supplier emails.

Report weekly

Generate the ops summary.

Built right… for early 2025

Every choice was defensible when made. Bottlenecks in capability emerge from accumulation.
~400 line prompt
71% eval
Claude
StockPilot orchestrator
raw Messages API · hand-rolled while-loop
12 inline tools — every result dumped raw into context tools[]
get_stock_level
forecast_demand
compare_supplier_quotes
generate_weekly_report
list_low_stock
get_sales_velocity
get_supplier_catalog
draft_email_to_supplier
create_purchase_order
update_erp_record
send_slack_alert
search_web_for_disruptions

forecasting subagent

90d history → prose forecast

procurement subagent

compares quotes in prose

writing subagent

full loop to fill a template

…three of which are thin wrappers around a hardcoded subagent — full round-trip, returns prose

12 eval tasks with 5 grader types

IDTaskWhat it testsGrader
R1Stock lookupSingle readexact_match
R2Below reorder pointList all SKUs below thresholdset_match
R3–R5Create PO · lead times · cycle-countWrite paths, joinsaction_taken / set_match
R6–R7Reorder rec · 14-day forecastFormula, baseline forecastnumeric_tolerance ±20%
R8Promo-month forecastMean anchoringnumeric_tolerance ±25%
R9Weekly reportReport structurellm_judge
F1Daily low-stock sweepStockout managementcomposite (action + 270s wall + ranked top-3)
F2Promo reorder w/ forecastRecommendation qualityregex_present (numeric confidence)
F3Batch low-stock alertsWhat do 10 routine alerts cost us?efficiency (≤5k out-tokens)
Score = (PASS + ½·PASS-SLOW) / 12 · reference: before/after starter 63-75% · after 83-100%

Every capability, tracked over versions

FakeGPT 1.0 FakeGPT 2.0 FakeGPT 2.5 FakeGPT 3.0 FakeGPT 3.5 FakeGPT 4.0 Stock lookup 0.62 0.85 0.99 1.00 1.00 1.00 Below reorder point 0.48 0.66 0.82 0.90 0.93 0.95 Create PO 0.41 0.58 0.74 0.85 0.90 0.93 14-day forecast 0.45 0.60 0.73 0.84 0.89 0.92 Promo-month forecast 0.43 0.57 0.71 0.86 0.90 0.79 Weekly report 0.50 0.64 0.78 0.87 0.92 0.95 clears 0.85 below 0.85 cleared, then fell below