Watch stock levels, surface SKUs below reorder point.
Especially around promos and seasonality.
Price vs lead-time vs reliability.
Create POs, update ERP records.
Slack alerts, supplier emails.
Generate the ops summary.
90-day history → prose forecast.
Compares quotes in prose.
Full loop to fill a template.
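
As a rough illustration of the first item above (surfacing SKUs below reorder point), here is a minimal sketch. The `StockRow` type, its field names, and the sample quantities are hypothetical and not taken from any real ERP schema. It also assumes "below" means strictly less than the reorder point, and it ranks results by shortfall.

```python
from dataclasses import dataclass


@dataclass
class StockRow:
    # Hypothetical record shape; a real ERP export will differ.
    sku: str
    on_hand: int
    reorder_point: int


def below_reorder_point(rows):
    """Return SKUs strictly below their reorder point, largest shortfall first."""
    low = [r for r in rows if r.on_hand < r.reorder_point]
    low.sort(key=lambda r: r.reorder_point - r.on_hand, reverse=True)
    return [r.sku for r in low]


# Made-up sample data for illustration only.
rows = [
    StockRow("SKU-1", on_hand=40, reorder_point=50),
    StockRow("SKU-2", on_hand=120, reorder_point=100),
    StockRow("SKU-3", on_hand=5, reorder_point=60),
]
print(below_reorder_point(rows))  # ['SKU-3', 'SKU-1']
```
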
| ID | Task | What it tests | Grader |
|---|---|---|---|
| R1 | Stock lookup | Single read | exact_match |
| R2 | Below reorder point | List all SKUs below threshold | set_match |
| R3–R5 | Create PO · lead times · cycle-count | Write paths, joins | action_taken / set_match |
| R6–R7 | Reorder rec · 14-day forecast | Formula, baseline forecast | numeric_tolerance ±20% |
| R8 | Promo-month forecast | Mean anchoring | numeric_tolerance ±25% |
| R9 | Weekly report | Report structure | llm_judge |
| F1 | Daily low-stock sweep | Stockout management | composite (action + 270 s wall-clock + ranked top-3) |
| F2 | Promo reorder w/ forecast | Recommendation quality | regex_present (numeric confidence) |
| F3 | Batch low-stock alerts | Cost of 10 routine alerts | efficiency (≤5k output tokens) |
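
The deterministic graders named in the table could look roughly like the sketch below. The function signatures, the whitespace handling in `exact_match`, the relative (rather than absolute) reading of the ± tolerances, and the confidence regex are all assumptions made for illustration. The table gives only the grader names and thresholds, so this is not the actual harness.

```python
import re
from typing import Iterable


def exact_match(answer: str, expected: str) -> bool:
    # Assumption: surrounding whitespace is ignored.
    return answer.strip() == expected.strip()


def set_match(answer: Iterable[str], expected: Iterable[str]) -> bool:
    # Order and duplicates do not matter, only membership.
    return set(answer) == set(expected)


def numeric_tolerance(answer: float, expected: float, rel_tol: float) -> bool:
    # Assumption: ±20% / ±25% are relative to the expected value.
    if expected == 0:
        return answer == 0
    return abs(answer - expected) <= rel_tol * abs(expected)


def regex_present(text: str, pattern: str) -> bool:
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def within_token_budget(output_tokens: int, budget: int = 5000) -> bool:
    # Mirrors the "≤5k" efficiency threshold on F3.
    return output_tokens <= budget


# Hypothetical pattern for "numeric confidence"; the real one is not given.
CONFIDENCE_PATTERN = r"confidence[^0-9]{0,20}\d{1,3}\s*%"

print(set_match(["SKU-2", "SKU-1"], ["SKU-1", "SKU-2"]))        # True
print(numeric_tolerance(115.0, 100.0, rel_tol=0.20))             # True
print(numeric_tolerance(130.0, 100.0, rel_tol=0.25))             # False
print(regex_present("Confidence: 80%", CONFIDENCE_PATTERN))      # True
```

The `llm_judge` and `composite` graders are left out because they depend on details (judge prompt, action log format, timing source) that the table does not specify.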