Frontier AI Went Bankrupt Betting on Football — What That Means for Your Agent Deployments
Frontier models failed the KellyBench betting test. What this means for AI agents in construction, BIM, and cost forecasting, and for your validation procedures.
Every model lost money. Some lost everything.
General Reasoning, a London AI startup, published its KellyBench report this week: a test in which eight frontier models bet their way through a simulation of the 2023–24 Premier League season. Each model started with a normalized bankroll of £100,000 and had three attempts to turn a profit by betting on match results and goal totals. No model made a profit on average. Many went bankrupt.
This is no edge case. This is a control experiment for precisely the capability that AEC professionals today demand of their AI agents: sequential decision-making under uncertainty, over long time horizons, with shifting input data.
←TODAY: Frontier models fail systematically at dynamic, multi-month decision tasks — KellyBench 2026.
→3012: Autonomous construction agents manage procurement processes and schedules over years; validation standards for long horizons don’t yet exist.
Fulcrum: The gap between benchmark performance and real-world degradation is today’s most dangerous blind spot in agent deployment.
Results in detail: Anthropic Claude Opus 4.6 performed best, with an average ROI of –11.0% and a best attempt of –0.2%. OpenAI GPT-5.4 followed with an average ROI of –13.6%. Google Gemini 3.1 Pro achieved a +33.7% profit on its best attempt but hit zero in another. xAI Grok 4.20 is recorded as a 100% loss on all three attempts: it went bankrupt in one, and the other two did not complete. The authors' verdict, as quoted in the Ars Technica report and the original paper, was clear: the models were “systematically underperforming humans.”
The name KellyBench is no accident: it refers to the Kelly criterion, the mathematically grounded method for optimal stake sizing in risk management, a standard familiar to every quant and experienced project controller. The models had no internet access; all data was fed to them directly. So the task was not information gathering but probabilistic reasoning and risk adaptation over time. The models failed at both.
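For readers who have not worked with it, the Kelly criterion fits in a few lines. The sketch below is illustrative only and is not taken from the KellyBench paper; the function name `kelly_fraction` and the decimal-odds convention are assumptions made here. For a bet that pays net odds b with an estimated win probability p, the classic formula stakes the fraction (b·p − (1 − p)) / b of the bankroll, and nothing when the edge is negative.

```python
def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Kelly stake as a fraction of bankroll (hypothetical helper, not from the paper).

    p_win        -- estimated probability that the bet wins, in [0, 1]
    decimal_odds -- bookmaker decimal odds, e.g. 2.0 pays 1 unit net per unit staked
    """
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must lie in [0, 1]")
    if decimal_odds <= 1.0:
        raise ValueError("decimal_odds must be greater than 1")

    b = decimal_odds - 1.0   # net payout per unit staked
    q = 1.0 - p_win          # probability of losing the stake
    fraction = (b * p_win - q) / b

    # A negative result means the bet has no positive edge: stake nothing.
    return max(fraction, 0.0)


if __name__ == "__main__":
    print(kelly_fraction(0.60, 2.0))  # positive edge at even money
    print(kelly_fraction(0.50, 2.0))  # no edge, so no stake
```

At even money (decimal odds 2.0), a 60% win estimate suggests staking roughly a fifth of the bankroll, while a 50% estimate suggests staking nothing. The formula is only as good as the probability estimate fed into it.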
Ross Taylor, CEO of General Reasoning and a former Meta AI researcher, put the core problem plainly to the Financial Times: “There is so much hype about AI automation, but there’s not a lot of measurement of putting AI into a long time horizon setting.” Standard benchmarks like MMLU or HumanEval test static, individual tasks: no sequential feedback, no adaptation to new data, no capital management over time.
Atelier: Anyone deploying AI agents for schedule planning, cost forecasting, or BIM-based clash detection across multi-month projects validates them today with static benchmarks, the same structural deficit that KellyBench exposes. The high variance within a single model (Gemini ranged from +33.7% to –100% across three attempts) shows that individual pilot successes prove nothing about reliability in production.
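Why a single good run says so little can be shown with a toy simulation. The sketch below is not the KellyBench setup: the function `simulate_season`, the win probability, the odds, and the fixed 10% stake are all hypothetical values chosen for illustration. It runs the identical betting policy over several seeded seasons and reports the spread of outcomes.

```python
import random
import statistics


def simulate_season(seed, n_bets=380, p_win=0.50, decimal_odds=1.95,
                    stake_fraction=0.10, start_bankroll=100_000.0):
    """Season ROI of a fixed-fraction betting policy (all parameters hypothetical)."""
    rng = random.Random(seed)  # seeded so every run is reproducible
    bankroll = start_bankroll
    for _ in range(n_bets):
        stake = bankroll * stake_fraction
        if rng.random() < p_win:
            bankroll += stake * (decimal_odds - 1.0)  # win: collect the net payout
        else:
            bankroll -= stake                         # loss: forfeit the stake
    return bankroll / start_bankroll - 1.0


def summarize(rois):
    """Mean, worst, and best ROI across repeated runs of the same policy."""
    return {
        "mean": statistics.mean(rois),
        "worst": min(rois),
        "best": max(rois),
    }


if __name__ == "__main__":
    # Same policy, twenty different seasons: only the random draws differ.
    rois = [simulate_season(seed) for seed in range(20)]
    print(summarize(rois))
```

Because this policy has a slightly negative edge, most runs will typically end in a loss, yet the gap between the best and the worst run can be large. Judging the policy by its single best season would overstate it, which is the same trap as judging an agent by one successful pilot.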
The spread between models is no small thing: 89 percentage points separate the average ROI of Claude (–11.0%) and Grok (–100%). Whoever treats frontier models as interchangeable for agent workflows carries real operational risk. The paper is not yet peer-reviewed, a caveat Taylor emphasizes, but the direction of the findings is consistent with what practitioners in agentic settings already observe: strong short-term performance, degradation over longer sequences.
The EU AI Act classifies certain financial uses of AI, such as creditworthiness assessment and risk pricing in life and health insurance, as high-risk, with corresponding requirements for robustness and documentation. Even where procurement optimization or risk assessment in planning falls outside those categories, anyone deploying AI agents for such tasks today should examine whether their own validation procedures would meet this standard. KellyBench is not a verdict against AI agents in general; it is a demand for honest measurement.
Bring the KellyBench paper to your next team meeting where you’re evaluating AI agents for long-horizon workflows. Ask concretely: Over what time horizon have we validated, and against what shifting input data?
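One way to make that question concrete is walk-forward (rolling-origin) validation, in which an agent is always evaluated on periods that come after the data it was allowed to see. The sketch below is a minimal illustration under assumed names (`walk_forward_splits`, `initial_train`, `horizon`); it is not a procedure from the KellyBench paper.

```python
from typing import Iterator, Tuple


def walk_forward_splits(n_periods: int, initial_train: int,
                        horizon: int) -> Iterator[Tuple[range, range]]:
    """Yield (seen, evaluated) period index ranges in chronological order.

    n_periods     -- total number of time periods on record (e.g. project months)
    initial_train -- periods the agent may see before the first evaluation
    horizon       -- length of each evaluation window
    """
    if initial_train < 1 or horizon < 1:
        raise ValueError("initial_train and horizon must be at least 1")

    start = initial_train
    while start + horizon <= n_periods:
        # Everything before `start` is history; the next `horizon` periods are unseen.
        yield range(0, start), range(start, start + horizon)
        start += horizon  # the evaluated window becomes history for the next fold


if __name__ == "__main__":
    for seen, evaluated in walk_forward_splits(n_periods=12, initial_train=6, horizon=3):
        print(f"seen periods {seen.start}-{seen.stop - 1}, "
              f"evaluated periods {evaluated.start}-{evaluated.stop - 1}")
```

For a 12-month project log with 6 months of initial history and a 3-month horizon, this yields two folds: months 6 to 8 evaluated after seeing months 0 to 5, then months 9 to 11 evaluated after seeing months 0 to 8. Reporting results per fold, rather than one pooled score, shows whether performance degrades as the horizon extends.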
Source: Ars Technica