About
Why AgentDelta exists
The narrative around AI models leans reflexively toward the latest and greatest, with little scrutiny and rarely any justification for the additional cost. When a model is called “better,” that judgment usually rests on subjective impression rather than on objective metrics.
It is often unclear whether a newer model is genuinely more capable or is simply doing more agentic work: more tool calls, longer workflows, more retries, more self-review. AgentDelta measures the delta between models under controlled, repeatable, auditable conditions, and reports cost, latency, and amplification alongside capability. Together, these are the metrics that should actually drive an upgrade decision.
Source on GitHub →

anthropic-hardest5-4x5
Anthropic (Claude Code) · 5 tasks · 100 runs · default

| Rank | Model | Score | Success | Cost / win | Median time |
|---|---|---|---|---|---|
| ★ | Opus 4.8 | 99 | 100% | $0.24 | 64s |
| #2 | Opus 4.7 | 98 | 100% | $0.27 | 70s |
| #3 | Opus 4.6 | 95 | 100% | $0.37 | 132s |
| #4 | Sonnet 4.6 | 90 | 88% | $0.20 | 89s |

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.
anthropic-longcontext-4x5
Anthropic (Claude Code) · 3 tasks · 60 runs · default

| Rank | Model | Score | Success | Cost / win | Median time |
|---|---|---|---|---|---|
| ★ | Opus 4.8 | 100 | 100% | $0.45 | 128s |
| #2 | Opus 4.7 | 99 | 100% | $0.51 | 133s |
| #3 | Opus 4.6 | 75 | 67% | $0.81 | 150s |
| #4 | Sonnet 4.6 | 65 | 53% | $1.05 | 182s |

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.
anthropic-statemachine-4x10
Anthropic (Claude Code) · 5 tasks · 200 runs · default

| Rank | Model | Score | Success | Cost / win | Median time |
|---|---|---|---|---|---|
| ★ | Opus 4.8 | 100 | 100% | $0.22 | 56s |
| #2 | Opus 4.7 | 99 | 100% | $0.23 | 62s |
| #3 | Opus 4.6 | 96 | 100% | $0.32 | 107s |
| #4 | Sonnet 4.6 | 93 | 94% | $0.21 | 105s |

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.
anthropic-v0.1-concurrency
Anthropic (Claude Code) · 1 task · 12 runs · default

| Rank | Model | Score | Success | Cost / win | Median time |
|---|---|---|---|---|---|
| ★ | Opus 4.7 | 99 | 100% | $0.22 | 70s |
| #2 | Sonnet 4.6 | 99 | 100% | $0.19 | 97s |
| #3 | Opus 4.8 | 98 | 100% | $0.28 | 73s |
| #4 | Opus 4.6 | 97 | 100% | $0.30 | 91s |

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.
anthropic-v0.1-eqbudget
Anthropic (Claude Code) · 2 tasks · 32 runs · default, equal_budget

| Rank | Model | Score | Success | Cost / win | Median time |
|---|---|---|---|---|---|
| ★ | Opus 4.8 | 100 | 100% | $0.25 | 61s |
| #2 | Opus 4.7 | 97 | 100% | $0.32 | 91s |
| #3 | Sonnet 4.6 | 97 | 100% | $0.28 | 133s |
| #4 | Opus 4.6 | 95 | 100% | $0.42 | 129s |

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.
anthropic-v0.1-frontier
Anthropic (Claude Code) · 3 tasks · 24 runs · default

| Rank | Model | Score | Success | Cost / win | Median time |
|---|---|---|---|---|---|
| ★ | Opus 4.8 | 100 | 100% | $0.24 | 55s |
| #2 | Opus 4.6 | 99 | 100% | $0.22 | 75s |
| #3 | Opus 4.7 | 98 | 100% | $0.28 | 79s |
| #4 | Sonnet 4.6 | 97 | 100% | $0.23 | 101s |

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.
anthropic-v0.1-hard
Anthropic (Claude Code) · 3 tasks · 24 runs · default

| Rank | Model | Score | Success | Cost / win | Median time |
|---|---|---|---|---|---|
| ★ | Opus 4.8 | 100 | 100% | $0.25 | 54s |
| #2 | Opus 4.7 | 99 | 100% | $0.25 | 65s |
| #3 | Sonnet 4.6 | 97 | 100% | $0.24 | 128s |
| #4 | Opus 4.6 | 96 | 100% | $0.32 | 136s |

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.
anthropic-v0.1-pilot
Anthropic (Claude Code) · 3 tasks · 24 runs · default

| Rank | Model | Score | Success | Cost / win | Median time |
|---|---|---|---|---|---|
| ★ | Opus 4.8 | 98 | 100% | $0.22 | 52s |
| #2 | Sonnet 4.6 | 98 | 100% | $0.15 | 82s |
| #3 | Opus 4.7 | 98 | 100% | $0.21 | 59s |
| #4 | Opus 4.6 | 95 | 100% | $0.29 | 109s |

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.
deepseek-full50-2x5
DeepSeek (OpenCode) · 50 tasks · 500 runs · default

| Rank | Model | Score | Success | Cost / win | Median time |
|---|---|---|---|---|---|
| ★ | DeepSeek V4 Flash | 98 | 98% | $0.00 | 85s |
| #2 | DeepSeek V4 Pro | 95 | 100% | $0.01 | 100s |

The 1 newer-model gain is not material. Where models tie, the tie is reported, not a manufactured ranking.
gemini-full50-4x1
Google (Gemini CLI) · 50 tasks · 200 runs · default

| Rank | Model | Score | Success | Cost / win | Median time |
|---|---|---|---|---|---|
| ★ | Gemini 3.1 Flash Lite | 96 | 96% | $0.02 | 78s |
| #2 | Gemini 2.5 Pro | 92 | 96% | $0.08 | 85s |
| #3 | Gemini 3.5 Flash | 89 | 96% | $0.27 | 134s |
| #4 | Gemini 2.5 Flash | 83 | 78% | $0.03 | 65s |

3 of 3 newer-model gains are material (survive the Holm-corrected paired test).
kimi-full50-1x1
Moonshot Kimi (OpenCode) · 50 tasks · 50 runs · default

| Rank | Model | Score | Success | Cost / win | Median time |
|---|---|---|---|---|---|
| ★ | Kimi K2.7 Code | 100 | 100% | $0.03 | 51s |

No newer-over-older pairs evaluated in this suite.
openai-full50-4x1
OpenAI (Codex CLI) · 50 tasks · 200 runs · default

| Rank | Model | Score | Success | Cost / win | Median time |
|---|---|---|---|---|---|
| ★ | GPT-5.4 | 95 | 100% | $0.09 | 65s |
| #2 | GPT-5 mini | 94 | 94% | $0.01 | 98s |
| #3 | GPT-5.1 | 91 | 96% | $0.09 | 115s |
| #4 | GPT-5 | 90 | 96% | $0.14 | 140s |

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.
openai-hardest5-4x5
OpenAI (Codex CLI) · 5 tasks · 100 runs · default

| Rank | Model | Score | Success | Cost / win | Median time |
|---|---|---|---|---|---|
| ★ | GPT-5 mini | 97 | 100% | $0.01 | 97s |
| #2 | GPT-5.4 | 95 | 100% | $0.08 | 65s |
| #3 | GPT-5.1 | 93 | 100% | $0.09 | 125s |
| #4 | GPT-5 | 92 | 100% | $0.18 | 174s |

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.
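
The suite summaries above call a newer-model gain "material" when it survives a Holm-corrected paired test. The page does not say which paired test AgentDelta runs or what significance level it uses, so the sketch below shows only the generic Holm step-down correction applied to a list of p-values. The function name `holm_reject`, the default `alpha=0.05`, and the example p-values are illustrative assumptions, not AgentDelta's code or data.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down correction (generic sketch, not AgentDelta's implementation).

    Returns a list of booleans aligned with p_values: True where the
    null hypothesis is rejected, i.e. the comparison would count as material.
    alpha=0.05 is an assumed significance level.
    """
    m = len(p_values)
    # Visit hypotheses from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is compared with alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            # Step-down rule: once one test fails, all larger p-values fail too.
            break
    return reject


# Hypothetical p-values for three newer-vs-older comparisons.
print(holm_reject([0.004, 0.03, 0.02]))
print(holm_reject([0.04, 0.03, 0.02]))
```

In the first call every p-value clears its stepped threshold (0.05/3, 0.05/2, 0.05/1), so all three comparisons are rejected. In the second call the smallest p-value already exceeds 0.05/3, so nothing is rejected, even though each value is below 0.05 on its own.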