AI Unmasked
AgentDelta

Reproducible, objective evaluation of frontier coding agents and model-upgrade deltas.

agentdelta-v0.1
Generated 2026-06-19 11:57 UTC
Baseline Opus 4.6

Why AgentDelta exists

The narrative around AI models leans blindly toward the latest and greatest, with little discernment and rarely any justification for the additional cost. When a model is called “better,” that judgment usually rests on a subjective, ungrounded perception rather than on objective metrics.

It is often not even clear whether a newer model is genuinely more capable or is simply doing more agentic work: more tool calls, longer workflows, more retries, more self-review. AgentDelta measures the delta under controlled, repeatable, auditable conditions. Alongside capability it reports cost, latency, and amplification (how much extra agentic work a model does), since these are the metrics that should actually drive an upgrade decision.

Source on GitHub →
Results first

Does the upgrade actually hold up?

AgentDelta ranks coding-agent systems by objective success on real, repo-level tasks graded by hidden tests, and decides which newer-model gains are material rather than noise. Each suite below is a controlled comparison; the top-ranked model is the one that earned it.
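As a rough illustration of the per-model figures shown in each suite (success, cost / win, med. time), the sketch below aggregates raw run records. The `Run` fields, the reading of cost / win as total spend divided by successful runs, and taking the median over all runs rather than only wins are assumptions made for this example; the page does not define them.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Run:
    """One agent run on one task (hypothetical record layout)."""
    model: str
    passed: bool      # True if the hidden tests passed
    cost_usd: float   # spend for this run
    seconds: float    # wall-clock time for this run


def summarize(runs):
    """Group runs by model and compute success rate, cost per win, median time."""
    by_model = {}
    for run in runs:
        by_model.setdefault(run.model, []).append(run)

    summary = {}
    for model, model_runs in by_model.items():
        wins = sum(run.passed for run in model_runs)
        total_cost = sum(run.cost_usd for run in model_runs)
        summary[model] = {
            "success": wins / len(model_runs),
            # Assumed definition: failed runs still cost money, so their
            # spend is charged against the wins. Undefined with zero wins.
            "cost_per_win": total_cost / wins if wins else None,
            "median_seconds": median(run.seconds for run in model_runs),
        }
    return summary


# Made-up records, for illustration only.
example_runs = [
    Run("model-a", True, 0.20, 60.0),
    Run("model-a", False, 0.40, 100.0),
    Run("model-a", True, 0.30, 80.0),
]
print(summarize(example_runs))
```

Under this reading, a model with cheap individual runs can still show a high cost / win if many of its runs fail, which would explain why cost / win and success do not always move together in the suites below.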

anthropic-hardest5-4x5

Anthropic (Claude Code) · 5 tasks · 100 runs · default
★ Opus 4.8 · 99 · success 100% · cost / win $0.24 · med. time 64s
#2 Opus 4.7 · 98 · success 100% · cost / win $0.27 · med. time 70s
#3 Opus 4.6 · 95 · success 100% · cost / win $0.37 · med. time 132s
#4 Sonnet 4.6 · 90 · success 88% · cost / win $0.20 · med. time 89s

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.

anthropic-longcontext-4x5

Anthropic (Claude Code) · 3 tasks · 60 runs · default
★ Opus 4.8 · 100 · success 100% · cost / win $0.45 · med. time 128s
#2 Opus 4.7 · 99 · success 100% · cost / win $0.51 · med. time 133s
#3 Opus 4.6 · 75 · success 67% · cost / win $0.81 · med. time 150s
#4 Sonnet 4.6 · 65 · success 53% · cost / win $1.05 · med. time 182s

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.

anthropic-statemachine-4x10

Anthropic (Claude Code) · 5 tasks · 200 runs · default
★ Opus 4.8 · 100 · success 100% · cost / win $0.22 · med. time 56s
#2 Opus 4.7 · 99 · success 100% · cost / win $0.23 · med. time 62s
#3 Opus 4.6 · 96 · success 100% · cost / win $0.32 · med. time 107s
#4 Sonnet 4.6 · 93 · success 94% · cost / win $0.21 · med. time 105s

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.

anthropic-v0.1-concurrency

Anthropic (Claude Code) · 1 task · 12 runs · default
★ Opus 4.7 · 99 · success 100% · cost / win $0.22 · med. time 70s
#2 Sonnet 4.6 · 99 · success 100% · cost / win $0.19 · med. time 97s
#3 Opus 4.8 · 98 · success 100% · cost / win $0.28 · med. time 73s
#4 Opus 4.6 · 97 · success 100% · cost / win $0.30 · med. time 91s

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.

anthropic-v0.1-eqbudget

Anthropic (Claude Code) · 2 tasks · 32 runs · default, equal_budget
★ Opus 4.8 · 100 · success 100% · cost / win $0.25 · med. time 61s
#2 Opus 4.7 · 97 · success 100% · cost / win $0.32 · med. time 91s
#3 Sonnet 4.6 · 97 · success 100% · cost / win $0.28 · med. time 133s
#4 Opus 4.6 · 95 · success 100% · cost / win $0.42 · med. time 129s

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.

anthropic-v0.1-frontier

Anthropic (Claude Code) · 3 tasks · 24 runs · default
★ Opus 4.8 · 100 · success 100% · cost / win $0.24 · med. time 55s
#2 Opus 4.6 · 99 · success 100% · cost / win $0.22 · med. time 75s
#3 Opus 4.7 · 98 · success 100% · cost / win $0.28 · med. time 79s
#4 Sonnet 4.6 · 97 · success 100% · cost / win $0.23 · med. time 101s

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.

anthropic-v0.1-hard

Anthropic (Claude Code) · 3 tasks · 24 runs · default
★ Opus 4.8 · 100 · success 100% · cost / win $0.25 · med. time 54s
#2 Opus 4.7 · 99 · success 100% · cost / win $0.25 · med. time 65s
#3 Sonnet 4.6 · 97 · success 100% · cost / win $0.24 · med. time 128s
#4 Opus 4.6 · 96 · success 100% · cost / win $0.32 · med. time 136s

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.

anthropic-v0.1-pilot

Anthropic (Claude Code) · 3 tasks · 24 runs · default
★ Opus 4.8 · 98 · success 100% · cost / win $0.22 · med. time 52s
#2 Sonnet 4.6 · 98 · success 100% · cost / win $0.15 · med. time 82s
#3 Opus 4.7 · 98 · success 100% · cost / win $0.21 · med. time 59s
#4 Opus 4.6 · 95 · success 100% · cost / win $0.29 · med. time 109s

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.

deepseek-full50-2x5

DeepSeek (OpenCode) · 50 tasks · 500 runs · default
★ DeepSeek V4 Flash · 98 · success 98% · cost / win $0.00 · med. time 85s
#2 DeepSeek V4 Pro · 95 · success 100% · cost / win $0.01 · med. time 100s

The 1 newer-model gain is not material. Where models tie, the tie is reported, not a manufactured ranking.

gemini-full50-4x1

Google (Gemini CLI) · 50 tasks · 200 runs · default
★ Gemini 3.1 Flash Lite · 96 · success 96% · cost / win $0.02 · med. time 78s
#2 Gemini 2.5 Pro · 92 · success 96% · cost / win $0.08 · med. time 85s
#3 Gemini 3.5 Flash · 89 · success 96% · cost / win $0.27 · med. time 134s
#4 Gemini 2.5 Flash · 83 · success 78% · cost / win $0.03 · med. time 65s

3 of 3 newer-model gains are material (survive the Holm-corrected paired test).
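For readers unfamiliar with the Holm correction mentioned above, here is a minimal sketch of the standard Holm step-down procedure applied to a set of p-values, one per newer-versus-older comparison. It is not taken from the AgentDelta source: the function name, the `alpha=0.05` threshold, and the example p-values are assumptions, and the page does not say which paired test produces the p-values.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure.

    Returns a list of booleans aligned with p_values; True means the
    comparison is still significant after correcting for multiple tests.
    """
    m = len(p_values)
    # Visit hypotheses from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared against
        # alpha / (m - k + 1), which equals alpha / (m - rank).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            # Once one hypothesis fails, all larger p-values fail too.
            break
    return reject


# Hypothetical p-values for three newer-vs-older pairs.
print(holm_reject([0.001, 0.010, 0.020]))
print(holm_reject([0.010, 0.040, 0.030]))
```

With three comparisons the thresholds are 0.05/3, 0.05/2, and 0.05 in order, so the first call keeps all three gains while the second keeps only the smallest p-value. A gain would count as "material" in this sketch only where the corresponding entry is True.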

kimi-full50-1x1

Moonshot Kimi (OpenCode) · 50 tasks · 50 runs · default
★ Kimi K2.7 Code · 100 · success 100% · cost / win $0.03 · med. time 51s

No newer-over-older pairs evaluated in this suite.

openai-full50-4x1

OpenAI (Codex CLI) · 50 tasks · 200 runs · default
★ GPT-5.4 · 95 · success 100% · cost / win $0.09 · med. time 65s
#2 GPT-5 mini · 94 · success 94% · cost / win $0.01 · med. time 98s
#3 GPT-5.1 · 91 · success 96% · cost / win $0.09 · med. time 115s
#4 GPT-5 · 90 · success 96% · cost / win $0.14 · med. time 140s

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.

openai-hardest5-4x5

OpenAI (Codex CLI) · 5 tasks · 100 runs · default
★ GPT-5 mini · 97 · success 100% · cost / win $0.01 · med. time 97s
#2 GPT-5.4 · 95 · success 100% · cost / win $0.08 · med. time 65s
#3 GPT-5.1 · 93 · success 100% · cost / win $0.09 · med. time 125s
#4 GPT-5 · 92 · success 100% · cost / win $0.18 · med. time 174s

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.
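A note on reading the suite names: the trailing `NxM` suffix appears to encode N models and M repetitions per task, so that runs = N × M × tasks. This is an inference from the numbers on this page, not something the page states, and the helper name below is hypothetical. The sketch checks the pattern against the task and run counts listed above for every suite whose name carries such a suffix.

```python
import re


def expected_runs(suite_name, tasks):
    """Predict the run count from an 'NxM' suffix (assumed: models x repeats)."""
    match = re.search(r"-(\d+)x(\d+)$", suite_name)
    if match is None:
        return None  # suites such as anthropic-v0.1-pilot have no suffix
    models, repeats = int(match.group(1)), int(match.group(2))
    return models * repeats * tasks


# (tasks, runs) as listed on this page.
listed = {
    "anthropic-hardest5-4x5": (5, 100),
    "anthropic-longcontext-4x5": (3, 60),
    "anthropic-statemachine-4x10": (5, 200),
    "deepseek-full50-2x5": (50, 500),
    "gemini-full50-4x1": (50, 200),
    "kimi-full50-1x1": (50, 50),
    "openai-full50-4x1": (50, 200),
    "openai-hardest5-4x5": (5, 100),
}

for name, (tasks, runs) in listed.items():
    print(name, expected_runs(name, tasks) == runs)
```

All eight suffixed suites are consistent with this reading. The `v0.1` suites carry no suffix, so their repetition counts cannot be derived from the name alone.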