AI Unmasked

Why AgentDelta exists

The narrative around AI models favors the latest release by default, with little scrutiny and rarely any justification for the additional cost. When a model is called “better,” that judgment usually rests on subjective impression rather than on objective metrics.

It is often not even clear whether a newer model is genuinely more capable or is simply doing more agentic work: more tool calls, longer workflows, more retries, more self-review. AgentDelta measures the delta under controlled, repeatable, auditable conditions and reports cost, latency, and amplification alongside capability. Together, these are the metrics that should actually drive an upgrade decision.

Source on GitHub →

Results first

Does the upgrade actually hold up?

AgentDelta ranks coding-agent systems by objective success on real, repo-level tasks graded by hidden tests, and decides which newer-model gains are material rather than noise. Each suite below is a controlled comparison; the top-ranked model is the one that earned it.
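The three per-model figures shown in each suite (success, cost / win, med. time) can be read as simple aggregates over individual runs. The following is a minimal sketch under stated assumptions, not AgentDelta's actual code: it assumes cost / win means total spend across all runs divided by the number of passing runs, and that the median time is taken over all runs. The `Run` record and `summarize` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Run:
    """One graded attempt at a task (hypothetical record layout)."""
    passed: bool      # did the hidden tests pass?
    cost_usd: float   # spend for this run
    seconds: float    # wall-clock duration


def summarize(runs):
    """Aggregate runs into success rate, cost per win, and median time."""
    wins = sum(r.passed for r in runs)
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "success": wins / len(runs),
        # Assumption: failed runs still cost money, so they count toward spend.
        "cost_per_win": total_cost / wins if wins else float("inf"),
        # Assumption: median is taken over all runs, not only passing ones.
        "median_seconds": median(r.seconds for r in runs),
    }
```

Under this reading, a model with a lower success rate can still show a low cost / win if its individual runs are cheap enough, which is why the two figures are worth reading side by side.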

anthropic-hardest5-4x5

Anthropic (Claude Code) · 5 tasks · 100 runs · default

★ Opus 4.8 · 99

success 100% · cost / win $0.24 · med. time 64s

#2 Opus 4.7 · 98

success 100% · cost / win $0.27 · med. time 70s

#3 Opus 4.6 · 95

success 100% · cost / win $0.37 · med. time 132s

#4 Sonnet 4.6 · 90

success 88% · cost / win $0.20 · med. time 89s

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.

anthropic-longcontext-4x5

Anthropic (Claude Code) · 3 tasks · 60 runs · default

★ Opus 4.8 · 100

success 100% · cost / win $0.45 · med. time 128s

#2 Opus 4.7 · 99

success 100% · cost / win $0.51 · med. time 133s

#3 Opus 4.6 · 75

success 67% · cost / win $0.81 · med. time 150s

#4 Sonnet 4.6 · 65

success 53% · cost / win $1.05 · med. time 182s

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.

anthropic-statemachine-4x10

Anthropic (Claude Code) · 5 tasks · 200 runs · default

★ Opus 4.8 · 100

success 100% · cost / win $0.22 · med. time 56s

#2 Opus 4.7 · 99

success 100% · cost / win $0.23 · med. time 62s

#3 Opus 4.6 · 96

success 100% · cost / win $0.32 · med. time 107s

#4 Sonnet 4.6 · 93

success 94% · cost / win $0.21 · med. time 105s

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.

anthropic-v0.1-concurrency

Anthropic (Claude Code) · 1 task · 12 runs · default

★ Opus 4.7 · 99

success 100% · cost / win $0.22 · med. time 70s

#2 Sonnet 4.6 · 99

success 100% · cost / win $0.19 · med. time 97s

#3 Opus 4.8 · 98

success 100% · cost / win $0.28 · med. time 73s

#4 Opus 4.6 · 97

success 100% · cost / win $0.30 · med. time 91s

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.

anthropic-v0.1-eqbudget

Anthropic (Claude Code) · 2 tasks · 32 runs · default, equal_budget

★ Opus 4.8 · 100

success 100% · cost / win $0.25 · med. time 61s

#2 Opus 4.7 · 97

success 100% · cost / win $0.32 · med. time 91s

#3 Sonnet 4.6 · 97

success 100% · cost / win $0.28 · med. time 133s

#4 Opus 4.6 · 95

success 100% · cost / win $0.42 · med. time 129s

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.

anthropic-v0.1-frontier

Anthropic (Claude Code) · 3 tasks · 24 runs · default

★ Opus 4.8 · 100

success 100% · cost / win $0.24 · med. time 55s

#2 Opus 4.6 · 99

success 100% · cost / win $0.22 · med. time 75s

#3 Opus 4.7 · 98

success 100% · cost / win $0.28 · med. time 79s

#4 Sonnet 4.6 · 97

success 100% · cost / win $0.23 · med. time 101s

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.

anthropic-v0.1-hard

Anthropic (Claude Code) · 3 tasks · 24 runs · default

★ Opus 4.8 · 100

success 100% · cost / win $0.25 · med. time 54s

#2 Opus 4.7 · 99

success 100% · cost / win $0.25 · med. time 65s

#3 Sonnet 4.6 · 97

success 100% · cost / win $0.24 · med. time 128s

#4 Opus 4.6 · 96

success 100% · cost / win $0.32 · med. time 136s

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.

anthropic-v0.1-pilot

Anthropic (Claude Code) · 3 tasks · 24 runs · default

★ Opus 4.8 · 98

success 100% · cost / win $0.22 · med. time 52s

#2 Sonnet 4.6 · 98

success 100% · cost / win $0.15 · med. time 82s

#3 Opus 4.7 · 98

success 100% · cost / win $0.21 · med. time 59s

#4 Opus 4.6 · 95

success 100% · cost / win $0.29 · med. time 109s

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.

deepseek-full50-2x5

DeepSeek (OpenCode) · 50 tasks · 500 runs · default

★ DeepSeek V4 Flash · 98

success 98% · cost / win $0.00 · med. time 85s

#2 DeepSeek V4 Pro · 95

success 100% · cost / win $0.01 · med. time 100s

The 1 newer-model gain is not material. Where models tie, the tie is reported, not a manufactured ranking.

gemini-full50-4x1

Google (Gemini CLI) · 50 tasks · 200 runs · default

★ Gemini 3.1 Flash Lite · 96

success 96% · cost / win $0.02 · med. time 78s

#2 Gemini 2.5 Pro · 92

success 96% · cost / win $0.08 · med. time 85s

#3 Gemini 3.5 Flash · 89

success 96% · cost / win $0.27 · med. time 134s

#4 Gemini 2.5 Flash · 83

success 78% · cost / win $0.03 · med. time 65s

3 of 3 newer-model gains are material (survive the Holm-corrected paired test).
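The Holm correction named above can be sketched as follows. This is a minimal illustration of the standard Holm step-down procedure, assuming the paired test yields one p-value per newer-over-older pair. The function name `holm_reject`, the 0.05 significance level, and the example p-values are hypothetical and are not taken from AgentDelta's code or data.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down: return one boolean per hypothesis, True if rejected.

    A True entry corresponds to a gain that survives the correction.
    """
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The smallest p-value faces alpha / m, the next alpha / (m - 1), and so on.
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            # Step-down rule: after the first failure, nothing larger is rejected.
            break
    return reject


# Hypothetical p-values for three newer-over-older pairs (not real results).
example_p = [0.004, 0.030, 0.012]
decisions = holm_reject(example_p)
```

With three pairs, the smallest p-value must clear 0.05 / 3, the next 0.05 / 2, and the largest 0.05. This is less conservative than dividing every threshold by three while still controlling the family-wise error rate.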

kimi-full50-1x1

Moonshot Kimi (OpenCode) · 50 tasks · 50 runs · default

★ Kimi K2.7 Code · 100

success 100% · cost / win $0.03 · med. time 51s

No newer-over-older pairs evaluated in this suite.

openai-full50-4x1

OpenAI (Codex CLI) · 50 tasks · 200 runs · default

★ GPT-5.4 · 95

success 100% · cost / win $0.09 · med. time 65s

#2 GPT-5 mini · 94

success 94% · cost / win $0.01 · med. time 98s

#3 GPT-5.1 · 91

success 96% · cost / win $0.09 · med. time 115s

#4 GPT-5 · 90

success 96% · cost / win $0.14 · med. time 140s

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.

openai-hardest5-4x5

OpenAI (Codex CLI) · 5 tasks · 100 runs · default

★ GPT-5 mini · 97

success 100% · cost / win $0.01 · med. time 97s

#2 GPT-5.4 · 95

success 100% · cost / win $0.08 · med. time 65s

#3 GPT-5.1 · 93

success 100% · cost / win $0.09 · med. time 125s

#4 GPT-5 · 92

success 100% · cost / win $0.18 · med. time 174s

None of the 3 newer-model gains are material. Where models tie, the tie is reported, not a manufactured ranking.
