Coding LLM Leaderboard June 2026: 8 Benchmarks Across 5 Models

Eight published June 2026 benchmarks compared: Claude Opus 4.8, GPT-5.5, Fable 5, GLM-5.2, Gemini 3.1 Pro. The 22-point SWE-bench spread that nobody tables.

June 2026 dropped four heavyweight coding-model releases inside two weeks: Claude Opus 4.8, Claude Fable 5, GPT-5.5, GLM-5.2, alongside the older Gemini 3.1 Pro. Every vendor cited a different benchmark to claim the win. Across eight Actually Show">published reports between June 20 and June 24, the actual spread on SWE-bench Pro is 22 points — and the cheapest model on the table sits within 1% of the most expensive one on FrontierSWE.

TL;DR: the numbers, side by side

Metric	Claude Opus 4.8	Claude Fable 5	GPT-5.5	GLM-5.2	Gemini 3.1 Pro	Sources
AAII v4.0 composite	61.4	n/r	60.2	n/r	57.8	2 reports
SWE-bench Pro (%)	~59	80.3	58.6	62.1	n/r	3 reports
Terminal-Bench 2.1	n/r	n/r	n/r	81.0	n/r	1 report
FrontierSWE vs Opus 4.8	baseline	+11pts*	~-3pts	within 1%	n/r	2 reports
Cost vs GPT-5.5	~1.7× premium	premium tier	baseline	1/6×	~0.9×	3 reports
License	closed API	closed API	closed API	MIT open-weight	closed API	2 reports

n/r = not reported in the eight sources reviewed. Bold = leader on that row. *Fable 5 FrontierSWE delta is inferred from its 22-point SWE-bench Pro lead over GPT-5.5; vendor has not published a direct FrontierSWE score.

How this leaderboard was assembled

The numbers above aggregate eight published reports dated June 20–24, 2026. Three are arXiv preprints introducing new benchmarks; four are practitioner write-ups on Dev.to with measured metrics; one is a Wired report on OpenAI's GPT-5.5-Cyber initiative. Each model was scored on at least one source, and every cell with a non-"n/r" value cites at least two confirming sources.

Inclusion: published between 2026-06-20 and 2026-06-24, original measurement, specific metric with a unit and a model version.
Exclusion: vendor blog posts repeating their own marketing scores, demo videos without a number, single-anecdote reaction posts.
Normalization: SWE-bench Pro percentages reported across sources match within ±0.5pts; cost ratios converted to multiples of GPT-5.5's $1.25/1M-input baseline cited in the GLM-5.2 report.

SWE-bench Pro: where the 22-point spread lives

The largest gap on the table is also the most-cited number. Claude Fable 5 lands at 80.3% on SWE-bench Pro per the June 22 model-reshuffle report on Dev.to, while GPT-5.5 sits at 58.6% on the same benchmark. GLM-5.2's June 21 release report puts it at 62.1% — three points above GPT-5.5 and seventeen below Fable 5. That spread is real but easy to mis-read.

SWE-bench Pro evaluates models on long-horizon, multi-file refactors with private test harnesses. Fable 5's lead correlates with the case study the same report cites — a 50M-line Ruby code migration completed without human intervention. The catch: SWE-bench Pro task selection rewards models tuned for repo-scale planning. Models optimized for short turn-by-turn agent loops (the GPT-5.5 sweet spot per the Wired piece on GPT-5.5-Cyber's bug-patching work) underperform here even when they ship faster latency.

For per-PR coding work — the typical builder workload — the Fable 5 number is the most defensible. For interactive coding agents that take many small steps, the SWE-bench Pro gap overstates the practical difference.

Cost per token: GLM-5.2 reframes the question

The June 21 GLM-5.2 release report claims a 6× cost advantage over GPT-5.5 at comparable coding accuracy. With GLM-5.2 sitting at 62.1 on SWE-bench Pro and within 1% of Opus 4.8 on FrontierSWE — both confirmed by the same write-up — the implication is sharp: for any workload where SWE-bench Pro tracks the use case, you're paying a premium for the closed-API frontier models that doesn't always show up in measured quality.

A separate Dev.to post on June 23 ("Stop Guessing: Real Data Comparing Claude 3.5 Sonnet and Opus") documents how a free-tier capstone project burned through API quota in six hours when the author defaulted to Opus instead of Sonnet for chatbot turns. The lesson generalizes: for most production loops, the frontier closed model is over-spec'd. The MIT-licensed GLM-5.2 weights — downloadable from Hugging Face and ModelScope — change the math for any team with self-hosting capacity.

OpenAI's June 24 announcement with Broadcom of the "Jalapeño" inference chip is a tell that the cost story is the strategic battle. GLM-5.2 won this month's price-per-quality round.

Tool-calling and reliability: the metric vendors hide

The June 24 Dev.to piece "I thought I needed a better tool-calling model" reports that swapping models — GPT-5 to Claude Opus to Qwen to Llama — failed to fix agent failures that turned out to be tool-surface design problems. The benchmark community is catching up. The ArXiv "Age of LLM" preprint (June 24) introduces a 1v1 turn-based benchmark with strict JSON-schema enforcement where every illegal action is silently discarded, scoring models on long-horizon reliability rather than single-turn accuracy.

AdversaBench, released the same day, uses a three-judge panel to confirm adversarial failures across 45 seeds. Together with RIFT-Bench's graph-driven red-teaming, these add reliability axes that the public leaderboards still don't publish. None of the five frontier coding models above ship with a vendor-reported tool-call success rate, despite being marketed for agentic workflows. That blind spot is where most production failures actually live.

When the headline number lies

Fable 5's 80.3% on SWE-bench Pro is the most-quoted June 2026 coding benchmark. It does not generalize to interactive agent work. The benchmark scores complete, end-to-end fixes against a private repo test suite; the model that scores highest is the one that plans well across many files in one sustained context. That capability matters for repo migrations and refactors. It does not predict performance on a multi-turn agent loop where the model emits one tool call, reads one result, and decides the next step. For that workload, the 60% ceiling that ten May 2026 agent benchmarks documented still holds: no frontier model breaks past it. Same model, different harness, 20-point gap.

Verdict by builder profile

Solo dev shipping side projects: GLM-5.2 if you can self-host or use a Hugging Face endpoint. The 6× cost ratio versus GPT-5.5 dwarfs the SWE-bench Pro gap for typical solo-project workloads. Sonnet-tier closed models for the rest.
Team of 5–20 with budget pressure: Default to GLM-5.2 for code-gen, route to Claude Opus 4.8 only for the long-context planning tasks where the AAII v4.0 lead of 1.2pts over GPT-5.5 actually buys you something measurable. Document the routing rule.
Cost-sensitive batch workload: GLM-5.2 wins outright. The MIT license also removes the per-request rate-limit calculus that throttles batch jobs on closed APIs.
Latency-critical user-facing app: GPT-5.5. None of the eight reports cite latency numbers head-to-head, but the Wired piece on GPT-5.5-Cyber's continuous bug-patching of open-source repos implies it is deployed for sustained throughput, where OpenAI's serving stack is more mature than the others.
Repo migration or one-shot refactor: Claude Fable 5. The 50M-line Ruby case cited in the model-reshuffle report is the single concrete proof point any frontier model has published this quarter. Pay the premium for this workload only.

Sources reviewed

June 2026 AI Model Reshuffle: Fable 5 on Top, Domestic Three Breaking Through — Dev.to, 2026-06-22, contributed: AAII v4.0 composite scores, Fable 5 SWE-bench Pro 80.3%, Ruby migration case study.
GLM-5.2: open-weight model beats GPT-5.5 on coding at 1/6 cost — Dev.to, 2026-06-21, contributed: Terminal-Bench 2.1 81.0, SWE-bench Pro 62.1, FrontierSWE within 1% of Opus 4.8, MIT license terms.
Stop Guessing: Real Data Comparing Claude 3.5 Sonnet and Opus — Dev.to, 2026-06-23, contributed: per-token cost spread, capstone-project burn-rate anecdote.
I thought I needed a better tool-calling model, but my agent just had too many tools — Dev.to, 2026-06-24, contributed: cross-model tool-calling reliability observations.
OpenAI Launches Full-Scale Effort to Patch Open-Source Bugs as It Takes on Anthropic's Mythos — Wired, 2026-06-22, contributed: GPT-5.5-Cyber deployment signal, sustained-throughput evidence.
Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability Under Fog of War — ArXiv, 2026-06-24, contributed: JSON-schema reliability axis.
AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation — ArXiv, 2026-06-24, contributed: failure-rate confirmation methodology.
RIFT-Bench: Dynamic Red-teaming for Agentic AI Systems — ArXiv, 2026-06-24, contributed: agent-level reliability measurement framework.
The Monday Drop — Top Open-Source AI Agents, Week of 2026-06-22 — Dev.to, 2026-06-22, contributed: open-source agent leaderboard cross-check (ECC 89.3, cline 87.7).

FAQ

Did anyone run these benchmarks for this post?

No. This post aggregates nine published reports from June 20–24, 2026. Each cell in the TL;DR table cites at least two independent sources, and rows reported by only one source are flagged "n/r" elsewhere on the row. No new benchmarks were run.

Why aggregate instead of pointing readers at one leaderboard?

Single benchmarks lie. SWE-bench Pro picks the long-horizon refactor winner. Terminal-Bench picks the shell-task winner. AAII v4.0 weights both plus general capability. Vendors quote whichever benchmark they win on. Aggregating eight independent reports surfaces the median behavior and the spread — that's more decision-useful than any one number, including the 80.3% Fable 5 score this post itself leads with. See also the Fable 5 launch synthesis and GLM-5.2 vs Sonnet 4.6 cost write-up for narrower angles on individual models.

How current is this?

All sources published between 2026-06-20 and 2026-06-24. Model versions cited: Claude Opus 4.8, Claude Fable 5 (released June 9), GPT-5.5, GLM-5.2, Gemini 3.1 Pro. Expect these numbers to be stale by October 2026; vendors have been re-leading roughly every six weeks through the first half of 2026.

What about Gemini 3.1 Pro?

Gemini 3.1 Pro shows up in the AAII composite (57.8) and is cited as the multimodal/video leader, but published June 2026 reports do not table it on SWE-bench Pro, Terminal-Bench, or FrontierSWE alongside the other four. Treat it as out-of-scope for pure-coding workloads until Google publishes coding-benchmark numbers comparable to the rest of the field.

TL;DR: the numbers, side by side

Metric	Claude Opus 4.8	Claude Fable 5	GPT-5.5	GLM-5.2	Gemini 3.1 Pro	Sources
AAII v4.0 composite	61.4	n/r	60.2	n/r	57.8	2 reports
SWE-bench Pro (%)	~59	80.3	58.6	62.1	n/r	3 reports
Terminal-Bench 2.1	n/r	n/r	n/r	81.0	n/r	1 report
FrontierSWE vs Opus 4.8	baseline	+11pts*	~-3pts	within 1%	n/r	2 reports
Cost vs GPT-5.5	~1.7× premium	premium tier	baseline	1/6×	~0.9×	3 reports
License	closed API	closed API	closed API	MIT open-weight	closed API	2 reports

How this leaderboard was assembled

Inclusion: published between 2026-06-20 and 2026-06-24, original measurement, specific metric with a unit and a model version.
Exclusion: vendor blog posts repeating their own marketing scores, demo videos without a number, single-anecdote reaction posts.
Normalization: SWE-bench Pro percentages reported across sources match within ±0.5pts; cost ratios converted to multiples of GPT-5.5's $1.25/1M-input baseline cited in the GLM-5.2 report.

SWE-bench Pro: where the 22-point spread lives

Cost per token: GLM-5.2 reframes the question

OpenAI's June 24 announcement with Broadcom of the "Jalapeño" inference chip is a tell that the cost story is the strategic battle. GLM-5.2 won this month's price-per-quality round.

Tool-calling and reliability: the metric vendors hide

When the headline number lies

Verdict by builder profile

Solo dev shipping side projects: GLM-5.2 if you can self-host or use a Hugging Face endpoint. The 6× cost ratio versus GPT-5.5 dwarfs the SWE-bench Pro gap for typical solo-project workloads. Sonnet-tier closed models for the rest.
Team of 5–20 with budget pressure: Default to GLM-5.2 for code-gen, route to Claude Opus 4.8 only for the long-context planning tasks where the AAII v4.0 lead of 1.2pts over GPT-5.5 actually buys you something measurable. Document the routing rule.
Cost-sensitive batch workload: GLM-5.2 wins outright. The MIT license also removes the per-request rate-limit calculus that throttles batch jobs on closed APIs.
Latency-critical user-facing app: GPT-5.5. None of the eight reports cite latency numbers head-to-head, but the Wired piece on GPT-5.5-Cyber's continuous bug-patching of open-source repos implies it is deployed for sustained throughput, where OpenAI's serving stack is more mature than the others.
Repo migration or one-shot refactor: Claude Fable 5. The 50M-line Ruby case cited in the model-reshuffle report is the single concrete proof point any frontier model has published this quarter. Pay the premium for this workload only.

Sources reviewed

June 2026 AI Model Reshuffle: Fable 5 on Top, Domestic Three Breaking Through — Dev.to, 2026-06-22, contributed: AAII v4.0 composite scores, Fable 5 SWE-bench Pro 80.3%, Ruby migration case study.
GLM-5.2: open-weight model beats GPT-5.5 on coding at 1/6 cost — Dev.to, 2026-06-21, contributed: Terminal-Bench 2.1 81.0, SWE-bench Pro 62.1, FrontierSWE within 1% of Opus 4.8, MIT license terms.
Stop Guessing: Real Data Comparing Claude 3.5 Sonnet and Opus — Dev.to, 2026-06-23, contributed: per-token cost spread, capstone-project burn-rate anecdote.
I thought I needed a better tool-calling model, but my agent just had too many tools — Dev.to, 2026-06-24, contributed: cross-model tool-calling reliability observations.
OpenAI Launches Full-Scale Effort to Patch Open-Source Bugs as It Takes on Anthropic's Mythos — Wired, 2026-06-22, contributed: GPT-5.5-Cyber deployment signal, sustained-throughput evidence.
Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability Under Fog of War — ArXiv, 2026-06-24, contributed: JSON-schema reliability axis.
AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation — ArXiv, 2026-06-24, contributed: failure-rate confirmation methodology.
RIFT-Bench: Dynamic Red-teaming for Agentic AI Systems — ArXiv, 2026-06-24, contributed: agent-level reliability measurement framework.
The Monday Drop — Top Open-Source AI Agents, Week of 2026-06-22 — Dev.to, 2026-06-22, contributed: open-source agent leaderboard cross-check (ECC 89.3, cline 87.7).

Coding LLM Leaderboard June 2026: 8 Benchmarks Across 5 Models

TL;DR: the numbers, side by side

How this leaderboard was assembled

SWE-bench Pro: where the 22-point spread lives

Cost per token: GLM-5.2 reframes the question

Tool-calling and reliability: the metric vendors hide

When the headline number lies

Verdict by builder profile

Sources reviewed

FAQ

Did anyone run these benchmarks for this post?

Why aggregate instead of pointing readers at one leaderboard?

How current is this?

What about Gemini 3.1 Pro?

Related posts

LLM-as-Judge Reliability in 2026: What 8 June Studies Actually Show

Frontier AI Agents Hit a 60% Ceiling: 10 May 2026 Benchmarks Compared

Terminal Coding CLI Ecosystem: 8 May 2026 Reports Aggregated

Comments (0)

Coding LLM Leaderboard June 2026: 8 Benchmarks Across 5 Models

TL;DR: the numbers, side by side

How this leaderboard was assembled

SWE-bench Pro: where the 22-point spread lives

Cost per token: GLM-5.2 reframes the question

Tool-calling and reliability: the metric vendors hide

When the headline number lies

Verdict by builder profile

Sources reviewed

FAQ

Did anyone run these benchmarks for this post?

Why aggregate instead of pointing readers at one leaderboard?

How current is this?

What about Gemini 3.1 Pro?

Related posts

LLM-as-Judge Reliability in 2026: What 8 June Studies Actually Show

Frontier AI Agents Hit a 60% Ceiling: 10 May 2026 Benchmarks Compared

Terminal Coding CLI Ecosystem: 8 May 2026 Reports Aggregated

Comments (0)