What benchmark did GPT-5.4-mini hit on June 12 announcements?

GPT-5.4-mini scored 71.2% on SWE-bench Verified and landed within 4 points of the GPT-5.4 flagship on Terminal-Bench, while costing approximately one-eighth the price. It also maintained tool-use coherence through an average of 18 steps, up from the historical mini-tier degradation point around step 8 or 9.

How did Anthropic change Claude Opus 4.7 pricing on June 12?

Anthropic kept Claude Opus 4.7 at $5 input and $25 output per million tokens — unchanged from the 4.6 release — but extended prompt caching discounts to a 90% cache read rate. This significantly reduces effective costs for long-context workloads that repeatedly reference the same documents or system prompts.

What changed about Gemini 3.1 Pro long-context pricing structure?

Google eliminated the previous surcharge applied above 200K tokens, collapsing Gemini 3.1 Pro's tiered long-context pricing into a flat $2 input / $12 output rate across the full 1M token window. This makes Gemini 3.1 Pro substantially more economical for RAG pipelines and document-heavy workloads at scale.

What is Terminal-Bench and why does it matter for developers?

Terminal-Bench evaluates whether a model can complete real shell-based engineering tasks — cloning repos, reading stack traces, editing files, running tests, and iterating across multi-step trajectories. It's a strong proxy for inner agent loop capability, making it more relevant than many academic benchmarks for production coding agent deployments.

Should developers still default to flagship models for agent workloads?

According to the analysis, the flagship-default is now actively wrong for somewhere between 40% and 70% of typical agent workloads. The June 12 capability jump in mini-tier models means teams should rerun their evaluation harnesses and routing logic, particularly for classification, extraction, and inner agent loop tasks previously reserved for frontier models.

How should teams rerun evaluations without excessive engineering overhead?

The article recommends pulling updated numbers from official model cards and OpenRouter's pricing index, then targeting the specific benchmark axes — SWE-bench, Terminal-Bench, and long-context recall — most relevant to your workload. Scoping re-evaluation to tasks affected by the three June 12 deltas avoids a full harness rewrite while still updating critical routing assumptions.

How to

The Big Model Comparisons Story: What June 12’s News Means for Developers

Markos Symeonides

June 13, 2026

“`html

[IMAGE_PLACEHOLDER_HEADER]

⚡ TL;DR — Key Takeaways

What it is: A technical breakdown of the June 12 AI model announcements from OpenAI, Anthropic, and Google, and what the combined pricing and capability shifts mean for developer architecture decisions in 2026.
Who it’s for: Backend and AI engineers running multi-model production systems, coding agents, RAG pipelines, or customer-facing assistants who need to re-evaluate their model-routing and cost assumptions.
Key takeaways: GPT-5.4-mini now scores 71.2% on SWE-bench Verified and holds tool-use coherence to step 18; Claude Opus 4.7 extends 90% prompt cache read discounts; Gemini 3.1 Pro flattens long-context pricing to $2/$12 across the full 1M token window — invalidating most routing logic built before March.
Pricing/Cost: Claude Opus 4.7 holds at $5 input / $25 output per million tokens; Gemini 3.1 Pro collapses its long-context surcharge to a flat $2/$12 rate; GPT-5.4-mini runs at roughly one-eighth the cost of GPT-5.4 flagship.
Bottom line: The “use the biggest model” default is now wrong for 40–70% of typical agent workloads; small models got smarter faster than large models got cheaper, and teams should rerun evaluation harnesses against updated benchmarks before their next architecture review.

✦
Get 40K Prompts, Guides & Tools — Free
→

✓ Instant access✓ No spam✓ Unsubscribe anytime

June 12 Reset the Model Selection Calculus

[IMAGE_PLACEHOLDER_SECTION_1]

On June 12, three major AI industry players — OpenAI, Anthropic, and Google — released significant updates within roughly an eight-hour window. Anthropic rolled out updated pricing tiers for Claude Opus 4.7, OpenAI published new benchmarks for GPT-5.4-mini that surpassed last quarter’s flagship on Terminal-Bench, and Google flattened Gemini 3.1 Pro’s long-context pricing structure. While each announcement alone might have been routine, their combined impact has fundamentally shifted the economics and architecture of multi-model AI systems.

Developers running coding agents, retrieval-augmented generation (RAG) pipelines, or customer-facing assistants must now reconsider assumptions about cost, latency, and model capabilities. The default “use the biggest model” strategy is no longer optimal for a large fraction of workloads — up to 40-70% by some estimates. This article explains what changed, why it matters for developers, and how to efficiently rerun your evaluation harnesses to capture the new reality.

We draw data directly from official model cards and OpenRouter pricing indexes as of June 2026, but recommend verifying numbers before making migration decisions.

What Actually Shipped on June 12

[IMAGE_PLACEHOLDER_SECTION_2]

The June 12 announcements delivered three key updates:

OpenAI’s GPT-5.4-mini Benchmark Leap: The mini-tier variant of GPT-5.4 scored 71.2% on SWE-bench Verified and achieved near-flagship Terminal-Bench performance, maintaining stepwise tool coherence to step 18. This marks a big jump in small-model capability at roughly one-eighth the flagship cost.
Anthropic’s Claude Opus 4.7 Pricing and Cache Discount: Maintained pricing at $5 input/$25 output per million tokens but extended prompt caching discounts to a 90% cache read rate, slashing effective costs for long-context workloads.
Google’s Gemini 3.1 Pro Pricing Flattening: Eliminated the surcharge on tokens beyond 200K, flattening long-context pricing to a $2 input/$12 output flat rate across a full 1 million token window — a game changer for scaling document-heavy applications.

These updates invalidate many routing assumptions baked into systems built before March 2026. The interplay between improved small-model capabilities and cache-friendly pricing requires reevaluation of model routing logic for cost-efficiency and quality.

The Small-Model Capability Jump

The Terminal-Bench benchmark focuses on a model’s ability to perform real engineering tasks in a shell environment: cloning repositories, debugging with stack traces, editing code, running tests, and iterating over multiple steps. Historically, mini-tier models faltered beyond 8-9 steps, limiting their utility for agent inner loops.

Now, GPT-5.4-mini maintains coherence through 18 steps on average, nearly matching flagship hallucination resistance. This shifts the paradigm: mini-tier models can now reliably drive inner agent loops, not just classification or extraction tasks. For detailed OpenAI specs, see their models documentation.

Prompt Caching Becomes the Real Pricing Story

Raw token pricing barely changed on frontier models, but effective cost dropped sharply when factoring in prompt caching. Claude Opus 4.7’s 90% cache read discount means long system prompts (e.g., 180K tokens in coding agents) cost roughly 10% of February prices when reused across sessions.

For multi-turn agent workloads reusing context 30-200 times per session, Claude’s cache-optimized pricing undercuts even GPT-5.2’s uncached costs, flipping routing decisions teams made earlier this year.

For engineering trade-offs and cost-quality decisions behind caching, see our related article The Big AI Coding Agents Story: What June 08’s News Means for Developers.

The New Price-Performance Frontier, In Numbers

[IMAGE_PLACEHOLDER_SECTION_3]

Get Free Access to 40,000+ AI Prompts

Join 40,000+ AI professionals. Get instant access to our curated Notion Prompt Library with prompts for ChatGPT, Claude, Codex, Gemini, and more — completely free.

Get Free Access Now →

No spam. Instant access. Unsubscribe anytime.

The following table compares input/output pricing, context window sizes, benchmark scores, and cache discounts across leading models. Prices are per million tokens, benchmarks from official model cards as of June 2026.

Model	Input $/M	Output $/M	Context	SWE-bench Verified (%)	MMLU-Pro (%)	Cache Discount
gpt-5.4	$2.50	$20	400K	78.4%	89.1%	50% read
gpt-5.4-mini	$0.30	$2.40	400K	71.2%	84.7%	50% read
gpt-5.4-nano	$0.08	$0.60	200K	52.1%	76.3%	50% read
gpt-5.4-pro	$12	$80	400K	82.0%	91.4%	50% read
gpt-5.5	$5	$30	1.05M	80.1%	90.2%	75% read
claude-opus-4.7	$5	$25	500K	79.8%	88.9%	90% read
claude-sonnet-4.6	$2	$10	500K	73.6%	85.4%	90% read
claude-haiku-4.5	$0.40	$2	300K	61.0%	80.1%	90% read
gemini-3.1-pro-preview	$2	$12	1M	74.2%	87.0%	75% read
gemini-3-flash	$0.20	$1.50	1M	58.4%	78.5%	75% read

Interpret this table focusing on price-performance inversions rather than outright winners:

Mini-tier models like GPT-5.4-mini and Claude Sonnet 4.6 occupy a cost-effective price-performance niche, with routing choices hinging on whether workloads benefit from Anthropic’s 90% cache discount.
GPT-5.5 with a 1.05M context window encroaches on Claude Opus 4.7’s domain for long documents but loses out on cache economics for repeated reads.
Gemini 3 Flash offers the cheapest path to 1M context, but its SWE-bench score suggests it’s less suitable for code agents.

The “Mini Beats Last Year’s Flagship” Pattern

A recurring pattern across product cycles is that the latest mini-tier model outperforms the previous generation’s flagship on key benchmarks at a fraction of the cost. For example, GPT-5.4-mini outperforms GPT-5.1 and closely matches GPT-5.2, while costing 10-20x less. The same holds for Claude Haiku 4.5 versus Claude Opus 4.5 from last year.

Developers still routing requests to two-release-old models to save money are effectively paying more for worse quality. Migrating to current mini-tier models often pays off in under a week of production traffic.

For a detailed tutorial on leveraging new GPT customization features to optimize model usage, see our internal guide Master

Markos Symeonides

The Complete GPT-5.6 Migration Masterclass: Moving from GPT-5.5 to Sol, Terra, or Luna

Posted in How to

Reading Time: 24 minutes

Comprehensive migration guide for developers and teams moving from GPT-5.5 to GPT-5.6. Cover API endpoint changes, model name updates, prompt format differences, new parameters and capabilities, handl

OpenAI’s $2.5 Billion Ad Revenue Bet: How ChatGPT Ads Are Reshaping Digital Marketing in 2026

Posted in How to

Reading Time: 18 minutes

Deep analysis of OpenAI’s advertising ambitions. ChatGPT ads hit $100M ARR in under 2 months after launch. OpenAI forecasting $2.5B in ad revenue for 2026 and $100B by 2030. Cover the ad format (nativ

25 ChatGPT-5.5 Prompts for HR Professionals: Recruitment, Onboarding, Performance Reviews, and Employee Communications

Posted in How to

Reading Time: 27 minutes

25 ready-to-use prompts organized into sections: Recruitment & Talent Acquisition (job descriptions, screening criteria, interview questions, offer letters), Onboarding (welcome materials, training pl

How to Build AI Agents on Amazon Bedrock with GPT-5.6: Step-by-Step Developer Tutorial

Posted in How to

Reading Time: 21 minutes

Step-by-step tutorial for developers on building AI agents using GPT-5.6 (Sol/Terra/Luna) on Amazon Bedrock, which is now GA. Cover setup, authentication, prompt caching (90% savings), agent architect

The Big Model Comparisons Story: What June 12’s News Means for Developers

June 12 Reset the Model Selection Calculus

What Actually Shipped on June 12

The Small-Model Capability Jump

Prompt Caching Becomes the Real Pricing Story

The New Price-Performance Frontier, In Numbers

Get Free Access to 40,000+ AI Prompts

The “Mini Beats Last Year’s Flagship” Pattern

Get Free Access to 40,000+ AI Prompts for ChatGPT, Claude & Codex

More on this

The Complete GPT-5.6 Migration Masterclass: Moving from GPT-5.5 to Sol, Terra, or Luna

OpenAI’s $2.5 Billion Ad Revenue Bet: How ChatGPT Ads Are Reshaping Digital Marketing in 2026

25 ChatGPT-5.5 Prompts for HR Professionals: Recruitment, Onboarding, Performance Reviews, and Employee Communications

How to Build AI Agents on Amazon Bedrock with GPT-5.6: Step-by-Step Developer Tutorial