How does GPT-5.4 compare to Claude Opus 4.7 on benchmarks?

GPT-5.4 scores 74.9% on SWE-bench Verified and leads on tool-calling reliability at 99.7% schema adherence. However, the article acknowledges GPT-5.4 still loses to Claude Opus 4.7 and Gemini 3.1 Pro in certain tasks. For hard coding and math, GPT-5.4-pro (81.4% SWE-bench) closes that gap significantly.

What is the GPT-5.4 context window and does it degrade?

GPT-5.4 standard and pro offer a 512K context window. Unlike GPT-5.2, which degraded in quality past 200K tokens, GPT-5.4 maintains performance across the full window. GPT-5.4-mini supports 400K and nano supports 256K, though nano's instruction-following degrades noticeably past 32K tokens.

When should developers choose GPT-5.4-pro over GPT-5.4 standard?

Choose GPT-5.4-pro ($15/$60 per million tokens) for hard coding problems, advanced mathematics, and research synthesis requiring its 81.4% SWE-bench score. For standard multi-step reasoning and most production workloads, the standard tier at $2.50/$12.50 delivers sufficient quality at roughly one-sixth the cost.

How does GPT-5.4 prompt caching work and what does it cost?

Prompt caching is automatic and requires no opt-in configuration. Cached input tokens are billed at 25% of the standard input rate. For agent loops that replay the same system prompt thousands of times per hour, this mechanism alone can reduce total inference costs by 40–60% compared to uncached workloads.

What multimodal capabilities does GPT-5.4 support in 2026?

OpenAI released the GPT-5.4-image-2 multimodal endpoint on April 21, 2026, extending the GPT-5.4 family's capabilities beyond text. The article treats this as a distinct endpoint within the GPT-5.4 surface rather than a core feature of the standard variant.

Why did GPT-5.4-nano lose effectiveness past 32K context tokens?

GPT-5.4-nano is optimized for sub-150ms first-token latency and near-realtime classification tasks. Its attention mechanism does not maintain effective recall of subtle system-prompt constraints when long user messages push the system prompt toward the edges of its 256K window, making it unsuitable as a reasoning or agentic model.

How to

Deep Dive: GPT-5.4 Complete Guide u2014 Every Feature, Benchmark, and Use Case in 2026

Markos Symeonides

June 30, 2026

“`html

⚡ TL;DR — Key Takeaways

What it is: GPT-5.4 is OpenAI’s March 2026 production-default model in the GPT-5 family, offering four variants (nano, mini, standard, pro) with a 512K context window, 74.9% SWE-bench Verified score, and automatic prompt caching at 25% of standard input rates.
Who it’s for: Engineering teams and developers building production AI systems who need a reliable, cost-efficient LLM baseline — especially those running agent loops, RAG pipelines, multi-step reasoning workflows, or parallel tool-calling at scale.
Key takeaways: GPT-5.4 uses 38% fewer reasoning tokens than GPT-5.2, delivers 99.7% schema adherence on multi-tool calls, supports prompt caching that cuts costs 40–60%, and fits between GPT-5.4-mini ($0.55/$2.20) and GPT-5.5 ($5/$30) in the price-to-capability stack.
Pricing/Cost: GPT-5.4 standard costs $2.50 input / $12.50 output per million tokens; nano runs $0.15/$0.60; mini $0.55/$2.20; pro $15/$60. Cached input tokens are billed at 25% of the standard input rate automatically.
Bottom line: GPT-5.4 is the pragmatic default for most 2026 productions — cheaper than GPT-5.5 and Claude Opus 4.7 for comparable tasks, more reliable on tool-use than GPT-5.2, and the benchmark most teams use before provider commitment.

GPT-5.4 in 2026: What Actually Shipped, and Why It Matters

Released in March 2026, GPT-5.4 quickly established itself as OpenAI’s go-to model for production AI deployments. Sitting at the balanced intersection of performance, cost, and reliability, it delivers substantial gains over its predecessors, GPT-5.2 and GPT-5.3, making it the preferred choice for enterprises and developers worldwide.

Key highlights include a 512K token context window — a significant increase that genuinely supports extensive, complex document understanding without the degradation issues seen in earlier models beyond 200K tokens. Benchmark performance is strong: 74.9% on SWE-bench Verified and 92.3% on HumanEval demonstrate its versatility across code and natural language tasks. Moreover, latency improvements with sub-400ms median time-to-first-token enhance user experience in real-time applications.

GPT-5.4 offers four different variants—nano, mini, standard, and pro—that cater to varying needs, trading off speed, accuracy, and cost efficiency. This family approach empowers engineering teams to select the precise model that fits their workload profile and budget.

The release also marked a shift in reliability and efficiency. Compared to GPT-5.2, GPT-5.4 uses 38% fewer reasoning tokens to complete the same tasks, courtesy of a redesigned reasoning controller. Tool-use reliability reached a new high with 99.7% schema adherence for multi-tool calls, a critical enhancement for agents orchestrating multiple APIs or functions concurrently.

Another important innovation lies in prompt caching, which automatically bills cached tokens at just 25% of standard input costs. This drastically lowers expenses in agent loops and workflows with repetitive system prompts.

In sum, GPT-5.4 represents a mature, battle-tested foundation for a variety of production-grade AI applications: from conversational agents and RAG pipelines to complex coding assistance and autonomous workflows.

The Four GPT-5.4 Variants and Where Each One Fits

OpenAI’s deployment of GPT-5.4 as a family enables nuanced selections tailored to use case needs. Understanding each variant’s strengths maximizes efficiency and controls cost:

Variant	Input Price (per 1M tokens)	Output Price (per 1M tokens)	Context Window	SWE-bench Score	Best For
GPT-5.4-nano	$0.15	$0.60	256K tokens	41.2%	Classification, intent detection, routing, real-time edge tasks
GPT-5.4-mini	$0.55	$2.20	400K tokens	58.7%	Document synthesis, summarization, simple retrieval-augmented generation (RAG)
GPT-5.4 (Standard)	$2.50	$12.50	512K tokens	74.9%	Production default, multi-step reasoning, complex task workflows
GPT-5.4-pro	$15.00	$60.00	512K tokens	81.4%	High-stakes coding, advanced math, research synthesis requiring extra accuracy

GPT-5.4-nano emphasizes speed and cost efficiency for classification and routing. It is ideal for simple, low-context tasks where rapid decisions at the edge are necessary — such as intent classification or lightweight filtering. However, it degrades quickly when asked to handle detailed instructions over long contexts exceeding 32K tokens.

GPT-5.4-mini strikes a balance suitable for many RAG pipelines. It is cost-effective and supports a 400K token window — ample for synthesizing multiple retrieved documents. Its weaknesses lie in handling intricate multi-tool workflows; error rates rise when sequencing tool calls beyond three steps.

The standard GPT-5.4 variant is the sweet spot for production applications with multi-step reasoning demands and complex tool integration. Its 512K token window remains performant where earlier models lagged. Additionally, it offers the highest tool-use reliability at an accessible price point.

GPT-5.4-pro is a premium option targeted at teams requiring cutting-edge accuracy in demanding scenarios — advanced code generation, scientific research, or financial modeling. While it provides notable benchmark improvements over the standard model, the cost is substantial, necessitating usage only where the improved accuracy justifies the expense.

Code-Specialized Codex Variants

Separate from the general-purpose GPT-5.4 family are the codex-tuned variants optimized for coding contexts. These models, such as GPT-5.4-codex, excel in repository-scale editing, terminal command generation, and multi-file refactoring. They score substantially higher on coding benchmarks (Terminal-Bench ~67.8%) versus the standard GPT-5.4 (~54.1%).

Choosing between plain GPT-5.4 and its codex counterparts depends on workload type. For purely coding-focused applications, codex variants provide clear advantages. For mixed workflows blending code generation with natural language tasks, the general-purpose GPT-5.4 remains a flexible and cost-effective solution.

Structured Outputs, Tool Use, and the New Response API

A major breakthrough in GPT-5.4 is the introduction of guaranteed structured output generation. Previously, schema adherence was probabilistic, often necessitating complex parsing retries or manual validation. Now, OpenAI employs decoding-time constraints enforcing JSON schema compliance structurally, eliminating malformed API calls or tool invocations.

This improvement is critical for developers building multi-tool agents that coordinate external APIs through well-defined JSON requests. Ensuring 100% compliance reduces runtime errors and increases system robustness.

Example: Tool Use with Strict JSON Schema Enforcement

from openai import OpenAI
client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "search_orders",
        "description": "Search customer orders by date and status",
        "parameters": {
            "type": "object",
            "properties": {
                "start_date": {"type": "string", "format": "date"},
                "end_date": {"type": "string", "format": "date"},
                "status": {
                    "type": "string",
                    "enum": ["pending", "shipped", "delivered", "returned"]
                }
            },
            "required": ["start_date", "end_date"],
            "additionalProperties": false
        },
        "strict": true    # Enforces strict schema adherence at decode time
    }
}]

response = client.responses.create(
    model="gpt-5.4",
    input=[
        {"role": "developer", "content": "You are an order-lookup assistant. Confirm dates before searching."},
        {"role": "user", "content": "Show me returns from last quarter"}
    ],
    tools=tools,
    reasoning={"effort": "medium"},
    response_format={"type": "json_schema", "json_schema": {...}}
)

Key nuances:

Role Hierarchy: The developer role replaces system, signaling instructions that take precedence over user overrides, enhancing prompt injection resistance.
Reasoning Effort: Levels range from minimal to extreme, with the latter available only on the pro variant, allowing granular control over token consumption vs. accuracy.
Strict Schema: Setting strict: true activates the decoder-level enforcement, preventing invalid outputs or API argument errors.

Prompt Caching — A Hidden Cost Saver

GPT-5.4 includes transparent, automatic prompt caching that dramatically reduces input token costs for repeated prefixes. Cached tokens are billed at just 25% of standard rates, with expiration configurable from 5 minutes to 1 hour.

This mechanism is a game changer for agent-based systems where system prompts, few-shot examples, and tool definitions remain constant across many calls. For example, a conversational agent invoking GPT-5.4 100 times per minute with a 4,000-token system prompt can save roughly $180 daily on one deployment.

Adopting prompt caching effectively requires structuring your input so stable content is frontloaded, followed by dynamic user queries. This pattern maximizes cache hits and reduces overall inference spend.

Building a Production Agent with GPT-5.4: A Walkthrough

GPT-5.4 excels as the brain behind powerful multi-tool agents that perform complex workflows like intent classification, planning, API orchestration, and response synthesis. Below is a proven architecture making these scalable and cost-effective.

Intent Classification at the Edge: Use GPT-5.4-nano to classify and route transactions quickly. At low latency and minimal cost, it filters queries that do not require full agent reasoning, reducing heavier model invocations.
Planning with Medium Reasoning Effort: Pass the user query and toolset to GPT-5.4 standard with medium reasoning. Have the model output a structured JSON plan of steps before any actions, facilitating error recovery and transparency.
Parallel Tool Execution: When multiple steps are independent, dispatch calls concurrently to reduce overall latency. GPT-5.4 reliably signals parallelizability, enhancing throughput.
Result Verification: Validate final outputs with GPT-5.4-mini using tailored verification prompts. This step mitigates hallucinations or policy violations before user delivery, preserving trust.
Streaming Final Responses: Stream token outputs to users even if internal steps are batch processed, improving perceived responsiveness by ~600ms on average.
Comprehensive Logging: Store complete call traces from intent through verification keyed by session or request ID. This aids debugging and continuous improvement.

Handling Long Context Effectively

The 512K token window of GPT-5.4 (standard and pro) is a genuine leap forward. Tasks requiring cross-document knowledge, such as large codebase refactoring or multi-document analysis, benefit significantly from this expanded context.

While prompt caching helps control cost on repeated long-context queries, it remains prudent to combine RAG systems with selective retrieval for best economics. Dumping the entire corpus to context is generally inefficient unless inter-chunk relationships are vital.

How GPT-5.4 Compares to Claude Opus 4.7 and Gemini 3.1 Pro

Modern production AI stacks typically route workloads between multiple models optimized for specific scenarios. Here’s a comprehensive comparison based on public benchmarks and field telemetry for 2026:

Get Free Access to 40,000+ AI Prompts for ChatGPT, Claude & Codex

Subscribe for instant access to the largest curated Notion Prompt Library for AI workflows.

Markos Symeonides

OpenAI Codex vs Gemini 3.1 Pro for Solo Developers: Which Should You Choose in 2026?

Posted in How to

Reading Time: 6 minutes

“`html ⚡ TL;DR — Key Takeaways What it is: A comprehensive 2026 analysis comparing OpenAI’s GPT-5.x Codex variants (gpt-5.1-codex through gpt-5.3-codex) against Google’s Gemini 3.1 Pro preview, focused on solo developer coding workflows. Who it’s for: Indie developers, solo engineers,…

How to Build a a Research Assistant with Cursor in 2026: Step-by-Step

Posted in How to

Reading Time: 6 minutes

How to Build a Research Assistant with Cursor in 2026: Step-by-Step ⚡ TL;DR — Key Takeaways What it is: A detailed, comprehensive guide to building a high-performance, agentic research assistant inside Cursor in 2026 using MCP servers, Claude Opus 4.7,…

Setting Up OpenAI Codex for Indie Shipping u2014 Complete Developer Walkthrough

Posted in How to

Reading Time: 9 minutes

“`html ⚡ TL;DR — Key Takeaways What it is: A comprehensive 2026 developer walkthrough for integrating OpenAI’s latest GPT-5.x-codex models into indie AI coding tools, editor plugins, and web applications. Who it’s for: Solo developers and small indie teams building…

Inside OpenAI’s Agentic AI Research Paper: 5 Key Findings That Reveal How AI Work Is Evolving from Chat to Autonomous Execution

Posted in How to

Reading Time: 19 minutes

Inside OpenAI’s Agentic AI Research Paper: 5 Key Findings That Reveal How AI Work Is Evolving from Chat to Autonomous Execution Author: Markos Symeonides, ChatGPT AI Hub Table of Contents Executive Summary Paper Overview and Methodology Finding 1: Codex Replaced…