⚡ TL;DR — Key Takeaways
- What’s inside: 12 production-tested agent architecture patterns documented in 35 pages, with reference stacks, failure modes, and cost profiles for each
- Who it’s for: ML engineers and architects building agent systems past the prototype stage — assumes familiarity with tool calls and LLM orchestration
- What you’ll leave with: a decision tree for pattern selection, a composition guide for stacking 3-5 patterns, and concrete model recommendations per stage (GPT-5.1, Opus 4.7, Gemini 3.1)
- Length and depth: ~35 pages, 12 pattern chapters plus a primitives chapter and a composition guide, all in prose with named real-world systems (Claude Code, Cursor, Devin, Fin, Harvey)
- Cost: free with chatgptaihub.com signup — part of the premium subscriber library

Why most agent architectures fail in production
If you have shipped an agent to production in the last eighteen months, you already know the dirty secret of the agentic AI boom: the gap between a working demo and a working system is enormous, and most teams cross it by accident or not at all. The first wave of agent frameworks — AutoGPT, BabyAGI, the early ReAct loops — optimized for the demo. They chained LLM calls inside a single Python process, stored state in memory, and assumed the model would self-correct. In production, they burned tokens on stuck loops, broke on malformed JSON, and were impossible to debug because nothing was replayable.
By late 2025, the teams shipping agents at scale converged on a different shape. Claude Code, Cursor’s background agents, Devin, Replit Agent v3, Intercom Fin, Harvey, Perplexity — they all look architecturally similar, because they all solved the same set of operational problems. State lives outside the model. Every step is a durable event. Tools have typed contracts. Verification stops error compounding. Humans are checkpoints, not fallbacks.
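Two of these practices, durable events and typed tool contracts, can be illustrated with a minimal sketch. The class names (`Event`, `EventLog`, `ToolContract`) are hypothetical and the log is an in-memory stand-in; a production system would presumably back it with a database or a workflow engine.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Event:
    run_id: str
    step: int
    kind: str      # e.g. "tool_call" or "tool_result"
    payload: dict


class EventLog:
    """Append-only log of serialized events (in-memory stand-in for durable storage)."""

    def __init__(self) -> None:
        self._lines: list[str] = []

    def append(self, event: Event) -> None:
        # Serializing on write keeps state outside the model and the process heap.
        self._lines.append(json.dumps(asdict(event), sort_keys=True))

    def replay(self, run_id: str) -> list[Event]:
        events = [Event(**json.loads(line)) for line in self._lines]
        return [e for e in events if e.run_id == run_id]


@dataclass(frozen=True)
class ToolContract:
    name: str
    required_args: dict[str, type]
    fn: Callable[..., Any]

    def call(self, args: dict) -> Any:
        # Reject malformed model output before it reaches the tool.
        unknown = set(args) - set(self.required_args)
        if unknown:
            raise ValueError(f"{self.name}: unexpected arguments {sorted(unknown)}")
        for key, expected in self.required_args.items():
            if key not in args:
                raise ValueError(f"{self.name}: missing argument {key!r}")
            if not isinstance(args[key], expected):
                raise TypeError(f"{self.name}: {key!r} must be {expected.__name__}")
        return self.fn(**args)
```

Because every step is written to the log before the next one starts, a failed run can be replayed step by step instead of being reconstructed from scattered logs.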
Those convergent practices form a catalog of about a dozen reusable patterns. Our new flagship playbook, Agentic Workflow Design Patterns, documents all twelve in a single 35-page reference. It is written for the engineer who is past the demo and trying to figure out which architectural moves are worth the complexity for their specific problem.
Below is a taste of four of the patterns. The full playbook is free with a chatgptaihub.com signup, and is the most concrete operational reference we have published.
Pattern preview: Plan-Execute-Verify is doing most of the heavy lifting
If you look at the internals of every serious coding agent in 2026 — Claude Code, Cursor’s agent mode, Devin, Replit Agent — they all converge on a three-stage split: a planner produces a structured plan, an executor runs each step, a verifier confirms each step before moving on. This is Pattern 5 in the playbook, and it is the single highest-leverage architectural decision in the catalog.
The split matters because the three stages have different requirements. Planning rewards deep reasoning over broad context, so this is where you spend on GPT-5.1 Pro or Claude Opus 4.7. Execution rewards speed and precise tool use, where Claude Sonnet 4.6 or GPT-5.1 is usually the right choice. Verification rewards focused, criteria-driven checks, and Haiku 4.5 or Gemini 3.1 Flash handles most cases. Running the strongest model end-to-end is the default, and it is the wrong default. The Plan-Execute-Verify (PEV) split typically cuts cost by 50 to 70 percent while improving reliability.
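The control flow of the three-stage split can be sketched as below. This is a minimal illustration, not the playbook's reference implementation: `Step`, `run_pev`, and the `STAGE_MODELS` mapping are hypothetical names, the model labels are display names taken from the text rather than real API identifiers, and the three stages are passed in as plain callables so any model client could sit behind them.

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative stage-to-model mapping; labels only, not API model IDs.
STAGE_MODELS = {
    "plan": "Claude Opus 4.7",
    "execute": "Claude Sonnet 4.6",
    "verify": "Haiku 4.5",
}


@dataclass(frozen=True)
class Step:
    description: str
    acceptance: str  # criterion the verifier checks the output against


Planner = Callable[[str], list[Step]]
Executor = Callable[[Step], str]
Verifier = Callable[[Step, str], bool]


def run_pev(goal: str, plan: Planner, execute: Executor,
            verify: Verifier, max_attempts: int = 2) -> list[str]:
    """Run each planned step, accepting its output only once the verifier passes it."""
    results: list[str] = []
    for step in plan(goal):
        for _ in range(max_attempts):
            output = execute(step)
            if verify(step, output):
                results.append(output)
                break
        else:
            # No attempt passed verification; a real system might replan here.
            raise RuntimeError(f"step failed verification: {step.description}")
    return results
```

Keeping the stages behind separate callables is what makes it possible to put a different model, or a deterministic check, behind each one.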
The chapter walks through plan structure (the contract between planner and executor), replanning policy, and the three flavors of verifier: deterministic, LLM-as-judge, and human. It includes the math on why verification lifts end-to-end success from 43 percent to 85+ percent on an eight-step task, and the trade-offs of which verifier flavor to spend on for which workload.
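The arithmetic behind figures of this size can be reproduced under simple assumptions, which may differ from the playbook's own derivation: steps succeed independently, an unverified step succeeds 90 percent of the time, the verifier catches 90 percent of failures and never rejects a correct result, and each caught failure gets one retry. The function names and all three rates are illustrative.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


def per_step_with_verifier(base: float, catch_rate: float, retries: int = 1) -> float:
    """Per-step success when caught failures are retried.

    Assumes the verifier has no false rejections and that a retry
    succeeds with the same base probability as the first attempt.
    """
    p = base
    for _ in range(retries):
        # Succeed outright, or fail, get caught, and succeed on the retry path.
        p = base + (1 - base) * catch_rate * p
    return p


unverified = end_to_end_success(0.90, 8)                        # about 0.43
verified_step = per_step_with_verifier(0.90, catch_rate=0.90)   # about 0.981
verified = end_to_end_success(verified_step, 8)                 # about 0.86
```

The point of the exercise is that small per-step gains compound: moving each step from 90 to roughly 98 percent doubles the success rate of the whole eight-step run.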
If your agent currently uses one model for everything and a single retry loop, this pattern alone is probably worth a sprint of refactoring.

Pattern preview: Router-Workers is how multi-skill agents stay shippable
Pattern 3 in the playbook is Router-Workers, and it is the dominant pattern for any agent that handles more than one type of task. A small, fast router model classifies the incoming request and dispatches to one of N specialized worker agents. The router does not solve the problem. It picks the right worker and hands off the context.
Intercom Fin v3 uses this with a Haiku 4.5 router over twelve classes, dispatching to specialized refund, shipping, product, and account-management workers. Zendesk, Linear, and Shopify Sidekick run similar shapes. The reason the pattern wins is operational, not algorithmic: it lets every worker be independently versioned, evaluated, deployed, and even staffed by a different team. Your refund worker team can swap models without coordinating with your product-question worker team. You can canary new workers with weighted routing. You can deprecate a worker by routing zero traffic to it.
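Weighted routing for canaries and deprecation can be sketched in a few lines. This is an assumption-laden illustration: `pick_worker` and the worker names are hypothetical, and hashing the request ID is just one way to keep assignment stable across retries.

```python
import hashlib


def pick_worker(request_id: str, weights: dict[str, float]) -> str:
    """Deterministically map a request to a worker version by traffic weight."""
    active = {name: w for name, w in weights.items() if w > 0}
    if not active:
        raise ValueError("no worker has a positive weight")
    total = sum(active.values())
    # Hash the request ID into [0, total) so the same request always
    # lands on the same worker version.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64 * total
    cumulative = 0.0
    for name, weight in sorted(active.items()):
        cumulative += weight
        if point < cumulative:
            return name
    return name  # guard against floating-point rounding at the upper edge


# Hypothetical rollout: 10% canary, with the oldest version deprecated at weight 0.
REFUND_WEIGHTS = {"refund_v0": 0.0, "refund_v1": 0.9, "refund_v2": 0.1}
```

Setting a weight to zero removes the worker from consideration without touching any other route, which is the deprecation path described above.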
The chapter covers the design constraints that make this work: how many classes is the right number (six to fifteen, mutually exclusive, with a mandatory unknown class), which model to use for the router (the cheapest one that hits 95 percent on your eval set — not your strongest), how to handle confidence and ambiguity, and the discipline of treating worker contracts as a typed interface rather than a shared codebase. We also walk through the per-route cost tracking practice that most teams miss until their bill surprises them.
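A minimal sketch of these constraints, assuming a classifier that returns a label and a confidence score: the `Router` class, the 0.8 threshold, and the `(reply, cost)` worker return shape are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from typing import Callable

UNKNOWN = "unknown"

Classifier = Callable[[str], tuple[str, float]]   # request -> (label, confidence)
Worker = Callable[[str], tuple[str, float]]       # request -> (reply, cost in USD)


class Router:
    def __init__(self, classify: Classifier, workers: dict[str, Worker],
                 min_confidence: float = 0.8) -> None:
        if UNKNOWN not in workers:
            raise ValueError("a worker for the 'unknown' class is mandatory")
        self.classify = classify
        self.workers = workers
        self.min_confidence = min_confidence
        self.cost_by_route: dict[str, float] = defaultdict(float)

    def dispatch(self, request: str) -> str:
        label, confidence = self.classify(request)
        # Low confidence or an unrecognized label falls through to 'unknown'
        # instead of being forced into the nearest class.
        if label not in self.workers or confidence < self.min_confidence:
            label = UNKNOWN
        reply, cost = self.workers[label](request)
        self.cost_by_route[label] += cost  # per-route cost tracking
        return reply
```

The router only ever sees the worker's call signature, so each worker can change its model or prompt without the router being redeployed.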
If you are about to add a third task type to a single-agent architecture, read this chapter first. It will save you months of refactoring.
Pattern preview: Sandboxed Tool Execution is the security pattern under everything
Pattern 7 in the playbook is one of the operational patterns that does not get much airtime in product demos but underpins every agent that writes code, modifies files, or touches production systems. The principle is simple: the agent never executes code in the same trust domain as the orchestrator. Ephemeral containers or microVMs (E2B, Modal, Daytona, Firecracker-backed stacks) give you per-run filesystem isolation, capped CPU and memory, network egress policies, and a kill-switch on wall-clock or token-cost overruns.
The chapter covers the two security mistakes that show up in almost every team’s first attempt, unbounded network egress and shared credentials, and the policies that fix them: default-deny egress with an explicit allowlist of domains the agent may reach, and scoped, short-lived tokens minted per operation rather than long-lived secrets handed to the agent. The blast radius of a compromised agent is then bounded by the most powerful token it ever held, which should be small.
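Both policies can be sketched at the application level. This is illustrative only: in practice egress rules would usually be enforced by the sandbox's network layer rather than in Python, and the allowlisted hosts, scope strings, and function names below are hypothetical.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical allowlist; everything not listed is denied.
ALLOWED_HOSTS = frozenset({"pypi.org", "files.pythonhosted.org"})


def egress_allowed(url: str, allowed: frozenset = ALLOWED_HOSTS) -> bool:
    """Default-deny: permit a request only if its exact host is allowlisted."""
    host = urlsplit(url).hostname
    return host is not None and host in allowed


@dataclass(frozen=True)
class ScopedToken:
    value: str
    scope: str          # the single operation this token permits
    expires_at: float   # Unix timestamp


def mint_token(scope: str, ttl_seconds: float = 60.0,
               now: Optional[float] = None) -> ScopedToken:
    """Mint a fresh token for one operation instead of sharing a long-lived secret."""
    issued = time.time() if now is None else now
    return ScopedToken(secrets.token_urlsafe(32), scope, issued + ttl_seconds)


def token_permits(token: ScopedToken, scope: str,
                  now: Optional[float] = None) -> bool:
    current = time.time() if now is None else now
    return token.scope == scope and current < token.expires_at
```

Matching on the exact hostname matters: a substring check would let `pypi.org.evil.example` through.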
It also covers the cost-control story, which is where sandboxes most often fail expensively in practice. A coding agent that gets into a tight test loop can burn through compute spend faster than LLM spend. The chapter details the four caps every production sandbox enforces (wall-clock, CPU-time, memory, step count), the run-hygiene practice of destroying containers between runs, and the long-term storage decision for replay artifacts.
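The four caps can be expressed as a small budget check that an orchestrator polls between steps. This is a sketch under assumptions: `Caps`, `Usage`, and `exceeded` are hypothetical names, and hard enforcement of CPU and memory limits would normally live in the container or microVM runtime, with a check like this acting as the orchestrator-side kill-switch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Caps:
    wall_clock_s: float
    cpu_s: float
    memory_mb: float
    steps: int


@dataclass
class Usage:
    wall_clock_s: float = 0.0
    cpu_s: float = 0.0
    memory_mb: float = 0.0   # peak memory observed so far
    steps: int = 0


def exceeded(usage: Usage, caps: Caps) -> list[str]:
    """Return the names of every cap the run has gone over."""
    checks = {
        "wall_clock_s": usage.wall_clock_s > caps.wall_clock_s,
        "cpu_s": usage.cpu_s > caps.cpu_s,
        "memory_mb": usage.memory_mb > caps.memory_mb,
        "steps": usage.steps > caps.steps,
    }
    return [name for name, over in checks.items() if over]


def should_kill(usage: Usage, caps: Caps) -> bool:
    return bool(exceeded(usage, caps))
```

The step-count cap is the one that catches a tight test loop, since such a loop can stay within memory and per-step time limits while still running indefinitely.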
This pattern composes with almost every other pattern in the playbook. PEV executes steps in sandboxes. ReAct makes tool calls into sandboxes. Even Router-Workers sometimes routes into a sandboxed worker. Teams that treat sandboxing as a first-class platform — with its own SRE rotation and its own metrics — are the teams whose agents scale past the prototype phase.

Pattern preview: Memory architectures are three problems, not one
Pattern 8 is the one most teams botch on the first attempt, because they treat memory as a single problem. It is three problems. Short-term memory holds the current run’s working context — tool results, intermediate reasoning, current goal — and lives in the workflow engine’s event log. Long-term memory holds facts about the user or domain that should persist across runs and lives in a structured store. Episodic memory holds summaries or embeddings of past interactions the agent can retrieve when relevant and lives in a vector store with metadata.
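The three layers can be sketched as three separate containers with different keys and lifetimes. Everything here is a simplification for illustration: `AgentMemory` is a hypothetical name, the stores are in-memory, and word overlap stands in for the vector similarity a real episodic store would use.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    # Short-term: working context for one run, keyed by run ID.
    short_term: dict[str, list[str]] = field(default_factory=dict)
    # Long-term: durable facts keyed by (user ID, fact name).
    long_term: dict[tuple[str, str], str] = field(default_factory=dict)
    # Episodic: (user ID, summary) pairs from past interactions.
    episodic: list[tuple[str, str]] = field(default_factory=list)

    def record_event(self, run_id: str, event: str) -> None:
        self.short_term.setdefault(run_id, []).append(event)

    def remember_fact(self, user_id: str, name: str, value: str) -> None:
        self.long_term[(user_id, name)] = value

    def add_episode(self, user_id: str, summary: str) -> None:
        self.episodic.append((user_id, summary))

    def recall_episodes(self, user_id: str, query: str, k: int = 3) -> list[str]:
        """Return up to k of the user's episodes, ranked by word overlap with the query."""
        words = set(query.lower().split())
        scored = [
            (len(words & set(summary.lower().split())), summary)
            for uid, summary in self.episodic
            if uid == user_id
        ]
        relevant = [item for item in scored if item[0] > 0]
        ranked = sorted(relevant, key=lambda item: item[0], reverse=True)
        return [summary for _, summary in ranked[:k]]
```

Note that the three layers differ in access pattern as well as lifetime: short-term is read in order, long-term is looked up by key, and episodic is searched by relevance.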
The chapter walks through each layer’s right schema, retrieval pattern, and write strategy — the last of which is the genuinely hard problem. What does the agent decide to remember, when, and in what form? We document the three write strategies that work in production (explicit user-driven, agent-driven with verification, and summary writes), with examples from how Mem0, Zep, Letta, ChatGPT memory, and Anthropic’s Projects memory each handle the choice.
It also covers the privacy story, which is non-negotiable in 2026. Long-term memory contains user data, which means deletion requests, data residency, and consent revocation. The architecture has to support all three from day one — every memory record needs a user ID, a created-at, a source pointer, and a deletion API that purges from both the structured store and the vector index. Teams that bolt this on later spend months untangling it. A few end up in regulatory trouble.
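The record schema and the two-store deletion path can be sketched as follows. The names `MemoryRecord` and `MemoryStore` are hypothetical, and the two dictionaries stand in for a real structured store and vector index.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MemoryRecord:
    record_id: str
    user_id: str
    created_at: datetime
    source: str   # pointer to where the fact came from, e.g. a session or message ID
    text: str


class MemoryStore:
    def __init__(self) -> None:
        self.records: dict[str, MemoryRecord] = {}         # structured store
        self.vector_index: dict[str, list[float]] = {}     # embedding per record ID

    def write(self, record: MemoryRecord, embedding: list[float]) -> None:
        self.records[record.record_id] = record
        self.vector_index[record.record_id] = embedding

    def delete_user(self, user_id: str) -> int:
        """Purge every record for a user from both stores; return how many were removed."""
        doomed = [rid for rid, rec in self.records.items() if rec.user_id == user_id]
        for rid in doomed:
            del self.records[rid]
            self.vector_index.pop(rid, None)
        return len(doomed)
```

Keying the vector index by the same record ID as the structured store is what makes the purge a single pass rather than a search for orphaned embeddings.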
The chapter closes with a multi-session eval methodology that almost no team uses but every team should. If your memory system has not been tested against scenarios that establish a fact, interleave unrelated sessions, and then check recall, you do not know if it works.
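A scenario of this kind can be written as a small harness. This sketch assumes the agent is callable as `(session_id, user_message) -> reply` and scores recall with a substring match; both are simplifications, and `RecallScenario`, `passes`, and `recall_rate` are hypothetical names rather than the playbook's methodology.

```python
from dataclasses import dataclass
from typing import Callable

Agent = Callable[[str, str], str]  # (session_id, user_message) -> reply


@dataclass(frozen=True)
class RecallScenario:
    establish: str                 # message that states the fact
    distractors: tuple[str, ...]   # one message per unrelated session in between
    probe: str                     # question asked in a later session
    expected: str                  # text the reply must contain


def passes(agent: Agent, scenario: RecallScenario) -> bool:
    """Establish a fact, run unrelated sessions, then probe in a fresh session."""
    agent("session-0", scenario.establish)
    for i, message in enumerate(scenario.distractors, start=1):
        agent(f"session-{i}", message)
    final_session = f"session-{len(scenario.distractors) + 1}"
    reply = agent(final_session, scenario.probe)
    return scenario.expected.lower() in reply.lower()


def recall_rate(make_agent: Callable[[], Agent],
                scenarios: list[RecallScenario]) -> float:
    """Fraction of scenarios passed, using a fresh agent for each one."""
    if not scenarios:
        raise ValueError("at least one scenario is required")
    passed = sum(passes(make_agent(), s) for s in scenarios)
    return passed / len(scenarios)
```

Building a fresh agent per scenario keeps one scenario's facts from leaking into another and inflating the score.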
The other eight patterns, and how to read the playbook
The four patterns above are previews. The full playbook documents twelve, each with a when-to-use section, an architecture in prose, a reference implementation stack, the failure modes, and the cost profile. The other eight: Prompt Chaining with Gates (the simplest durable pattern), Bounded ReAct Loops (with hard budgets and termination guarantees), Iterative Refinement (reflect-revise with concrete rubrics), Tool-Augmented RAG (the replacement for naive retrieval pipelines), Human-in-the-Loop Checkpoints (designed-in, not fallback), Parallel Fan-Out and Fan-In (for latency and ensembling), Multi-Agent Debate (the most over-applied pattern, used correctly), and the Evaluator-Optimizer Loop for continuous improvement.
The closing chapter ties them together with a one-paragraph decision tree and a composition guide showing how real systems combine three to five of the patterns. Cursor’s agent mode is roughly Plan-Execute-Verify plus Sandboxed Tool Execution plus Human-in-the-Loop, with the Evaluator-Optimizer loop closing things internally. Intercom Fin is Router-Workers plus Prompt Chaining in most workers plus Human-in-the-Loop on escalation. The art is not picking one pattern. It is composing the right three to five.
The playbook is 35 pages, written for ML engineers and architects who are past prototyping and shipping to real users. It assumes you know what a tool call is and have at least one agent in production or close to it. It is free with a chatgptaihub.com signup, alongside the rest of our subscriber library — model comparison deep dives, stack-specific guides, and production case studies.
If you are about to start a new agent project, or you are halfway through one and feeling the architectural weight, the decision tree at the end of the playbook is worth printing and pinning above your desk. Sign up below to read the full reference.
Frequently Asked Questions
What exactly is in the 35-page playbook?
Twelve reference architectures for production agent systems, each documented as a full chapter with when to use the pattern, the architecture in prose, the reference implementation stack (workflow engine, models per stage, storage), the dominant failure modes, the cost profile, and a worked example from a shipping system. Plus an opening chapter on the four primitives that recur across every pattern, and a closing chapter with a decision tree and composition guide showing how real systems combine three to five patterns. About 25,000 words total.
Who is this playbook actually for?
Senior ML engineers, AI engineers, and architects building agent systems for production — not prototypes. The reader is expected to know what a tool call is, what a vector store is, and roughly how a workflow engine like Temporal or Inngest works. If you are choosing between LangGraph and rolling your own, deciding how to split your planner and executor, or trying to figure out why your ReAct loop hits 90 percent in eval and 65 percent in prod, this playbook is written for you.
How is this different from the LangChain or Anthropic agent guides?
Framework documentation tells you how to use a framework. This playbook tells you which architectural patterns are surviving production in 2026 and the trade-offs between them, framework-agnostically. We name specific tools where they are relevant (Temporal, Inngest, E2B, Mem0, Braintrust), but the patterns transfer across stacks. We also write from the operational side — failure modes, cost profiles, eval methodology — which framework docs typically skip.
Are the model recommendations going to be stale in three months?
The specific model recommendations will shift as new versions release — that is unavoidable in this market. The patterns themselves are durable because they reflect engineering constraints of building reliable systems on top of probabilistic components. We update the playbook quarterly with the current model recommendations per stage, and subscribers get the updates automatically. The 2026 edition reflects the GPT-5.1, Claude 4.6/4.7, and Gemini 3.1 generation.
How do I actually get the PDF?
Sign up for a free chatgptaihub.com account using the form on this post. You will get the PDF immediately along with access to the rest of the premium subscriber library, which includes model comparison deep dives, stack-specific implementation guides, and production case studies. No credit card. The weekly briefing also lands in your inbox with new pattern additions and case studies.
What should I read after this playbook?
Three follow-ups depending on your stack. If you are implementing PEV or ReAct in production, our deep dive on workflow engines for AI agents (Temporal vs Inngest vs Restate vs LangGraph) is the next read. If you are building retrieval-heavy agents, our 2026 RAG architecture playbook covers retrieval design in the depth this playbook does not. And if you are setting up the eval loop, our agent evaluation methodology guide is the operational companion to Pattern 12. All three are in the subscriber library.
