How to Build Multi-Agent Teams with OpenAI’s Agent-Team Feature: Preventing Role Conflicts and Managing Agent Hierarchy
How to Build Multi-Agent Teams with OpenAI’s Agent-Team Feature: Preventing Role Conflicts and Managing Agent Hierarchy
Table of Contents
- Introduction: Why Multi-Agent Teams and Why Now
- Multi-Agent Architectures: Concepts and Patterns
- OpenAI’s Agent-Team Feature: Concepts, Capabilities, and Mechanics
- Designing Agent Roles and Responsibilities
- Preventing Role Conflicts and Role Drift
- Implementing Hierarchy and Delegation Patterns
- Communication Protocols Between Agents
- Error Handling and Resilience in Multi-Agent Systems
- Practical Examples with Python: 3-Agent Team (Planner, Executor, Reviewer)
- Testing Multi-Agent Workflows
- Monitoring, Logging, and Observability
- Production Deployment Considerations
- Conclusion and Next Steps
Introduction: Why Multi-Agent Teams and Why Now
For years, single-agent patterns dominated LLM applications: one model with a well-engineered prompt, a handful of tools, and careful orchestration. As workloads evolved—from one-shot generation to multi-step, cross-domain reasoning and tool use—teams of specialized agents have emerged as a pragmatic way to tame complexity. A team of agents mirrors real-world engineering processes: a planner who decomposes tasks, an executor who implements steps with the right tools, and a reviewer who enforces quality and safety standards.
OpenAI’s agent-team feature operationalizes this pattern. It provides structured roles, controlled delegation, and team-level policies so your system can scale from a single assistant to a coordinated, auditable pipeline of experts. Yet, as in real teams, success hinges on clear role boundaries and a robust hierarchy. Without them, agents can “role drift,” overstep their authority, or begin to “appoint themselves” as leaders, causing cascading errors and unpredictable behavior.
This tutorial offers a complete, practical guide to building multi-agent teams with OpenAI’s agent-team feature, with strong emphasis on:
- Designing precise roles and boundaries
- Preventing role conflicts and leader-election pathologies
- Implementing hierarchy and delegation patterns
- Defining communication protocols among agents
- Error handling, testing, monitoring, and productionization
We will implement a three-agent pattern—planner, executor, and reviewer—with complete Python examples, and we will show how to enforce role boundaries at multiple layers: instruction design, tool permissions, policies, and runtime checks.
Related topics worth exploring as you scale include For a deeper exploration of this topic, our comprehensive guide on The Codex Microservices Playbook: 20 Prompts for Designing, Implementing, and Testing Distributed Systems provides detailed strategies and implementation frameworks that complement the approaches discussed in this section., For a deeper exploration of this topic, our comprehensive guide on The Complete Guide to Google’s Agent2Agent Protocol and OpenAI Codex Interoperability: Building Cross-Platform AI Agent Systems provides detailed strategies and implementation frameworks that complement the approaches discussed in this section., and For a deeper exploration of this topic, our comprehensive guide on The Codex Application Modernization Playbook: 15 Prompts for Migrating Legacy Systems to Cloud-Native Architecture provides detailed strategies and implementation frameworks that complement the approaches discussed in this section..
Multi-Agent Architectures: Concepts and Patterns
A multi-agent system decomposes a complex task into specialized components. Each agent is a policy (LLM + instructions + tools) responsible for a subset of the workflow. The key advantages include:
- Separation of concerns: each agent focuses on a narrow set of responsibilities.
- Better safety and auditability: reviewers and gatekeepers are explicit roles.
- Performance: parallelization, caching, and specialization reduce end-to-end latency and cost.
- Operational clarity: when things fail, you know which role failed and why.
Common Patterns
- Planner–Executor–Reviewer (PER): a canonical triad where the planner decomposes work, the executor performs steps with tools, and the reviewer checks outputs for quality and policy compliance.
- Coordinator–Specialists: a coordinator agent routes sub-tasks to one of several specialists (e.g., data analysis, retrieval, drafting) and aggregates results.
- Supervisor Trees: hierarchical supervision where sub-teams report to mid-level supervisors who aggregate and escalate to a top-level orchestrator.
- Critic–Proposer Loops: a proposer generates artifacts; a critic iteratively improves them with constraints and acceptance criteria.
Key Risks in Multi-Agent Designs
- Role drift: agents stray from their remit, e.g., executor attempts to redefine the plan.
- Leader escalation: an agent tries to assume coordination authority (“I will manage the team”).
- Circular delegation: agents bounce tasks back and forth without termination.
- Over-delegation: proliferation of unnecessary sub-tasks that increase cost and latency.
- Ambiguous interfaces: agents exchange vague messages, creating inconsistencies and rework.
OpenAI’s Agent-Team Feature: Concepts, Capabilities, and Mechanics
OpenAI’s agent-team feature streamlines the creation and management of role-based multi-agent systems. It enables:
- Defining agents as reusable assets with instructions, tools, and metadata (e.g., role, capabilities).
- Forming teams with membership, hierarchy, and policies for delegation and tool-use boundaries.
- Running team sessions that coordinate message passing, routing, and execution lifecycles.
- Injecting team-level guardrails for role enforcement and safety checks.
Core Concepts
- Agent: a configured LLM persona with tools, system instructions, and policy metadata.
- Team: a set of agents plus coordination rules (who can delegate, to whom, and when).
- Session: a runtime container for a task, holding messages, tool outputs, artifacts, and state.
- Policy: declarative or programmatic rules describing allowed actions and escalation conditions.
- Channel: a logical communication pathway (plan, execute, review, control) with message schemas.
Lifecycle
- Define agents and their tools.
- Create a team with hierarchy and delegation policies.
- Open a session for a user request.
- Send the initial message to the team; the planner forms a plan.
- The team routes tasks to executors; reviewer validates outputs.
- Session terminates with a final artifact or response.
SDK Snapshot (Python)
Below we will show two pathways:
- Using agent-team primitives to declare a team and run a coordinated session.
- A compatibility orchestration layer that achieves the same pattern if your SDK version doesn’t expose agent-team helpers directly. This is useful for experimentation, offline unit testing, or custom routing.
Designing Agent Roles and Responsibilities
Start with a crisp role charter, akin to a software team’s RACI (Responsible, Accountable, Consulted, Informed). The following matrix illustrates clear boundaries for a PER team:
| Role | Primary Responsibilities | Explicit Non-Responsibilities | Allowed Tools | Escalation Paths |
|---|---|---|---|---|
| Planner | Decompose user request into steps; define acceptance criteria; assign steps to executor(s). | Does not run code; does not write files; does not change outputs after reviewer approval. | Search/Retrieval (read-only); planning aides (no code execution). | Can request reviewer to validate plan; can re-plan after reviewer feedback. |
| Executor | Perform tool calls; implement steps; produce artifacts; track intermediate state. | Does not change plan scope; does not approve final outputs; cannot appoint itself as leader. | Code interpreter; sandboxed shell; file I/O (sandbox); APIs needed for execution. | Requests clarifications from planner; submits for review to reviewer. |
| Reviewer | Validate correctness, safety, and policy compliance; request fixes; approve/reject. | Does not execute tools; does not decompose tasks; cannot re-architect the plan. | None (LLM-only) or read-only tools. | Can send change requests to executor or planner; final approval authority. |
Boundary-Centric Instructions
Each agent’s system prompt must explicitly encode boundaries and escalation rules. Make boundaries machine-checkable by:
- Stating disallowed behaviors (“Do not execute code” for planner).
- Referencing channels (“Use only the plan channel to emit plans”).
- Embedding structured output schemas (JSON) that your orchestrator can validate.
- Documenting escalation conditions with keywords that policies can recognize (“REQUEST_REPLAN”, “REQUEST_FIX”).
Tool Permission Scoping
Minimize tool surfaces to reduce attack area and control cost. Capabilities are granted per agent:
- Planner: retrieval-only tools; no direct network or code execution.
- Executor: code interpreter, sandbox file system; restricted external calls through vetted functions.
- Reviewer: LLM-only or read-only validation tools (e.g., schema validators).
Preventing Role Conflicts and Role Drift
Preventing role conflicts is the spine of multi-agent reliability. When an agent oversteps its role—especially trying to coordinate others—the system’s invariants collapse. Protect against this at four layers:
- Instruction-level constraints
- Tool permission boundaries
- Team-level policies (declarative and programmatic)
- Runtime checks and auditing
1) Instruction-Level Constraints
- Positively state responsibilities and negatively state prohibitions.
- Use strong, unambiguous language: “You must not…”
- Include structured outputs with explicit channels and message types.
2) Tool Boundaries
- Design toolkits per role; never share executor-level tools with planner or reviewer.
- Run-time deny tool invocations from non-privileged agents; return policy errors.
3) Team-Level Policies
Implement role enforcement policies that inspect messages for “leadership” phrases or attempt to delegate beyond the allowed graph. Use a fixed “leadership token” concept—only specific roles possess a virtual token to emit control-plane messages. Others are blocked by policy.
4) Runtime Checks and Auditing
- All messages tagged with agent_id, role, channel, message_type.
- Validate each message against the team policy at ingress and egress.
- Emit metrics: role_conflict_detected, blocked_delegations, policy_violations.
Leader-Escalation and Role-Drift Detectors
Use simple but effective detectors:
- Phrase detector: block messages containing “I will coordinate”, “I will act as manager”, “I will assign agents”.
- Channel misuse: any executor message on control/plan channel triggers a violation.
- Delegation graph check: verify target_of_delegation ∈ allowed_successors[agent.role].
Implementing Hierarchy and Delegation Patterns
A hierarchy creates clarity: who plans, who executes, who approves. Delegation patterns refine how work flows.
Topologies
- Star (central planner): all execution routed from planner to executors; reviewer approves.
- Tree (supervisors): intermediate supervisors manage sub-teams, allowing scale beyond a single planner.
- DAG of specialists: tasks can be routed to different specialists; reviewer remains a terminal gate.
Delegation Rules
- Planner can delegate to executor(s) and request reviewer validation of the plan.
- Executor cannot delegate to planner except for clarifications (“CLARIFY” message).
- Executor must submit artifacts to reviewer; reviewer either approves or requests fixes.
- Only reviewer can signal “FINAL_APPROVAL”.
Termination Conditions
- Reviewer issues FINAL_APPROVAL and returns artifacts to caller.
- Planner explicitly cancels session with CANCEL(reason) if irreconcilable constraints arise.
- Timeouts or policy violations can abort the run; artifacts stored for triage.
Communication Protocols Between Agents
Define strict envelopes and channels for inter-agent messages to make intent parseable and auditable.
Message Envelope
{
"session_id": "uuid",
"message_id": "uuid",
"timestamp": "RFC3339",
"from": {"agent_id": "uuid", "role": "planner|executor|reviewer"},
"to": {"agent_id": "uuid|team", "role": "executor|reviewer|planner|team"},
"channel": "plan|execute|review|control",
"message_type": "PLAN|TASK|ARTIFACT|REVIEW_REQUEST|REVIEW|APPROVAL|REJECT|CLARIFY",
"correlation_id": "uuid-of-parent",
"payload": {},
"policy": {
"delegation_allowed": true|false,
"tools_scope": ["code_interpreter", "..."],
"safety_flags": ["no_network", "read_only"]
}
}
Channels
- plan: planner emits plans, updates, and re-plans.
- execute: executor receives tasks and posts artifacts.
- review: reviewer receives artifacts and emits REVIEW/APPROVAL/REJECT.
- control: reserved for orchestration (session start, abort, metrics).
Structured Payload Schemas
Examples:
// PLAN payload
{
"steps": [
{ "id": "S1", "description": "Load CSV", "acceptance": "Dataframe loaded" },
{ "id": "S2", "description": "Compute aggregates", "acceptance": "Summary table JSON" }
],
"constraints": ["no_network", "max_runtime:120s"],
"handoff_to": "executor"
}
// TASK payload
{
"plan_step_id": "S1",
"instructions": "Load CSV from provided content",
"inputs": { "csv": "<string>" },
"expected_outputs": ["dataframe_preview.png", "schema.json"]
}
// REVIEW payload
{
"artifact_ids": ["A1", "A2"],
"checks": ["correctness", "policy_compliance"],
"verdict": "approve|reject",
"notes": "..."
}
Error Handling and Resilience in Multi-Agent Systems
Error handling must be role-aware and policy-driven. Treat failures as first-class events with structured semantics.
Error Taxonomy
- Tool Errors: failures in code interpreter, network, or external APIs.
- Policy Violations: role conflicts, forbidden tools, unsafe outputs.
- Protocol Errors: malformed message envelopes or channels misuse.
- Timeouts: agent exceeds allocated wall-clock time.
- Content Risks: hallucinations, PII leakage, toxicity, or unsafe code.
Strategies
- Retries with jitter for transient tool errors; strict caps to avoid cost blowups.
- Fallbacks: switch models or simplify tasks with reduced scope.
- Circuit breakers: temporarily block tools or agents after repeated failures.
- Escalation: notify planner for re-plan after executor failures; reviewer for policy violations.
- Idempotency: include correlation_id to deduplicate retried steps.
Practical Examples with Python: 3-Agent Team (Planner, Executor, Reviewer)
This section provides end-to-end code. We present:
- A team declared and run via agent-team primitives.
- A compatible orchestration layer for environments without native agent-team support (helpful for testing and custom routing).
Prerequisites
- Python 3.9+
- OpenAI Python SDK (latest)
- Optional: opentelemetry-sdk for tracing (used later)
Environment Setup
pip install openai opentelemetry-sdk pydantic rich
A) Declaring a Team with Agent-Team Primitives
The following example uses agent-team constructs to define roles, assemble a team, and run a session. If your SDK version differs, adapt names accordingly or use the compatibility orchestrator shown afterward.
from typing import Dict, Any, List, Optional
import json
import time
import uuid
from openai import OpenAI
client = OpenAI()
# 1) Define strict system prompts with role boundaries and structured outputs.
PLANNER_SYS = """
Role: Planner
You are responsible for decomposing the user's request into an executable plan of steps.
You must NOT execute code, call tools that write files, or assume leadership of other agents.
Output only JSON with the schema:
{
"message_type": "PLAN",
"steps": [
{ "id": "S<n>", "description": "...", "acceptance": "..." }
],
"constraints": ["..."],
"handoff_to": "executor"
}
Use only the 'plan' channel. Never emit control messages.
If execution fails, propose a minimal REPLAN with changed steps.
"""
EXECUTOR_SYS = """
Role: Executor
You perform the plan's steps using ONLY the tools granted to you.
You must NOT modify the plan's scope or assume leadership.
You must NOT emit PLAN messages. Use 'execute' channel exclusively.
On completion of each step, output JSON:
{
"message_type": "ARTIFACT",
"step_id": "S<n>",
"artifacts": [{ "id": "A<n>", "type": "file|image|text|json", "uri": "..." }],
"notes": "..."
}
"""
REVIEWER_SYS = """
Role: Reviewer
You validate correctness and policy compliance. You have no tool access.
You must NOT execute or modify the plan; do not create new tasks.
Output only JSON with schema:
{
"message_type": "REVIEW",
"verdict": "approve|reject",
"issues": ["..."],
"requests": ["..."] # suggestions for fixes
}
Use the 'review' channel.
"""
# 2) Create agents with scoped tools.
# For illustration, the executor has a code interpreter. Planner and Reviewer are LLM-only.
planner = client.agents.create(
name="Planner",
model="gpt-4o-mini",
instructions=PLANNER_SYS,
tools=[], # planner has no execution tools
metadata={"role": "planner", "channels": ["plan"]}
)
executor = client.agents.create(
name="Executor",
model="gpt-4o",
instructions=EXECUTOR_SYS,
tools=[{"type": "code_interpreter"}], # limited toolset
metadata={"role": "executor", "channels": ["execute"]}
)
reviewer = client.agents.create(
name="Reviewer",
model="gpt-4o-mini",
instructions=REVIEWER_SYS,
tools=[], # reviewer has no tools
metadata={"role": "reviewer", "channels": ["review"]}
)
# 3) Define team-level policy for delegation and role enforcement.
# Policies include allowed delegations and prohibited phrases for leadership escalation.
team_policy = {
"delegation_graph": {
"planner": ["executor", "reviewer"],
"executor": ["reviewer"], # executor can only send to reviewer
"reviewer": [] # reviewer never delegates
},
"prohibited_leadership_phrases": [
"I will coordinate", "I will manage the team", "I will assign agents"
],
"allowed_channels": {
"planner": ["plan"],
"executor": ["execute"],
"reviewer": ["review"]
},
"deny_on_channel_misuse": True
}
# 4) Create the team with hierarchy encoded.
team = client.agent_teams.create(
name="PER Team",
members=[planner.id, executor.id, reviewer.id],
hierarchy={
"leader": planner.id, # logical leader for initiating plans (not for execution)
"review_gate": reviewer.id
},
policy=team_policy
)
# 5) Open a session and run a task.
session = client.agent_teams.sessions.create(
team_id=team.id,
metadata={"purpose": "demo-per", "user": "example_user"}
)
user_task = "Given a CSV of monthly sales (provided below), compute per-region totals and produce a brief summary. CSV:\nregion,month,sales\nEMEA,1,100\nNA,1,200\nEMEA,2,150\nNA,2,180\nAPAC,1,90\nAPAC,2,95"
# Start the run: planner generates the plan; executor executes; reviewer reviews.
run = client.agent_teams.runs.create(
team_id=team.id,
session_id=session.id,
input=user_task,
stream=True, # stream events for fine-grained control
)
# 6) Consume stream and enforce runtime policies (defense in depth).
for event in run:
# Typical event types: "message", "tool_call", "artifact", "policy_violation", "completed"
et = event["type"]
if et == "message":
msg = event["data"]
role = msg["from"]["role"]
channel = msg["channel"]
# Channel enforcement:
if channel not in team_policy["allowed_channels"][role]:
client.agent_teams.runs.action(
team_id=team.id, session_id=session.id, action="block", reason="channel_misuse"
)
continue
# Leadership phrase detection:
text = msg.get("text", "")
for phrase in team_policy["prohibited_leadership_phrases"]:
if phrase.lower() in text.lower():
client.agent_teams.runs.action(
team_id=team.id, session_id=session.id, action="block", reason="leadership_escalation"
)
break
elif et == "tool_call":
# Verify agent role vs tool permissions
call = event["data"]
role = call["from"]["role"]
if role != "executor":
client.agent_teams.runs.action(
team_id=team.id, session_id=session.id, action="deny_tool", reason="role_not_allowed"
)
# else: allow; agent-team runtime will sandbox accordingly
elif et == "artifact":
# Forward artifacts to reviewer automatically as per policy
artifact = event["data"]
client.agent_teams.messages.create(
team_id=team.id, session_id=session.id,
to={"role": "reviewer", "agent_id": reviewer.id},
channel="review",
payload={
"message_type": "REVIEW_REQUEST",
"artifact_ids": [artifact["id"]],
"notes": "Auto-submitted by policy"
}
)
elif et == "policy_violation":
# Log and potentially terminate
violation = event["data"]
print("Policy violation:", violation)
client.agent_teams.runs.action(
team_id=team.id, session_id=session.id, action="abort", reason="policy_violation"
)
break
elif et == "completed":
result = event["data"]
print("Run completed. Summary:")
print(json.dumps(result, indent=2))
break
The example above demonstrates defense-in-depth: agent instructions, tool scoping, team policy, and live runtime checks. Even if a model tries to deviate, your orchestration code enforces boundaries.
B) Compatibility Orchestrator (Manual Routing with Agents)
If agent-team helpers are not available in your environment, you can implement the same semantics with explicit orchestration. This gives you flexibility in testing and makes the policy logic transparent.
from typing import Dict, Any, Tuple
import json
import uuid
from openai import OpenAI
client = OpenAI()
def make_message(from_role: str, to_role: str, channel: str, message_type: str, payload: Dict[str, Any]) -> Dict[str, Any]:
return {
"session_id": str(uuid.uuid4()),
"message_id": str(uuid.uuid4()),
"from": {"role": from_role},
"to": {"role": to_role},
"channel": channel,
"message_type": message_type,
"payload": payload
}
# Create three "agents" as call patterns with instructions and optional tools.
def call_planner(user_task: str) -> Dict[str, Any]:
resp = client.responses.create(
model="gpt-4o-mini",
input=[
{"role": "system", "content": PLANNER_SYS},
{"role": "user", "content": f"USER_TASK:\n{user_task}\nOutput JSON only."}
],
)
text = resp.output_text
return json.loads(text)
def call_executor(step: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
# Provide the code interpreter tool via function-calling pattern
tools = [{
"type": "function",
"function": {
"name": "python_exec",
"description": "Execute Python code in a sandbox and return stdout and generated files",
"parameters": {
"type": "object",
"properties": {"code": {"type": "string"}},
"required": ["code"]
}
}
}]
messages = [
{"role": "system", "content": EXECUTOR_SYS},
{
"role": "user", "content": json.dumps({
"step": step,
"context": context,
"constraints": ["no_network"]
})
}
]
# We will iteratively resolve tool calls until the model returns an ARTIFACT message.
state = {"artifacts": []}
while True:
resp = client.responses.create(model="gpt-4o", input=messages, tools=tools, tool_choice="auto")
for item in resp.output:
if item.type == "message":
content = getattr(item, "content", [])
if content:
text = content[0].get("text", {}).get("value")
try:
data = json.loads(text)
if data.get("message_type") == "ARTIFACT":
return data
except Exception:
# If it's non-JSON text, continue loop
messages.append({"role": "assistant", "content": text})
elif item.type == "tool_call":
tc = item
if tc.name == "python_exec":
code = tc.arguments.get("code", "")
# For demo, we simulate a sandbox. In production, secure a real sandbox.
try:
# DANGEROUS: For demonstration only. Replace with real sandbox.
local_ns = {}
exec(code, {}, local_ns)
result = str(local_ns.get("result", ""))
except Exception as e:
result = f"ERROR: {e}"
tool_result = {
"tool_name": "python_exec",
"result": result
}
messages.append({"role": "tool", "name": "python_exec", "content": json.dumps(tool_result)})
# Loop to let the assistant produce ARTIFACT after handling tool outputs
def call_reviewer(artifacts: Dict[str, Any], plan: Dict[str, Any]) -> Dict[str, Any]:
resp = client.responses.create(
model="gpt-4o-mini",
input=[
{"role": "system", "content": REVIEWER_SYS},
{"role": "user", "content": json.dumps({"plan": plan, "artifacts": artifacts})}
]
)
text = resp.output_text
return json.loads(text)
def enforce_policy(from_role: str, to_role: str, channel: str, text: str) -> Tuple[bool, str]:
# Basic enforcement: channels and leadership phrases
if channel not in team_policy["allowed_channels"][from_role]:
return False, "channel_misuse"
for phrase in team_policy["prohibited_leadership_phrases"]:
if phrase.lower() in text.lower():
return False, "leadership_escalation"
# Delegation graph
if to_role not in team_policy["delegation_graph"][from_role] and to_role != "team":
return False, "invalid_delegation"
return True, "ok"
def run_session(user_task: str) -> Dict[str, Any]:
# 1) Planner creates plan
plan = call_planner(user_task)
msg = make_message("planner", "executor", "plan", "PLAN", plan)
ok, reason = enforce_policy("planner", "executor", "plan", json.dumps(plan))
if not ok:
return {"status": "aborted", "reason": reason}
# 2) Executor runs steps
artifacts = {"message_type": "ARTIFACTS", "items": []}
for step in plan["steps"]:
art = call_executor(step, context={"user_task": user_task})
exec_msg = make_message("executor", "reviewer", "execute", "ARTIFACT", art)
ok, reason = enforce_policy("executor", "reviewer", "execute", json.dumps(art))
if not ok:
return {"status": "aborted", "reason": reason}
artifacts["items"].append(art)
# 3) Reviewer validates
review = call_reviewer(artifacts, plan)
rev_msg = make_message("reviewer", "team", "review", "REVIEW", review)
ok, reason = enforce_policy("reviewer", "team", "review", json.dumps(review))
if not ok:
return {"status": "aborted", "reason": reason}
if review["verdict"] == "approve":
return {"status": "ok", "artifacts": artifacts, "review": review}
else:
# Optionally trigger replanning
repl = {"message_type": "REPLAN_REQUEST", "issues": review["issues"], "requests": review["requests"]}
return {"status": "needs_replan", "review": review, "replan": repl}
if __name__ == "__main__":
task = "Given a CSV of monthly sales (provided below), compute per-region totals and produce a brief summary. CSV:\nregion,month,sales\nEMEA,1,100\nNA,1,200\nEMEA,2,150\nNA,2,180\nAPAC,1,90\nAPAC,2,95"
result = run_session(task)
print(json.dumps(result, indent=2))
This orchestrator reinforces policy at every hop and shows how to make role boundaries explicit and machine-enforced. It is straightforward to substitute real sandboxes and production-grade tooling for the mocked function tool handler.
Role Boundary Highlights in the Example
- Planner cannot call tools and only emits a PLAN JSON object on the plan channel.
- Executor can call tools, but cannot send anything on the plan channel.
- Reviewer cannot call tools and can only emit a REVIEW JSON object.
- Delegation graph disallows executor→planner delegation except with explicit CLARIFY rules (left out here for brevity but easy to add as a channel and message_type).
Extensions
- Multiple executors: route steps by capability tags (e.g., “python”, “nlp”, “viz”).
- Reviewer variants: compliance reviewer vs. technical reviewer chained in sequence.
- Caching: cache plan fragments and executor artifacts to reduce cost.
- Long-running tasks: persist session state and artifacts in object storage.
Testing Multi-Agent Workflows
Testing is where multi-agent systems become tractable. Build confidence with multiple test layers:
Unit Tests for Policies and Prompts
Verify policy logic without incurring model calls.
import pytest
def test_enforce_policy_allows_valid_paths():
text = json.dumps({"message_type": "PLAN"})
ok, reason = enforce_policy("planner", "executor", "plan", text)
assert ok and reason == "ok"
def test_enforce_policy_blocks_channel_misuse():
text = json.dumps({"message_type": "ARTIFACT"})
ok, reason = enforce_policy("executor", "reviewer", "plan", text) # wrong channel
assert not ok and reason == "channel_misuse"
def test_enforce_policy_blocks_leadership_escalation():
text = "I will coordinate all agents to optimize throughput"
ok, reason = enforce_policy("executor", "reviewer", "execute", text)
assert not ok and reason == "leadership_escalation"
Simulation and Golden-File Tests
- Record “golden” sessions for known inputs. Re-run after changes to detect regressions.
- Simulate failures (tool crash, timeout, reviewer rejection) to test replan and retry logic.
- Use seedable synthetic data to drive deterministic scenarios.
Property-Based Tests
Define invariants such as “no executor message on plan channel” and randomly generate message sequences to ensure invariants hold. This can catch rare race conditions and complex routing bugs early.
Behavioral Evaluations
- Role discipline metric: fraction of messages that conform to channel-by-role constraints.
- False-positive blocking rate: how often valid messages are incorrectly blocked by policy.
- Task success rate: approved sessions over total sessions per task category.
- Cost and latency distribution: ensure performance SLOs are met.
Monitoring, Logging, and Observability
Observability must be first-class in multi-agent systems. Without granular telemetry, debugging role drift and routing issues becomes guesswork.
Structured Logs
Log each message with standardized fields:
- agent_id, role, channel, message_type
- latency_ms, model, token_usage
- policy_outcome, policy_reasons
- tool_calls (name, args_hash, duration_ms)
import logging
import json
logger = logging.getLogger("agent_team")
logger.setLevel(logging.INFO)
def log_event(event_type: str, **kwargs):
evt = {"event": event_type, **kwargs}
logger.info(json.dumps(evt))
Distributed Tracing (OpenTelemetry)
Instrument agent calls and tool invocations with spans and attributes. Attach session_id and step ids for cohesive traces.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
provider = TracerProvider()
processor = SimpleSpanProcessor(ConsoleSpanExporter())
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
def traced_call_executor(step, context):
with tracer.start_as_current_span("executor.run") as span:
span.set_attribute("step.id", step["id"])
span.set_attribute("tools", "python_exec")
result = call_executor(step, context)
span.set_attribute("artifact.count", len(result.get("artifacts", [])))
return result
Key Metrics
- role_conflict_detected: count by role and channel.
- policy_violations: aggregate reasons.
- delegations_blocked: number of blocked delegations by edge.
- approval_rate: reviewer approvals over attempts.
- latency: p50/p95 for planner, executor, reviewer paths.
- cost: tokens per role and per session.
Production Deployment Considerations
Moving a multi-agent team into production requires rigorous controls on performance, safety, and cost.
Scaling and Concurrency
- Use asynchronous orchestration for concurrent tool calls and multi-executor parallelism.
- Introduce backpressure: if reviewer becomes a bottleneck, queue review requests with priorities.
- Sharding: route sessions by tenant or task type to separate worker pools.
Rate Limits and Quotas
- Track token budgets per session; abort or degrade when approaching thresholds.
- Implement exponential backoff for HTTP 429/5xx; surface partial results if permissible.
Security and Safety
- Sandbox execution: never run arbitrary code in-process. Use isolated, memory-limited containers.
- Network egress control: executor’s environment should have tight egress policies and allowlisted endpoints.
- Secret management: inject credentials via short-lived tokens and scoped roles.
- PII and compliance: scrub logs; apply redaction at the edge of your logging pipeline.
Cost Management
- Choose models per role: reviewers can often use lighter models; executors may need premium models only for hard steps.
- Cache plan steps and stable artifacts; do not recompute unless inputs changed.
- Track per-role token usage and set alerts for regressions.
Versioning and Change Control
- Version agent definitions (prompts, tools) and team policies.
- Use feature flags or AB tests to roll out changes gradually.
- Maintain migration scripts for session schemas and artifact stores.
Advanced Guardrails
- Output schemas and strong validators for artifacts (e.g., strict JSON schemas).
- Static policy scans of prompts to detect contradictions or leaky instructions.
- Semantic diffing: compare plan versions to detect unexpected scope changes.
Disaster Recovery and Graceful Degradation
- Persist session state and artifacts; support replay for triage.
- Fallback to single-agent mode with constrained scope if teams are unavailable.
- Offer human-in-the-loop escalation for high-stakes flows.
Worked Example Deep Dive: Enforcing Boundaries End-to-End
This deep dive expands on the PER pattern to highlight specific enforcement points and data flows. The goal is to illustrate how you can guarantee, with high probability, that:
- No planner executes tools.
- No executor emits plan messages or asserts leadership.
- No reviewer performs execution or re-planning.
Prompt Design and Structured Outputs
Each role has a schema-driven output. Your orchestrator should reject messages that do not conform. Example schema validators (conceptual):
from pydantic import BaseModel, Field, ValidationError
from typing import List
class PlanStep(BaseModel):
id: str
description: str
acceptance: str
class Plan(BaseModel):
message_type: str = Field("PLAN", const=True)
steps: List[PlanStep]
constraints: List[str]
handoff_to: str = Field("executor", const=True)
class Artifact(BaseModel):
message_type: str = Field("ARTIFACT", const=True)
step_id: str
artifacts: List[dict]
notes: str
class Review(BaseModel):
message_type: str = Field("REVIEW", const=True)
verdict: str # "approve" or "reject"
issues: List[str]
requests: List[str]
def validate_message(role: str, payload: dict):
try:
if role == "planner":
Plan(**payload)
elif role == "executor":
Artifact(**payload)
elif role == "reviewer":
Review(**payload)
else:
raise ValueError("unknown role")
return True, None
except ValidationError as e:
return False, str(e)
Policy Hooks on Ingress and Egress
Place policy checks at two boundaries:
- Ingress (before a role consumes a message) to prevent invalid delegations and channel misuse.
- Egress (before a role’s output publishes) to stop leadership escalation and forbidden content.
Together with schema validation, this provides robust, layered defense.
Timeouts, Retries, and Idempotency
- Step-level timeouts: tune by tool class (e.g., 15s for retrieval, 60–120s for execution).
- Retries: use bounded retry counts and exponential backoff; avoid re-executing non-idempotent steps.
- Idempotency keys: tag steps with correlation_id; store results in an artifact cache keyed by hash(inputs).
Backpressure and Flow Control
- Limit concurrent tool calls per session and globally; enforce quotas by role.
- Queue review requests if reviewer latency becomes a bottleneck; process FIFO with priority for critical tasks.
Advanced Communication Patterns
For complex teams, consider:
- Topic-based routing: tag messages with topic (“data_prep”, “analysis”, “writeup”).
- Capability-based discovery: executors register capabilities; planner delegates by capability match.
- Audit channels: reviewer emits structured rationales into a compliance log.
Sample End-to-End Exchange
// 1) Planner emits PLAN
{ "channel":"plan", "from":{"role":"planner"}, "to":{"role":"executor"},
"message_type":"PLAN", "payload":{ "steps":[...], "constraints":["no_network"] } }
// 2) Executor acknowledges and executes step S1
{ "channel":"execute", "from":{"role":"executor"}, "to":{"role":"reviewer"},
"message_type":"ARTIFACT", "payload":{ "step_id":"S1", "artifacts":[...]} }
// 3) Reviewer rejects with fix requests
{ "channel":"review", "from":{"role":"reviewer"}, "to":{"role":"executor"},
"message_type":"REVIEW", "payload":{ "verdict":"reject", "requests":["normalize headers"]} }
// 4) Executor applies fix and resubmits
{ "channel":"execute", "from":{"role":"executor"}, "to":{"role":"reviewer"},
"message_type":"ARTIFACT", "payload":{ "step_id":"S1", "artifacts":[...]} }
// 5) Reviewer approves; team returns final artifact
{ "channel":"review", "from":{"role":"reviewer"}, "to":{"role":"team"},
"message_type":"REVIEW", "payload":{ "verdict":"approve", "issues":[], "requests":[] } }
Tooling Guardrails for the Executor
Executors need strong guardrails to keep execution safe:
- Real sandbox: containerized environment, CPU and memory quotas, network policies, filesystem isolation.
- Tool adapters: whitelist of allowed functions with strict input schemas and output sanitization.
- Content filters: block code that exfiltrates secrets, spawns long-running processes, or writes outside sandbox.
In addition, record all code snippets and outputs for audit; this is invaluable during incident response.
Integrating with External Systems
Multi-agent teams often need to interact with databases, APIs, and internal systems. Keep the integration surface minimal and explicit:
- Wrap external APIs as first-party tools with typed schemas and business-purpose descriptions.
- Limit credentials to the executor, scoped to the least privilege necessary.
- Use read-only APIs for the planner and reviewer.
Example: Safe Data Access Tool (Executor Only)
SAFE_QUERY_TOOL = {
"type": "function",
"function": {
"name": "safe_sql_query",
"description": "Execute parameterized read-only SQL queries",
"parameters": {
"type": "object",
"properties": {
"sql": {"type": "string", "description": "SELECT-only query with LIMIT"},
"params": {"type": "object"}
},
"required": ["sql"]
}
}
}
Attach this tool only to the executor. The reviewer can be given a “schema validator” tool that accepts artifacts and checks them for adherence to a JSON schema, but cannot directly query data.
Governance: Ownership, Audits, and Compliance
Treat agent definitions, team policies, and routing logic as core infrastructure subject to review and change control.
- Code-review agent prompts and policies; lint for contradictions and missing constraints.
- Run policy audits: ensure no role has broader tool access than documented.
- Maintain audit trails: immutable logs of plans, executions, reviews, and policy decisions.
FAQ: Common Pitfalls and Remedies
- Planner generating code? Remove any execution tools from planner; reinforce instructions; add a denylist phrase detector.
- Executor attempting to “optimize the plan”? Add schema checks to reject PLAN output from executor; block plan channel for executor.
- Reviewer stalling? Set explicit SLAs and implement a watchdog timer to escalate to a human or auto-approve low-risk steps.
- Escalation loops? Enforce delegation DAG; reject cycles on ingress; cap replan iterations.
For deeper methodologies on prompt discipline and robust evaluation frameworks, see For a deeper exploration of this topic, our comprehensive guide on The Codex Microservices Playbook: 20 Prompts for Designing, Implementing, and Testing Distributed Systems provides detailed strategies and implementation frameworks that complement the approaches discussed in this section. and For a deeper exploration of this topic, our comprehensive guide on The Complete Guide to Google’s Agent2Agent Protocol and OpenAI Codex Interoperability: Building Cross-Platform AI Agent Systems provides detailed strategies and implementation frameworks that complement the approaches discussed in this section..
Appendix: Expanded Code for a Full PER Team Session
Below is a more complete script that integrates schema validation, policy enforcement, and basic telemetry. It uses the compatibility orchestrator to keep the logic fully explicit.
import json
import time
import uuid
from typing import Dict, Any, List, Tuple
from pydantic import ValidationError
from openai import OpenAI
client = OpenAI()
def now_ms():
return int(time.time() * 1000)
def policy_check_message(role: str, to_role: str, channel: str, payload: Dict[str, Any]) -> Tuple[bool, str]:
text = json.dumps(payload)
ok, reason = enforce_policy(role, to_role, channel, text)
if not ok:
return ok, reason
valid, err = validate_message(role, payload)
if not valid:
return False, f"schema_violation: {err}"
return True, "ok"
def orchestrate_per(user_task: str) -> Dict[str, Any]:
session_id = str(uuid.uuid4())
t0 = now_ms()
# 1) Plan
plan = call_planner(user_task)
ok, reason = policy_check_message("planner", "executor", "plan", plan)
if not ok:
return {"status": "aborted", "stage": "plan", "reason": reason}
# 2) Execute
artifacts = {"message_type": "ARTIFACTS", "session_id": session_id, "items": []}
for step in plan["steps"]:
step_t0 = now_ms()
art = call_executor(step, context={"user_task": user_task})
ok, reason = policy_check_message("executor", "reviewer", "execute", art)
if not ok:
return {"status": "aborted", "stage": "execute", "step": step["id"], "reason": reason}
art["latency_ms"] = now_ms() - step_t0
artifacts["items"].append(art)
# 3) Review
review = call_reviewer(artifacts, plan)
ok, reason = policy_check_message("reviewer", "team", "review", review)
if not ok:
return {"status": "aborted", "stage": "review", "reason": reason}
if review["verdict"] == "approve":
return {
"status": "ok",
"latency_ms": now_ms() - t0,
"plan": plan,
"artifacts": artifacts,
"review": review
}
else:
# Merge requested fixes into a new plan request
replan_req = {
"message_type": "REPLAN_REQUEST",
"issues": review.get("issues", []),
"requests": review.get("requests", [])
}
return {
"status": "needs_replan",
"latency_ms": now_ms() - t0,
"plan": plan,
"artifacts": artifacts,
"review": review,
"replan_request": replan_req
}
if __name__ == "__main__":
user_csv = "region,month,sales\nEMEA,1,100\nNA,1,200\nEMEA,2,150\nNA,2,180\nAPAC,1,90\nAPAC,2,95"
task = f"Given this CSV, compute per-region totals and a short summary.\nCSV:\n{user_csv}"
result = orchestrate_per(task)
print(json.dumps(result, indent=2))
Conclusion and Next Steps
Multi-agent teams unlock reliability, scale, and clarity when building complex AI workflows. OpenAI’s agent-team feature formalizes these patterns by giving you first-class roles, delegation rules, and policy enforcement. The Planner–Executor–Reviewer triad provides a robust default architecture:
- Planner defines the work with clear acceptance criteria.
- Executor performs tool-mediated steps within a strict sandbox.
- Reviewer enforces correctness and policy compliance, acting as the project’s gatekeeper.
Preventing role conflicts is not one technique but a layered approach: system prompts, tool scoping, team policies, and runtime checks. With structured communication schemas, strong error handling, and comprehensive observability, you can operate multi-agent teams confidently in production.
As you expand, consider more sophisticated topologies (multiple executors, specialist reviewers), integrate with enterprise systems via tightly-scoped tools, and invest in testing harnesses that simulate edge cases. For broader architectural guidance and adjacent techniques, explore For a deeper exploration of this topic, our comprehensive guide on The Codex Application Modernization Playbook: 15 Prompts for Migrating Legacy Systems to Cloud-Native Architecture provides detailed strategies and implementation frameworks that complement the approaches discussed in this section. and For a deeper exploration of this topic, our comprehensive guide on The Complete Guide to Google’s Agent2Agent Protocol and OpenAI Codex Interoperability: Building Cross-Platform AI Agent Systems provides detailed strategies and implementation frameworks that complement the approaches discussed in this section..
Access 40,000+ AI Prompts for ChatGPT, Claude & Codex — Free!
Subscribe to get instant access to our complete Notion Prompt Library — the largest curated collection of prompts for ChatGPT, Claude, OpenAI Codex, and other leading AI models. Optimized for real-world workflows across coding, research, content creation, and business.



