How to Build Multi-Agent Teams with OpenAI’s Agent-Team Feature: Preventing Role Conflicts and Managing Agent Hierarchy

July 5, 2026

How to Build Multi-Agent Teams with OpenAI’s Agent-Team Feature: Preventing Role Conflicts and Managing Agent Hierarchy

Introduction: Why Multi-Agent Teams and Why Now
Multi-Agent Architectures: Concepts and Patterns
OpenAI’s Agent-Team Feature: Concepts, Capabilities, and Mechanics
Designing Agent Roles and Responsibilities
Preventing Role Conflicts and Role Drift
Implementing Hierarchy and Delegation Patterns
Communication Protocols Between Agents
Error Handling and Resilience in Multi-Agent Systems
Practical Examples with Python: 3-Agent Team (Planner, Executor, Reviewer)
Testing Multi-Agent Workflows
Monitoring, Logging, and Observability
Production Deployment Considerations
Conclusion and Next Steps

Introduction: Why Multi-Agent Teams and Why Now

For years, single-agent patterns dominated LLM applications: one model with a well-engineered prompt, a handful of tools, and careful orchestration. As workloads evolved—from one-shot generation to multi-step, cross-domain reasoning and tool use—teams of specialized agents have emerged as a pragmatic way to tame complexity. A team of agents mirrors real-world engineering processes: a planner who decomposes tasks, an executor who implements steps with the right tools, and a reviewer who enforces quality and safety standards.

OpenAI’s agent-team feature operationalizes this pattern. It provides structured roles, controlled delegation, and team-level policies so your system can scale from a single assistant to a coordinated, auditable pipeline of experts. Yet, as in real teams, success hinges on clear role boundaries and a robust hierarchy. Without them, agents can “role drift,” overstep their authority, or begin to “appoint themselves” as leaders, causing cascading errors and unpredictable behavior.

This tutorial offers a complete, practical guide to building multi-agent teams with OpenAI’s agent-team feature, with strong emphasis on:

Designing precise roles and boundaries
Preventing role conflicts and leader-election pathologies
Implementing hierarchy and delegation patterns
Defining communication protocols among agents
Error handling, testing, monitoring, and productionization

We will implement a three-agent pattern—planner, executor, and reviewer—with complete Python examples, and we will show how to enforce role boundaries at multiple layers: instruction design, tool permissions, policies, and runtime checks.

Related topics worth exploring as you scale include For a deeper exploration of this topic, our comprehensive guide on The Codex Microservices Playbook: 20 Prompts for Designing, Implementing, and Testing Distributed Systems provides detailed strategies and implementation frameworks that complement the approaches discussed in this section., For a deeper exploration of this topic, our comprehensive guide on The Complete Guide to Google’s Agent2Agent Protocol and OpenAI Codex Interoperability: Building Cross-Platform AI Agent Systems provides detailed strategies and implementation frameworks that complement the approaches discussed in this section., and For a deeper exploration of this topic, our comprehensive guide on The Codex Application Modernization Playbook: 15 Prompts for Migrating Legacy Systems to Cloud-Native Architecture provides detailed strategies and implementation frameworks that complement the approaches discussed in this section..

Multi-Agent Architectures: Concepts and Patterns

A multi-agent system decomposes a complex task into specialized components. Each agent is a policy (LLM + instructions + tools) responsible for a subset of the workflow. The key advantages include:

Separation of concerns: each agent focuses on a narrow set of responsibilities.
Better safety and auditability: reviewers and gatekeepers are explicit roles.
Performance: parallelization, caching, and specialization reduce end-to-end latency and cost.
Operational clarity: when things fail, you know which role failed and why.

Common Patterns

Planner–Executor–Reviewer (PER): a canonical triad where the planner decomposes work, the executor performs steps with tools, and the reviewer checks outputs for quality and policy compliance.
Coordinator–Specialists: a coordinator agent routes sub-tasks to one of several specialists (e.g., data analysis, retrieval, drafting) and aggregates results.
Supervisor Trees: hierarchical supervision where sub-teams report to mid-level supervisors who aggregate and escalate to a top-level orchestrator.
Critic–Proposer Loops: a proposer generates artifacts; a critic iteratively improves them with constraints and acceptance criteria.

Key Risks in Multi-Agent Designs

Role drift: agents stray from their remit, e.g., executor attempts to redefine the plan.
Leader escalation: an agent tries to assume coordination authority (“I will manage the team”).
Circular delegation: agents bounce tasks back and forth without termination.
Over-delegation: proliferation of unnecessary sub-tasks that increase cost and latency.
Ambiguous interfaces: agents exchange vague messages, creating inconsistencies and rework.

OpenAI’s Agent-Team Feature: Concepts, Capabilities, and Mechanics

OpenAI’s agent-team feature streamlines the creation and management of role-based multi-agent systems. It enables:

Defining agents as reusable assets with instructions, tools, and metadata (e.g., role, capabilities).
Forming teams with membership, hierarchy, and policies for delegation and tool-use boundaries.
Running team sessions that coordinate message passing, routing, and execution lifecycles.
Injecting team-level guardrails for role enforcement and safety checks.

Core Concepts

Agent: a configured LLM persona with tools, system instructions, and policy metadata.
Team: a set of agents plus coordination rules (who can delegate, to whom, and when).
Session: a runtime container for a task, holding messages, tool outputs, artifacts, and state.
Policy: declarative or programmatic rules describing allowed actions and escalation conditions.
Channel: a logical communication pathway (plan, execute, review, control) with message schemas.

Lifecycle

Define agents and their tools.
Create a team with hierarchy and delegation policies.
Open a session for a user request.
Send the initial message to the team; the planner forms a plan.
The team routes tasks to executors; reviewer validates outputs.
Session terminates with a final artifact or response.

SDK Snapshot (Python)

Below we will show two pathways:

Using agent-team primitives to declare a team and run a coordinated session.
A compatibility orchestration layer that achieves the same pattern if your SDK version doesn’t expose agent-team helpers directly. This is useful for experimentation, offline unit testing, or custom routing.

Designing Agent Roles and Responsibilities

Start with a crisp role charter, akin to a software team’s RACI (Responsible, Accountable, Consulted, Informed). The following matrix illustrates clear boundaries for a PER team:

Role	Primary Responsibilities	Explicit Non-Responsibilities	Allowed Tools	Escalation Paths
Planner	Decompose user request into steps; define acceptance criteria; assign steps to executor(s).	Does not run code; does not write files; does not change outputs after reviewer approval.	Search/Retrieval (read-only); planning aides (no code execution).	Can request reviewer to validate plan; can re-plan after reviewer feedback.
Executor	Perform tool calls; implement steps; produce artifacts; track intermediate state.	Does not change plan scope; does not approve final outputs; cannot appoint itself as leader.	Code interpreter; sandboxed shell; file I/O (sandbox); APIs needed for execution.	Requests clarifications from planner; submits for review to reviewer.
Reviewer	Validate correctness, safety, and policy compliance; request fixes; approve/reject.	Does not execute tools; does not decompose tasks; cannot re-architect the plan.	None (LLM-only) or read-only tools.	Can send change requests to executor or planner; final approval authority.

Boundary-Centric Instructions

Each agent’s system prompt must explicitly encode boundaries and escalation rules. Make boundaries machine-checkable by:

Stating disallowed behaviors (“Do not execute code” for planner).
Referencing channels (“Use only the plan channel to emit plans”).
Embedding structured output schemas (JSON) that your orchestrator can validate.
Documenting escalation conditions with keywords that policies can recognize (“REQUEST_REPLAN”, “REQUEST_FIX”).

Tool Permission Scoping

Minimize tool surfaces to reduce attack area and control cost. Capabilities are granted per agent:

Planner: retrieval-only tools; no direct network or code execution.
Executor: code interpreter, sandbox file system; restricted external calls through vetted functions.
Reviewer: LLM-only or read-only validation tools (e.g., schema validators).

Preventing Role Conflicts and Role Drift

Preventing role conflicts is the spine of multi-agent reliability. When an agent oversteps its role—especially trying to coordinate others—the system’s invariants collapse. Protect against this at four layers:

Instruction-level constraints
Tool permission boundaries
Team-level policies (declarative and programmatic)
Runtime checks and auditing

1) Instruction-Level Constraints

Positively state responsibilities and negatively state prohibitions.
Use strong, unambiguous language: “You must not…”
Include structured outputs with explicit channels and message types.

2) Tool Boundaries

Design toolkits per role; never share executor-level tools with planner or reviewer.
Run-time deny tool invocations from non-privileged agents; return policy errors.

3) Team-Level Policies

Implement role enforcement policies that inspect messages for “leadership” phrases or attempt to delegate beyond the allowed graph. Use a fixed “leadership token” concept—only specific roles possess a virtual token to emit control-plane messages. Others are blocked by policy.

4) Runtime Checks and Auditing

All messages tagged with agent_id, role, channel, message_type.
Validate each message against the team policy at ingress and egress.
Emit metrics: role_conflict_detected, blocked_delegations, policy_violations.

Leader-Escalation and Role-Drift Detectors

Use simple but effective detectors:

Phrase detector: block messages containing “I will coordinate”, “I will act as manager”, “I will assign agents”.
Channel misuse: any executor message on control/plan channel triggers a violation.
Delegation graph check: verify target_of_delegation ∈ allowed_successors[agent.role].

Implementing Hierarchy and Delegation Patterns

A hierarchy creates clarity: who plans, who executes, who approves. Delegation patterns refine how work flows.

Topologies

Star (central planner): all execution routed from planner to executors; reviewer approves.
Tree (supervisors): intermediate supervisors manage sub-teams, allowing scale beyond a single planner.
DAG of specialists: tasks can be routed to different specialists; reviewer remains a terminal gate.

Delegation Rules

Planner can delegate to executor(s) and request reviewer validation of the plan.
Executor cannot delegate to planner except for clarifications (“CLARIFY” message).
Executor must submit artifacts to reviewer; reviewer either approves or requests fixes.
Only reviewer can signal “FINAL_APPROVAL”.

Termination Conditions

Reviewer issues FINAL_APPROVAL and returns artifacts to caller.
Planner explicitly cancels session with CANCEL(reason) if irreconcilable constraints arise.
Timeouts or policy violations can abort the run; artifacts stored for triage.

Communication Protocols Between Agents

Define strict envelopes and channels for inter-agent messages to make intent parseable and auditable.

Message Envelope

{
  "session_id": "uuid",
  "message_id": "uuid",
  "timestamp": "RFC3339",
  "from": {"agent_id": "uuid", "role": "planner|executor|reviewer"},
  "to": {"agent_id": "uuid|team", "role": "executor|reviewer|planner|team"},
  "channel": "plan|execute|review|control",
  "message_type": "PLAN|TASK|ARTIFACT|REVIEW_REQUEST|REVIEW|APPROVAL|REJECT|CLARIFY",
  "correlation_id": "uuid-of-parent",
  "payload": {},
  "policy": {
    "delegation_allowed": true|false,
    "tools_scope": ["code_interpreter", "..."],
    "safety_flags": ["no_network", "read_only"]
  }
}

Channels

plan: planner emits plans, updates, and re-plans.
execute: executor receives tasks and posts artifacts.
review: reviewer receives artifacts and emits REVIEW/APPROVAL/REJECT.
control: reserved for orchestration (session start, abort, metrics).

Structured Payload Schemas

Examples:

// PLAN payload
{
  "steps": [
    { "id": "S1", "description": "Load CSV", "acceptance": "Dataframe loaded" },
    { "id": "S2", "description": "Compute aggregates", "acceptance": "Summary table JSON" }
  ],
  "constraints": ["no_network", "max_runtime:120s"],
  "handoff_to": "executor"
}

// TASK payload
{
  "plan_step_id": "S1",
  "instructions": "Load CSV from provided content",
  "inputs": { "csv": "<string>" },
  "expected_outputs": ["dataframe_preview.png", "schema.json"]
}

// REVIEW payload
{
  "artifact_ids": ["A1", "A2"],
  "checks": ["correctness", "policy_compliance"],
  "verdict": "approve|reject",
  "notes": "..."
}

Error Handling and Resilience in Multi-Agent Systems

Error handling must be role-aware and policy-driven. Treat failures as first-class events with structured semantics.

Error Taxonomy

Tool Errors: failures in code interpreter, network, or external APIs.
Policy Violations: role conflicts, forbidden tools, unsafe outputs.
Protocol Errors: malformed message envelopes or channels misuse.
Timeouts: agent exceeds allocated wall-clock time.
Content Risks: hallucinations, PII leakage, toxicity, or unsafe code.

Strategies

Retries with jitter for transient tool errors; strict caps to avoid cost blowups.
Fallbacks: switch models or simplify tasks with reduced scope.
Circuit breakers: temporarily block tools or agents after repeated failures.
Escalation: notify planner for re-plan after executor failures; reviewer for policy violations.
Idempotency: include correlation_id to deduplicate retried steps.

Practical Examples with Python: 3-Agent Team (Planner, Executor, Reviewer)

This section provides end-to-end code. We present:

A team declared and run via agent-team primitives.
A compatible orchestration layer for environments without native agent-team support (helpful for testing and custom routing).

Prerequisites

Python 3.9+
OpenAI Python SDK (latest)
Optional: opentelemetry-sdk for tracing (used later)

Environment Setup

pip install openai opentelemetry-sdk pydantic rich

A) Declaring a Team with Agent-Team Primitives

The following example uses agent-team constructs to define roles, assemble a team, and run a session. If your SDK version differs, adapt names accordingly or use the compatibility orchestrator shown afterward.

from typing import Dict, Any, List, Optional
import json
import time
import uuid
from openai import OpenAI

client = OpenAI()

# 1) Define strict system prompts with role boundaries and structured outputs.
PLANNER_SYS = """
Role: Planner
You are responsible for decomposing the user's request into an executable plan of steps.
You must NOT execute code, call tools that write files, or assume leadership of other agents.
Output only JSON with the schema:
{
  "message_type": "PLAN",
  "steps": [
    { "id": "S<n>", "description": "...", "acceptance": "..." }
  ],
  "constraints": ["..."],
  "handoff_to": "executor"
}
Use only the 'plan' channel. Never emit control messages.
If execution fails, propose a minimal REPLAN with changed steps.
"""

EXECUTOR_SYS = """
Role: Executor
You perform the plan's steps using ONLY the tools granted to you.
You must NOT modify the plan's scope or assume leadership.
You must NOT emit PLAN messages. Use 'execute' channel exclusively.
On completion of each step, output JSON:
{
  "message_type": "ARTIFACT",
  "step_id": "S<n>",
  "artifacts": [{ "id": "A<n>", "type": "file|image|text|json", "uri": "..." }],
  "notes": "..."
}
"""

REVIEWER_SYS = """
Role: Reviewer
You validate correctness and policy compliance. You have no tool access.
You must NOT execute or modify the plan; do not create new tasks.
Output only JSON with schema:
{
  "message_type": "REVIEW",
  "verdict": "approve|reject",
  "issues": ["..."],
  "requests": ["..."]  # suggestions for fixes
}
Use the 'review' channel.
"""

# 2) Create agents with scoped tools.
# For illustration, the executor has a code interpreter. Planner and Reviewer are LLM-only.
planner = client.agents.create(
    name="Planner",
    model="gpt-4o-mini",
    instructions=PLANNER_SYS,
    tools=[],  # planner has no execution tools
    metadata={"role": "planner", "channels": ["plan"]}
)

executor = client.agents.create(
    name="Executor",
    model="gpt-4o",
    instructions=EXECUTOR_SYS,
    tools=[{"type": "code_interpreter"}],  # limited toolset
    metadata={"role": "executor", "channels": ["execute"]}
)

reviewer = client.agents.create(
    name="Reviewer",
    model="gpt-4o-mini",
    instructions=REVIEWER_SYS,
    tools=[],  # reviewer has no tools
    metadata={"role": "reviewer", "channels": ["review"]}
)

# 3) Define team-level policy for delegation and role enforcement.
# Policies include allowed delegations and prohibited phrases for leadership escalation.
team_policy = {
    "delegation_graph": {
        "planner": ["executor", "reviewer"],
        "executor": ["reviewer"],  # executor can only send to reviewer
        "reviewer": []             # reviewer never delegates
    },
    "prohibited_leadership_phrases": [
        "I will coordinate", "I will manage the team", "I will assign agents"
    ],
    "allowed_channels": {
        "planner": ["plan"],
        "executor": ["execute"],
        "reviewer": ["review"]
    },
    "deny_on_channel_misuse": True
}

# 4) Create the team with hierarchy encoded.
team = client.agent_teams.create(
    name="PER Team",
    members=[planner.id, executor.id, reviewer.id],
    hierarchy={
        "leader": planner.id,  # logical leader for initiating plans (not for execution)
        "review_gate": reviewer.id
    },
    policy=team_policy
)

# 5) Open a session and run a task.
session = client.agent_teams.sessions.create(
    team_id=team.id,
    metadata={"purpose": "demo-per", "user": "example_user"}
)

user_task = "Given a CSV of monthly sales (provided below), compute per-region totals and produce a brief summary. CSV:\nregion,month,sales\nEMEA,1,100\nNA,1,200\nEMEA,2,150\nNA,2,180\nAPAC,1,90\nAPAC,2,95"

# Start the run: planner generates the plan; executor executes; reviewer reviews.
run = client.agent_teams.runs.create(
    team_id=team.id,
    session_id=session.id,
    input=user_task,
    stream=True,  # stream events for fine-grained control
)

# 6) Consume stream and enforce runtime policies (defense in depth).
for event in run:
    # Typical event types: "message", "tool_call", "artifact", "policy_violation", "completed"
    et = event["type"]
    if et == "message":
        msg = event["data"]
        role = msg["from"]["role"]
        channel = msg["channel"]
        # Channel enforcement:
        if channel not in team_policy["allowed_channels"][role]:
            client.agent_teams.runs.action(
                team_id=team.id, session_id=session.id, action="block", reason="channel_misuse"
            )
            continue
        # Leadership phrase detection:
        text = msg.get("text", "")
        for phrase in team_policy["prohibited_leadership_phrases"]:
            if phrase.lower() in text.lower():
                client.agent_teams.runs.action(
                    team_id=team.id, session_id=session.id, action="block", reason="leadership_escalation"
                )
                break

    elif et == "tool_call":
        # Verify agent role vs tool permissions
        call = event["data"]
        role = call["from"]["role"]
        if role != "executor":
            client.agent_teams.runs.action(
                team_id=team.id, session_id=session.id, action="deny_tool", reason="role_not_allowed"
            )
        # else: allow; agent-team runtime will sandbox accordingly

    elif et == "artifact":
        # Forward artifacts to reviewer automatically as per policy
        artifact = event["data"]
        client.agent_teams.messages.create(
            team_id=team.id, session_id=session.id,
            to={"role": "reviewer", "agent_id": reviewer.id},
            channel="review",
            payload={
                "message_type": "REVIEW_REQUEST",
                "artifact_ids": [artifact["id"]],
                "notes": "Auto-submitted by policy"
            }
        )

    elif et == "policy_violation":
        # Log and potentially terminate
        violation = event["data"]
        print("Policy violation:", violation)
        client.agent_teams.runs.action(
            team_id=team.id, session_id=session.id, action="abort", reason="policy_violation"
        )
        break

    elif et == "completed":
        result = event["data"]
        print("Run completed. Summary:")
        print(json.dumps(result, indent=2))
        break

The example above demonstrates defense-in-depth: agent instructions, tool scoping, team policy, and live runtime checks. Even if a model tries to deviate, your orchestration code enforces boundaries.

B) Compatibility Orchestrator (Manual Routing with Agents)

If agent-team helpers are not available in your environment, you can implement the same semantics with explicit orchestration. This gives you flexibility in testing and makes the policy logic transparent.

from typing import Dict, Any, Tuple
import json
import uuid
from openai import OpenAI

client = OpenAI()

def make_message(from_role: str, to_role: str, channel: str, message_type: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    return {
        "session_id": str(uuid.uuid4()),
        "message_id": str(uuid.uuid4()),
        "from": {"role": from_role},
        "to": {"role": to_role},
        "channel": channel,
        "message_type": message_type,
        "payload": payload
    }

# Create three "agents" as call patterns with instructions and optional tools.
def call_planner(user_task: str) -> Dict[str, Any]:
    resp = client.responses.create(
        model="gpt-4o-mini",
        input=[
            {"role": "system", "content": PLANNER_SYS},
            {"role": "user", "content": f"USER_TASK:\n{user_task}\nOutput JSON only."}
        ],
    )
    text = resp.output_text
    return json.loads(text)

def call_executor(step: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
    # Provide the code interpreter tool via function-calling pattern
    tools = [{
        "type": "function",
        "function": {
            "name": "python_exec",
            "description": "Execute Python code in a sandbox and return stdout and generated files",
            "parameters": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"]
            }
        }
    }]
    messages = [
        {"role": "system", "content": EXECUTOR_SYS},
        {
            "role": "user", "content": json.dumps({
                "step": step,
                "context": context,
                "constraints": ["no_network"]
            })
        }
    ]
    # We will iteratively resolve tool calls until the model returns an ARTIFACT message.
    state = {"artifacts": []}
    while True:
        resp = client.responses.create(model="gpt-4o", input=messages, tools=tools, tool_choice="auto")
        for item in resp.output:
            if item.type == "message":
                content = getattr(item, "content", [])
                if content:
                    text = content[0].get("text", {}).get("value")
                    try:
                        data = json.loads(text)
                        if data.get("message_type") == "ARTIFACT":
                            return data
                    except Exception:
                        # If it's non-JSON text, continue loop
                        messages.append({"role": "assistant", "content": text})
            elif item.type == "tool_call":
                tc = item
                if tc.name == "python_exec":
                    code = tc.arguments.get("code", "")
                    # For demo, we simulate a sandbox. In production, secure a real sandbox.
                    try:
                        # DANGEROUS: For demonstration only. Replace with real sandbox.
                        local_ns = {}
                        exec(code, {}, local_ns)
                        result = str(local_ns.get("result", ""))
                    except Exception as e:
                        result = f"ERROR: {e}"
                    tool_result = {
                        "tool_name": "python_exec",
                        "result": result
                    }
                    messages.append({"role": "tool", "name": "python_exec", "content": json.dumps(tool_result)})
        # Loop to let the assistant produce ARTIFACT after handling tool outputs

def call_reviewer(artifacts: Dict[str, Any], plan: Dict[str, Any]) -> Dict[str, Any]:
    resp = client.responses.create(
        model="gpt-4o-mini",
        input=[
            {"role": "system", "content": REVIEWER_SYS},
            {"role": "user", "content": json.dumps({"plan": plan, "artifacts": artifacts})}
        ]
    )
    text = resp.output_text
    return json.loads(text)

def enforce_policy(from_role: str, to_role: str, channel: str, text: str) -> Tuple[bool, str]:
    # Basic enforcement: channels and leadership phrases
    if channel not in team_policy["allowed_channels"][from_role]:
        return False, "channel_misuse"
    for phrase in team_policy["prohibited_leadership_phrases"]:
        if phrase.lower() in text.lower():
            return False, "leadership_escalation"
    # Delegation graph
    if to_role not in team_policy["delegation_graph"][from_role] and to_role != "team":
        return False, "invalid_delegation"
    return True, "ok"

def run_session(user_task: str) -> Dict[str, Any]:
    # 1) Planner creates plan
    plan = call_planner(user_task)
    msg = make_message("planner", "executor", "plan", "PLAN", plan)
    ok, reason = enforce_policy("planner", "executor", "plan", json.dumps(plan))
    if not ok:
        return {"status": "aborted", "reason": reason}

    # 2) Executor runs steps
    artifacts = {"message_type": "ARTIFACTS", "items": []}
    for step in plan["steps"]:
        art = call_executor(step, context={"user_task": user_task})
        exec_msg = make_message("executor", "reviewer", "execute", "ARTIFACT", art)
        ok, reason = enforce_policy("executor", "reviewer", "execute", json.dumps(art))
        if not ok:
            return {"status": "aborted", "reason": reason}
        artifacts["items"].append(art)

    # 3) Reviewer validates
    review = call_reviewer(artifacts, plan)
    rev_msg = make_message("reviewer", "team", "review", "REVIEW", review)
    ok, reason = enforce_policy("reviewer", "team", "review", json.dumps(review))
    if not ok:
        return {"status": "aborted", "reason": reason}

    if review["verdict"] == "approve":
        return {"status": "ok", "artifacts": artifacts, "review": review}
    else:
        # Optionally trigger replanning
        repl = {"message_type": "REPLAN_REQUEST", "issues": review["issues"], "requests": review["requests"]}
        return {"status": "needs_replan", "review": review, "replan": repl}

if __name__ == "__main__":
    task = "Given a CSV of monthly sales (provided below), compute per-region totals and produce a brief summary. CSV:\nregion,month,sales\nEMEA,1,100\nNA,1,200\nEMEA,2,150\nNA,2,180\nAPAC,1,90\nAPAC,2,95"
    result = run_session(task)
    print(json.dumps(result, indent=2))

This orchestrator reinforces policy at every hop and shows how to make role boundaries explicit and machine-enforced. It is straightforward to substitute real sandboxes and production-grade tooling for the mocked function tool handler.

Role Boundary Highlights in the Example

Planner cannot call tools and only emits a PLAN JSON object on the plan channel.
Executor can call tools, but cannot send anything on the plan channel.
Reviewer cannot call tools and can only emit a REVIEW JSON object.
Delegation graph disallows executor→planner delegation except with explicit CLARIFY rules (left out here for brevity but easy to add as a channel and message_type).

Extensions

Multiple executors: route steps by capability tags (e.g., “python”, “nlp”, “viz”).
Reviewer variants: compliance reviewer vs. technical reviewer chained in sequence.
Caching: cache plan fragments and executor artifacts to reduce cost.
Long-running tasks: persist session state and artifacts in object storage.

Testing Multi-Agent Workflows

Testing is where multi-agent systems become tractable. Build confidence with multiple test layers:

Unit Tests for Policies and Prompts

Verify policy logic without incurring model calls.

import pytest

def test_enforce_policy_allows_valid_paths():
    text = json.dumps({"message_type": "PLAN"})
    ok, reason = enforce_policy("planner", "executor", "plan", text)
    assert ok and reason == "ok"

def test_enforce_policy_blocks_channel_misuse():
    text = json.dumps({"message_type": "ARTIFACT"})
    ok, reason = enforce_policy("executor", "reviewer", "plan", text)  # wrong channel
    assert not ok and reason == "channel_misuse"

def test_enforce_policy_blocks_leadership_escalation():
    text = "I will coordinate all agents to optimize throughput"
    ok, reason = enforce_policy("executor", "reviewer", "execute", text)
    assert not ok and reason == "leadership_escalation"

Simulation and Golden-File Tests

Record “golden” sessions for known inputs. Re-run after changes to detect regressions.
Simulate failures (tool crash, timeout, reviewer rejection) to test replan and retry logic.
Use seedable synthetic data to drive deterministic scenarios.

Property-Based Tests

Define invariants such as “no executor message on plan channel” and randomly generate message sequences to ensure invariants hold. This can catch rare race conditions and complex routing bugs early.

Behavioral Evaluations

Role discipline metric: fraction of messages that conform to channel-by-role constraints.
False-positive blocking rate: how often valid messages are incorrectly blocked by policy.
Task success rate: approved sessions over total sessions per task category.
Cost and latency distribution: ensure performance SLOs are met.

Monitoring, Logging, and Observability

Observability must be first-class in multi-agent systems. Without granular telemetry, debugging role drift and routing issues becomes guesswork.

Structured Logs

Log each message with standardized fields:

agent_id, role, channel, message_type
latency_ms, model, token_usage
policy_outcome, policy_reasons
tool_calls (name, args_hash, duration_ms)

import logging
import json
logger = logging.getLogger("agent_team")
logger.setLevel(logging.INFO)

def log_event(event_type: str, **kwargs):
    evt = {"event": event_type, **kwargs}
    logger.info(json.dumps(evt))

Distributed Tracing (OpenTelemetry)

Instrument agent calls and tool invocations with spans and attributes. Attach session_id and step ids for cohesive traces.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
processor = SimpleSpanProcessor(ConsoleSpanExporter())
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

def traced_call_executor(step, context):
    with tracer.start_as_current_span("executor.run") as span:
        span.set_attribute("step.id", step["id"])
        span.set_attribute("tools", "python_exec")
        result = call_executor(step, context)
        span.set_attribute("artifact.count", len(result.get("artifacts", [])))
        return result

Key Metrics

role_conflict_detected: count by role and channel.
policy_violations: aggregate reasons.
delegations_blocked: number of blocked delegations by edge.
approval_rate: reviewer approvals over attempts.
latency: p50/p95 for planner, executor, reviewer paths.
cost: tokens per role and per session.

Production Deployment Considerations

Moving a multi-agent team into production requires rigorous controls on performance, safety, and cost.

Scaling and Concurrency

Use asynchronous orchestration for concurrent tool calls and multi-executor parallelism.
Introduce backpressure: if reviewer becomes a bottleneck, queue review requests with priorities.
Sharding: route sessions by tenant or task type to separate worker pools.

Rate Limits and Quotas

Track token budgets per session; abort or degrade when approaching thresholds.
Implement exponential backoff for HTTP 429/5xx; surface partial results if permissible.

Security and Safety

Sandbox execution: never run arbitrary code in-process. Use isolated, memory-limited containers.
Network egress control: executor’s environment should have tight egress policies and allowlisted endpoints.
Secret management: inject credentials via short-lived tokens and scoped roles.
PII and compliance: scrub logs; apply redaction at the edge of your logging pipeline.

Cost Management

Choose models per role: reviewers can often use lighter models; executors may need premium models only for hard steps.
Cache plan steps and stable artifacts; do not recompute unless inputs changed.
Track per-role token usage and set alerts for regressions.

Versioning and Change Control

Version agent definitions (prompts, tools) and team policies.
Use feature flags or AB tests to roll out changes gradually.
Maintain migration scripts for session schemas and artifact stores.

Advanced Guardrails

Output schemas and strong validators for artifacts (e.g., strict JSON schemas).
Static policy scans of prompts to detect contradictions or leaky instructions.
Semantic diffing: compare plan versions to detect unexpected scope changes.

Disaster Recovery and Graceful Degradation

Persist session state and artifacts; support replay for triage.
Fallback to single-agent mode with constrained scope if teams are unavailable.
Offer human-in-the-loop escalation for high-stakes flows.

Worked Example Deep Dive: Enforcing Boundaries End-to-End

This deep dive expands on the PER pattern to highlight specific enforcement points and data flows. The goal is to illustrate how you can guarantee, with high probability, that:

No planner executes tools.
No executor emits plan messages or asserts leadership.
No reviewer performs execution or re-planning.

Prompt Design and Structured Outputs

Each role has a schema-driven output. Your orchestrator should reject messages that do not conform. Example schema validators (conceptual):

from pydantic import BaseModel, Field, ValidationError
from typing import List

class PlanStep(BaseModel):
    id: str
    description: str
    acceptance: str

class Plan(BaseModel):
    message_type: str = Field("PLAN", const=True)
    steps: List[PlanStep]
    constraints: List[str]
    handoff_to: str = Field("executor", const=True)

class Artifact(BaseModel):
    message_type: str = Field("ARTIFACT", const=True)
    step_id: str
    artifacts: List[dict]
    notes: str

class Review(BaseModel):
    message_type: str = Field("REVIEW", const=True)
    verdict: str  # "approve" or "reject"
    issues: List[str]
    requests: List[str]

def validate_message(role: str, payload: dict):
    try:
        if role == "planner":
            Plan(**payload)
        elif role == "executor":
            Artifact(**payload)
        elif role == "reviewer":
            Review(**payload)
        else:
            raise ValueError("unknown role")
        return True, None
    except ValidationError as e:
        return False, str(e)

Policy Hooks on Ingress and Egress

Place policy checks at two boundaries:

Ingress (before a role consumes a message) to prevent invalid delegations and channel misuse.
Egress (before a role’s output publishes) to stop leadership escalation and forbidden content.

Together with schema validation, this provides robust, layered defense.

Timeouts, Retries, and Idempotency

Step-level timeouts: tune by tool class (e.g., 15s for retrieval, 60–120s for execution).
Retries: use bounded retry counts and exponential backoff; avoid re-executing non-idempotent steps.
Idempotency keys: tag steps with correlation_id; store results in an artifact cache keyed by hash(inputs).

Backpressure and Flow Control

Limit concurrent tool calls per session and globally; enforce quotas by role.
Queue review requests if reviewer latency becomes a bottleneck; process FIFO with priority for critical tasks.

Advanced Communication Patterns

For complex teams, consider:

Topic-based routing: tag messages with topic (“data_prep”, “analysis”, “writeup”).
Capability-based discovery: executors register capabilities; planner delegates by capability match.
Audit channels: reviewer emits structured rationales into a compliance log.

Sample End-to-End Exchange

// 1) Planner emits PLAN
{ "channel":"plan", "from":{"role":"planner"}, "to":{"role":"executor"},
  "message_type":"PLAN", "payload":{ "steps":[...], "constraints":["no_network"] } }

// 2) Executor acknowledges and executes step S1
{ "channel":"execute", "from":{"role":"executor"}, "to":{"role":"reviewer"},
  "message_type":"ARTIFACT", "payload":{ "step_id":"S1", "artifacts":[...]} }

// 3) Reviewer rejects with fix requests
{ "channel":"review", "from":{"role":"reviewer"}, "to":{"role":"executor"},
  "message_type":"REVIEW", "payload":{ "verdict":"reject", "requests":["normalize headers"]} }

// 4) Executor applies fix and resubmits
{ "channel":"execute", "from":{"role":"executor"}, "to":{"role":"reviewer"},
  "message_type":"ARTIFACT", "payload":{ "step_id":"S1", "artifacts":[...]} }

// 5) Reviewer approves; team returns final artifact
{ "channel":"review", "from":{"role":"reviewer"}, "to":{"role":"team"},
  "message_type":"REVIEW", "payload":{ "verdict":"approve", "issues":[], "requests":[] } }

Tooling Guardrails for the Executor

Executors need strong guardrails to keep execution safe:

Real sandbox: containerized environment, CPU and memory quotas, network policies, filesystem isolation.
Tool adapters: whitelist of allowed functions with strict input schemas and output sanitization.
Content filters: block code that exfiltrates secrets, spawns long-running processes, or writes outside sandbox.

In addition, record all code snippets and outputs for audit; this is invaluable during incident response.

Integrating with External Systems

Multi-agent teams often need to interact with databases, APIs, and internal systems. Keep the integration surface minimal and explicit:

Wrap external APIs as first-party tools with typed schemas and business-purpose descriptions.
Limit credentials to the executor, scoped to the least privilege necessary.
Use read-only APIs for the planner and reviewer.

Example: Safe Data Access Tool (Executor Only)

SAFE_QUERY_TOOL = {
    "type": "function",
    "function": {
        "name": "safe_sql_query",
        "description": "Execute parameterized read-only SQL queries",
        "parameters": {
            "type": "object",
            "properties": {
                "sql": {"type": "string", "description": "SELECT-only query with LIMIT"},
                "params": {"type": "object"}
            },
            "required": ["sql"]
        }
    }
}

Attach this tool only to the executor. The reviewer can be given a “schema validator” tool that accepts artifacts and checks them for adherence to a JSON schema, but cannot directly query data.

Governance: Ownership, Audits, and Compliance

Treat agent definitions, team policies, and routing logic as core infrastructure subject to review and change control.

Code-review agent prompts and policies; lint for contradictions and missing constraints.
Run policy audits: ensure no role has broader tool access than documented.
Maintain audit trails: immutable logs of plans, executions, reviews, and policy decisions.

FAQ: Common Pitfalls and Remedies

Planner generating code? Remove any execution tools from planner; reinforce instructions; add a denylist phrase detector.
Executor attempting to “optimize the plan”? Add schema checks to reject PLAN output from executor; block plan channel for executor.
Reviewer stalling? Set explicit SLAs and implement a watchdog timer to escalate to a human or auto-approve low-risk steps.
Escalation loops? Enforce delegation DAG; reject cycles on ingress; cap replan iterations.

For deeper methodologies on prompt discipline and robust evaluation frameworks, see For a deeper exploration of this topic, our comprehensive guide on The Codex Microservices Playbook: 20 Prompts for Designing, Implementing, and Testing Distributed Systems provides detailed strategies and implementation frameworks that complement the approaches discussed in this section. and For a deeper exploration of this topic, our comprehensive guide on The Complete Guide to Google’s Agent2Agent Protocol and OpenAI Codex Interoperability: Building Cross-Platform AI Agent Systems provides detailed strategies and implementation frameworks that complement the approaches discussed in this section..

Appendix: Expanded Code for a Full PER Team Session

Below is a more complete script that integrates schema validation, policy enforcement, and basic telemetry. It uses the compatibility orchestrator to keep the logic fully explicit.

import json
import time
import uuid
from typing import Dict, Any, List, Tuple
from pydantic import ValidationError

from openai import OpenAI
client = OpenAI()

def now_ms():
    return int(time.time() * 1000)

def policy_check_message(role: str, to_role: str, channel: str, payload: Dict[str, Any]) -> Tuple[bool, str]:
    text = json.dumps(payload)
    ok, reason = enforce_policy(role, to_role, channel, text)
    if not ok:
        return ok, reason
    valid, err = validate_message(role, payload)
    if not valid:
        return False, f"schema_violation: {err}"
    return True, "ok"

def orchestrate_per(user_task: str) -> Dict[str, Any]:
    session_id = str(uuid.uuid4())
    t0 = now_ms()
    # 1) Plan
    plan = call_planner(user_task)
    ok, reason = policy_check_message("planner", "executor", "plan", plan)
    if not ok:
        return {"status": "aborted", "stage": "plan", "reason": reason}

    # 2) Execute
    artifacts = {"message_type": "ARTIFACTS", "session_id": session_id, "items": []}
    for step in plan["steps"]:
        step_t0 = now_ms()
        art = call_executor(step, context={"user_task": user_task})
        ok, reason = policy_check_message("executor", "reviewer", "execute", art)
        if not ok:
            return {"status": "aborted", "stage": "execute", "step": step["id"], "reason": reason}
        art["latency_ms"] = now_ms() - step_t0
        artifacts["items"].append(art)

    # 3) Review
    review = call_reviewer(artifacts, plan)
    ok, reason = policy_check_message("reviewer", "team", "review", review)
    if not ok:
        return {"status": "aborted", "stage": "review", "reason": reason}

    if review["verdict"] == "approve":
        return {
            "status": "ok",
            "latency_ms": now_ms() - t0,
            "plan": plan,
            "artifacts": artifacts,
            "review": review
        }
    else:
        # Merge requested fixes into a new plan request
        replan_req = {
            "message_type": "REPLAN_REQUEST",
            "issues": review.get("issues", []),
            "requests": review.get("requests", [])
        }
        return {
            "status": "needs_replan",
            "latency_ms": now_ms() - t0,
            "plan": plan,
            "artifacts": artifacts,
            "review": review,
            "replan_request": replan_req
        }

if __name__ == "__main__":
    user_csv = "region,month,sales\nEMEA,1,100\nNA,1,200\nEMEA,2,150\nNA,2,180\nAPAC,1,90\nAPAC,2,95"
    task = f"Given this CSV, compute per-region totals and a short summary.\nCSV:\n{user_csv}"
    result = orchestrate_per(task)
    print(json.dumps(result, indent=2))

Conclusion and Next Steps

Multi-agent teams unlock reliability, scale, and clarity when building complex AI workflows. OpenAI’s agent-team feature formalizes these patterns by giving you first-class roles, delegation rules, and policy enforcement. The Planner–Executor–Reviewer triad provides a robust default architecture:

Planner defines the work with clear acceptance criteria.
Executor performs tool-mediated steps within a strict sandbox.
Reviewer enforces correctness and policy compliance, acting as the project’s gatekeeper.

Preventing role conflicts is not one technique but a layered approach: system prompts, tool scoping, team policies, and runtime checks. With structured communication schemas, strong error handling, and comprehensive observability, you can operate multi-agent teams confidently in production.

As you expand, consider more sophisticated topologies (multiple executors, specialist reviewers), integrate with enterprise systems via tightly-scoped tools, and invest in testing harnesses that simulate edge cases. For broader architectural guidance and adjacent techniques, explore For a deeper exploration of this topic, our comprehensive guide on The Codex Application Modernization Playbook: 15 Prompts for Migrating Legacy Systems to Cloud-Native Architecture provides detailed strategies and implementation frameworks that complement the approaches discussed in this section. and For a deeper exploration of this topic, our comprehensive guide on The Complete Guide to Google’s Agent2Agent Protocol and OpenAI Codex Interoperability: Building Cross-Platform AI Agent Systems provides detailed strategies and implementation frameworks that complement the approaches discussed in this section..

Access 40,000+ AI Prompts for ChatGPT, Claude & Codex — Free!

Subscribe to get instant access to our complete Notion Prompt Library — the largest curated collection of prompts for ChatGPT, Claude, OpenAI Codex, and other leading AI models. Optimized for real-world workflows across coding, research, content creation, and business.

Subscribe & Get Free Access →

Markos Symeonides

Why ChatGPT’s Futures Class of 2026 Signals OpenAI’s Pivot to Developer Education — And What It Means for the AI Talent Pipeline

Reading Time: 21 minutes

Why ChatGPT’s Futures Class of 2026 Signals OpenAI’s Pivot to Developer Education — And What It Means for the AI Talent Pipeline For years, OpenAI’s strategy was straightforward: ship increasingly capable models, wrap them in a stable API, and let…

The Codex Database Engineering Playbook: 20 Prompts for Schema Design, Query Optimization, Migration Scripts, and Data Pipeline Automation

Reading Time: 30 minutes

The Codex Database Engineering Playbook: 20 Prompts for Schema Design, Query Optimization, Migration Scripts, and Data Pipeline Automation This playbook provides twenty production-grade, copy-ready Codex prompts to accelerate high-impact database engineering tasks across schema design, query optimization, migration engineering, and…

45 ChatGPT-5.5 Prompts for Technical Writers: API Documentation, SDK Guides, Release Notes, and Developer Tutorials

Reading Time: 37 minutes

45 ChatGPT-5.5 Prompts for Technical Writers: API Documentation, SDK Guides, Release Notes, and Developer Tutorials This masterclass provides 45 production-ready ChatGPT-5.5 prompts crafted specifically for technical writers and developer experience teams. Each prompt is detailed, contextualized, and accompanied by usage…

The 2026 AI Coding Agent Comparison: Cursor vs Claude Code vs GitHub Copilot vs OpenAI Codex vs Windsurf vs Devin

Reading Time: 23 minutes

The 2026 AI Coding Agent Comparison: Cursor vs Claude Code vs GitHub Copilot Agent Mode vs OpenAI Codex vs Windsurf vs Devin The coding agent ecosystem in 2026 has moved from autocomplete-centric tooling to deeply agentic systems that can plan,…

How to Build Multi-Agent Teams with OpenAI’s Agent-Team Feature: Preventing Role Conflicts and Managing Agent Hierarchy

How to Build Multi-Agent Teams with OpenAI’s Agent-Team Feature: Preventing Role Conflicts and Managing Agent Hierarchy

Table of Contents

Introduction: Why Multi-Agent Teams and Why Now

Multi-Agent Architectures: Concepts and Patterns

Common Patterns

Key Risks in Multi-Agent Designs

OpenAI’s Agent-Team Feature: Concepts, Capabilities, and Mechanics

Core Concepts

Lifecycle

SDK Snapshot (Python)

Designing Agent Roles and Responsibilities

Boundary-Centric Instructions

Tool Permission Scoping

Preventing Role Conflicts and Role Drift

1) Instruction-Level Constraints

2) Tool Boundaries

3) Team-Level Policies

4) Runtime Checks and Auditing

Leader-Escalation and Role-Drift Detectors

Implementing Hierarchy and Delegation Patterns

Topologies

Delegation Rules

Termination Conditions

Communication Protocols Between Agents

Message Envelope

Channels

Structured Payload Schemas

Error Handling and Resilience in Multi-Agent Systems

Error Taxonomy

Strategies

Practical Examples with Python: 3-Agent Team (Planner, Executor, Reviewer)

Prerequisites

Environment Setup

A) Declaring a Team with Agent-Team Primitives

B) Compatibility Orchestrator (Manual Routing with Agents)

Role Boundary Highlights in the Example

Extensions

Testing Multi-Agent Workflows

Unit Tests for Policies and Prompts

Simulation and Golden-File Tests

Property-Based Tests

Behavioral Evaluations

Monitoring, Logging, and Observability

Structured Logs

Distributed Tracing (OpenTelemetry)

Key Metrics

Production Deployment Considerations

Scaling and Concurrency

Rate Limits and Quotas

Security and Safety

Cost Management

Versioning and Change Control

Advanced Guardrails

Disaster Recovery and Graceful Degradation

Worked Example Deep Dive: Enforcing Boundaries End-to-End

Prompt Design and Structured Outputs

Policy Hooks on Ingress and Egress

Timeouts, Retries, and Idempotency

Backpressure and Flow Control

Advanced Communication Patterns

Sample End-to-End Exchange

Tooling Guardrails for the Executor

Integrating with External Systems

Example: Safe Data Access Tool (Executor Only)

Governance: Ownership, Audits, and Compliance

FAQ: Common Pitfalls and Remedies

Appendix: Expanded Code for a Full PER Team Session

Conclusion and Next Steps

Access 40,000+ AI Prompts for ChatGPT, Claude & Codex — Free!

Get Free Access to 40,000+ AI Prompts for ChatGPT, Claude & Codex

More on this