How to Build a Code Review Bot with Claude Sonnet 4.6 in 2026: Step-by-Step
⚡ TL;DR — Key Takeaways
- What it is: A comprehensive, production-ready guide to building an automated code review bot utilizing Claude Sonnet 4.6’s advanced capabilities, covering webhook ingestion, diff parsing, structured prompt design, and GitHub API integration.
- Who it’s for: Backend and platform engineers aiming to reduce pull request (PR) review latency, handle AI-generated code diffs at scale, and deploy a reliable automated reviewer that integrates seamlessly into their workflow.
- Key takeaways: Leveraging Claude Sonnet 4.6’s 1M-token context window eliminates chunking complexities; prompt design drives 70% of review quality; a modular five-stage pipeline ensures scalable, testable components.
- Pricing/Cost: At $3 per million input tokens and $15 per million output tokens, Claude Sonnet 4.6 is an economical, high-quality option for teams processing thousands of diffs weekly, at roughly 60% of the cost of Claude Opus 4.7 ($5/$25 in the comparison table).
- Bottom line: For teams shipping AI-generated code at scale, automated review is essential. This guide equips you to build a working version within a day and a hardened deployment in a week.
Why Code Review Bots Became Essential in 2026
In 2026, software engineering workflows have evolved rapidly, with AI-assisted coding tools such as GitHub Copilot contributing an estimated 40-60% of committed code changes. According to GitHub’s Octoverse report for 2026, the median pull request (PR) review latency across enterprise repositories is 19 hours, creating a bottleneck in continuous delivery pipelines.
Human reviewers spend a significant portion of their time scrutinizing subtle bugs in AI-generated diffs, a task well-suited for automation. Code review bots, especially those leveraging advanced large language models (LLMs), have become table stakes in modern engineering organizations to maintain velocity without compromising code quality.
Claude Sonnet 4.6, released by Anthropic in February 2026, stands out as an optimal model for this application due to its balance of cost, context window size, and accuracy. It offers a 1 million token context window in beta, enabling it to analyze entire microservices in a single request, a crucial capability for comprehensive code reviews.
Additionally, Claude Sonnet 4.6 is priced at $3 per million input tokens and $15 per million output tokens, about 60% of the cost of Claude Opus 4.7 at the $5/$25 rates listed in the comparison table below, making it economically viable for bots processing thousands of diffs weekly.
This guide walks you through building a production-grade code review bot using Claude Sonnet 4.6. We’ll cover everything from webhook ingestion and diff parsing to structured prompting and posting comments via the GitHub API. By the end, you’ll understand how to ship a working prototype in a day and a hardened deployment within a week.
Prerequisites: You will need a GitHub organization with admin access, an Anthropic API key with Sonnet 4.6 enabled, Python 3.11+, a publicly accessible HTTPS endpoint (Cloudflare Workers, AWS Lambda, Google Cloud Run, Fly.io, etc.), and a commitment to iterative prompt tuning.
Prompt design is critical: as a rule of thumb, model selection accounts for about 30% of review quality, while the remaining 70% depends on how you assemble context and craft prompts.
For more on practical implementation patterns, check out our deep dive in Multi-Agent Prompting with Claude 4.6 Sonnet: Building AI Code Review Pipelines That Actually Work.
Architecture Overview: How the Bot Views a Pull Request
The bot architecture employs a modular five-stage pipeline, each component independently testable and replaceable. This separation simplifies debugging and facilitates iterative improvements.
- Webhook Ingestion: GitHub sends pull_request events (actions: opened, synchronize, ready_for_review) to your HTTPS endpoint. The bot verifies the HMAC signature (via X-Hub-Signature-256) to authenticate payloads.
- Diff and Context Fetch: The bot retrieves the unified diff, full versions of changed files, PR metadata, linked issues, and recent commit history. Pulling full file content is crucial to provide the model with sufficient context for accurate analysis.
- Prompt Assembly: A system prompt defines the reviewer persona, priorities, and output schema. The user message includes the diff, full file contents, repo metadata, and the team’s review checklist. Prompt caching is used aggressively on static parts to reduce costs.
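- Model Call: The assembled prompt is sent to Claude Sonnet 4.6 with JSON schema constraints on output and max tokens set to 8000. Simple PRs use temperature 0.2; complex PRs enable extended thinking with a 4000-token budget and leave temperature at its default, because Anthropic's extended thinking documentation lists custom temperature values as incompatible with thinking.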
- Comment Posting: The bot parses the structured JSON response, deduplicates findings against prior comments, and posts line-anchored review comments via GitHub’s pull request reviews API.
Context Definition: Unlike naive bots that send only diffs, this bot sends full changed files plus relevant dependents such as callers, test files, and style guides (e.g., CONTRIBUTING.md). With Sonnet 4.6’s 1M-token window, this context fits comfortably for PRs with up to ~3000 changed lines, which covers about 95% of real-world PRs.
Handling Large PRs: For massive refactors or generated code, the bot employs triaging. It skips lockfiles (e.g., package-lock.json) and generated code markers (@generated, **/generated/**). For very large PRs, it first summarizes files with Claude Haiku 4.5, then reviews only high-risk files with Sonnet 4.6.
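The skip rules described above can be sketched as a small path filter. This is a minimal illustration, not the author's implementation: the function name `should_skip` and the `head` argument (the first few hundred characters of a file, if you have them) are assumptions made for the example, and only the patterns named in the text are included.

```python
from fnmatch import fnmatchcase

# Patterns taken from the text above; extend them for your own repositories.
SKIP_NAMES = {"package-lock.json"}
SKIP_GLOBS = ("**/generated/**",)
GENERATED_MARKER = "@generated"


def should_skip(path: str, head: str = "") -> bool:
    """Return True for lockfiles, generated directories, or files marked @generated."""
    name = path.rsplit("/", 1)[-1]
    if name in SKIP_NAMES:
        return True
    # fnmatch's "*" also matches "/", so "**/generated/**" covers nested paths.
    # Prefixing "/" lets the same glob match a top-level "generated/" directory.
    if any(fnmatchcase("/" + path, glob) for glob in SKIP_GLOBS):
        return True
    return GENERATED_MARKER in head
```

Running this filter before fetching file contents keeps skipped files out of both the Haiku summary pass and the Sonnet review pass.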
Incremental Reviews: To avoid re-reviewing unchanged code or its own comments, the bot tracks last reviewed commit SHAs in a KV store (Redis, DynamoDB, Cloudflare KV). It processes only incremental diffs on subsequent pushes, reducing costs by 60-80% and avoiding inconsistent feedback.
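One way to picture the SHA tracking is the sketch below. It assumes a plain dict standing in for Redis, DynamoDB, or Cloudflare KV, and the class and method names (`ReviewState`, `compare_range`, `mark_reviewed`) are hypothetical.

```python
from typing import Optional


class ReviewState:
    """Tracks the last reviewed head SHA per PR. A dict stands in for the KV store."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def compare_range(self, repo: str, number: int, head_sha: str) -> Optional[tuple[str, str]]:
        """Return (base, head) to diff, or None if head_sha was already reviewed.

        base is "" on the first review, meaning "review the full PR diff".
        """
        last = self._store.get(f"{repo}#{number}")
        if last == head_sha:
            return None
        return (last or "", head_sha)

    def mark_reviewed(self, repo: str, number: int, head_sha: str) -> None:
        """Record head_sha as reviewed once comments have been posted."""
        self._store[f"{repo}#{number}"] = head_sha
```

When `base` is non-empty, the incremental diff can be requested from GitHub's compare endpoint (`/repos/{owner}/{repo}/compare/{base}...{head}`). After a force-push the stored SHA may no longer be reachable, so falling back to a full review in that case is a reasonable default.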
Why Claude Sonnet 4.6?
| Model | Input $/M | Output $/M | SWE-bench Verified | Context Window | Best For |
|---|---|---|---|---|---|
| Claude Sonnet 4.6 | $3 | $15 | 77.2% | 1M tokens (beta) | General code review, large repos |
| Claude Opus 4.7 | $5 | $25 | 81.4% | 500K tokens | Security-critical reviews |
| Claude Haiku 4.5 | $1 | $5 | 68.9% | 200K tokens | Lint-style fast checks |
| GPT-5.2-codex | $1.25 | $10 | 74.5% | 400K tokens | Codex-trained, OpenAI shops |
| GPT-5.5 | $5 | $30 | 78.6% | 1.05M tokens | Long-context reasoning, premium |
| Gemini 3.1 Pro Preview | $2 | $12 | 72.1% | 1M tokens | Cost-sensitive, GCP-native |
Claude Sonnet 4.6 excels in cost-to-quality ratio for general code review because of Anthropic’s training data emphasis on software engineering. It reliably detects subtle bugs like off-by-one errors, missing error handling, and concurrency issues more effectively than similarly priced alternatives.
Claude Opus 4.7 is recommended for security-critical code such as authentication or payment processing, where a missed bug can have catastrophic consequences. For most other use cases, Sonnet 4.6 offers the best balance of price and performance.
Step-by-Step Build: From Webhook to Posted Review
Below is a minimal working Python implementation of the code review bot; the hardening needed for production is covered in a later section. The example uses FastAPI with Uvicorn but can be adapted to other HTTP runtimes like AWS Lambda or Cloudflare Workers.
Step 1: Set Up the GitHub App
Create a GitHub App (not a personal access token) for better scalability and scoped permissions.
- Permissions: Pull requests (read & write), Contents (read), Metadata (read)
- Events: Subscribe to Pull request events
- Generate a private key and webhook secret; securely store these in your secret manager
Install the app on target repositories. Each installation yields an installation_id. Mint short-lived installation access tokens by signing a JWT with your private key and exchanging it at /app/installations/{id}/access_tokens. Tokens expire after one hour and should be cached with a 50-minute TTL.
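The 50-minute TTL can be sketched as a small cache in front of whatever function mints the token. This is an illustration under assumptions: `InstallationTokenCache` and the injected `fetch` callable are hypothetical names, and the JWT signing and HTTP exchange are left to `fetch` rather than shown.

```python
import time
from typing import Callable

TOKEN_TTL_SECONDS = 50 * 60  # refresh 10 minutes before the one-hour expiry


class InstallationTokenCache:
    """Caches installation tokens per installation_id for TOKEN_TTL_SECONDS."""

    def __init__(self, fetch: Callable[[int], str],
                 clock: Callable[[], float] = time.monotonic) -> None:
        # fetch is assumed to sign the app JWT and call the access_tokens endpoint.
        self._fetch = fetch
        self._clock = clock
        self._cache: dict[int, tuple[str, float]] = {}

    def get(self, installation_id: int) -> str:
        now = self._clock()
        entry = self._cache.get(installation_id)
        if entry is not None and now < entry[1]:
            return entry[0]  # still fresh
        token = self._fetch(installation_id)
        self._cache[installation_id] = (token, now + TOKEN_TTL_SECONDS)
        return token
```

The cache is synchronous and in-process; with multiple workers, the same idea applies to a shared store with a key expiry.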
Step 2: The Webhook Handler
import hmac, hashlib, os, json
from fastapi import FastAPI, Request, HTTPException, BackgroundTasks
import httpx
from anthropic import Anthropic

app = FastAPI()
anthropic = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
WEBHOOK_SECRET = os.environ["GH_WEBHOOK_SECRET"].encode()

def verify_signature(payload: bytes, signature: str) -> bool:
    # Recompute the HMAC-SHA256 of the raw body and compare in constant time.
    expected = "sha256=" + hmac.new(WEBHOOK_SECRET, payload, hashlib.sha256).hexdigest()
    # Compare as bytes so a non-ASCII header value fails cleanly instead of raising.
    return hmac.compare_digest(expected.encode(), signature.encode())

@app.post("/webhook")
async def webhook(request: Request, bg: BackgroundTasks):
    # Verify against the raw bytes, before any JSON parsing.
    body = await request.body()
    sig = request.headers.get("X-Hub-Signature-256", "")
    if not verify_signature(body, sig):
        raise HTTPException(401, "Bad signature")
    event = request.headers.get("X-GitHub-Event")
    payload = json.loads(body)
    if event == "pull_request" and payload["action"] in ("opened", "synchronize", "ready_for_review"):
        if payload["pull_request"]["draft"]:
            return {"skipped": "draft"}
        # Run the review after the response is sent so GitHub gets a fast 2xx.
        bg.add_task(review_pr, payload)
    return {"ok": True}
Note: The background task offloads model calls since GitHub expects a 2xx response within 10 seconds. Model inference can take 15-90 seconds for complex PRs.
Step 3: Fetch the Diff and File Context
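from urllib.parse import quote

async def fetch_pr_context(payload, token):
    repo = payload["repository"]["full_name"]
    number = payload["pull_request"]["number"]
    head_sha = payload["pull_request"]["head"]["sha"]
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/vnd.github+json"}
    async with httpx.AsyncClient(timeout=30) as client:
        # Requesting the PR with the diff media type returns the unified diff.
        diff = (await client.get(
            f"https://api.github.com/repos/{repo}/pulls/{number}",
            headers={**headers, "Accept": "application/vnd.github.v3.diff"}
        )).text
        # Only the first page (up to 100 files) is fetched; paginate for larger PRs.
        files_resp = await client.get(
            f"https://api.github.com/repos/{repo}/pulls/{number}/files?per_page=100",
            headers=headers
        )
        files_resp.raise_for_status()
        files = files_resp.json()
        full_files = {}
        for f in files:
            # Skip deleted files, lockfiles, and minified bundles.
            if f["status"] == "removed" or f["filename"].endswith((".lock", ".min.js")):
                continue
            # Skip very large changes; these are usually generated or vendored.
            if f["changes"] > 2000:
                continue
            # Read the file at the PR head through the contents API with the raw
            # media type. The raw_url field redirects to another host, and httpx
            # does not follow redirects by default.
            resp = await client.get(
                f"https://api.github.com/repos/{repo}/contents/{quote(f['filename'])}",
                params={"ref": head_sha},
                headers={**headers, "Accept": "application/vnd.github.raw+json"}
            )
            if resp.status_code == 200:
                full_files[f["filename"]] = resp.text
    return {"diff": diff, "files": full_files, "pr": payload["pull_request"]}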
Fetching full files rather than just diffs provides the model richer context to detect incorrect API usage and subtle bugs.
For cost-quality trade-offs and detailed analysis, see Claude Opus 4.7 for Production AI Code Review in 2026.
Step 4: Build the Prompt and Call Sonnet 4.6
Prompt engineering is the cornerstone of effective automated code reviews. Be explicit about priorities, output format, and what to ignore.
SYSTEM_PROMPT = """You are a senior staff engineer reviewing a pull request. Your job is to catch real bugs, security issues, and design problems, not to nitpick style.

REVIEW PRIORITIES (in order):
1. Correctness bugs: off-by-one, null dereference, race conditions, wrong API usage
2. Security: injection, auth bypass, secrets in code, unsafe deserialization
3. Resource issues: leaks, unbounded loops, N+1 queries, missing timeouts
4. Design: violations of existing patterns in the repo, unclear abstractions
5. Test gaps: untested error paths, missing edge cases

DO NOT COMMENT ON:
- Code style or formatting (linters handle this)
- Naming preferences unless genuinely misleading
- Suggestions that are stylistic rather than substantive
- Code that was not changed in this PR (unless directly relevant)

For each finding, output a JSON object with: file, line, severity (blocker|major|minor), category, comment, suggested_fix (optional).

Return ONLY a JSON object: {"summary": "...", "findings": [...]}. If the PR looks good, return an empty findings array with a brief positive summary."""
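import asyncio  # keeps the blocking SDK call off the event loop

async def review_pr(payload):
    token = await get_installation_token(payload["installation"]["id"])
    ctx = await fetch_pr_context(payload, token)
    files_block = "\n\n".join(
        f"=== {name} ===\n{content}" for name, content in ctx["files"].items()
    )
    user_message = f"""PR Title: {ctx['pr']['title']}
PR Description: {ctx['pr']['body'] or '(none)'}

UNIFIED DIFF:
{ctx['diff']}

FULL CONTENT OF CHANGED FILES:
{files_block}
"""
    # The client from Step 2 is synchronous, so run the request in a worker
    # thread instead of blocking the FastAPI event loop for 15-90 seconds.
    response = await asyncio.to_thread(
        anthropic.messages.create,
        model="claude-sonnet-4-6-20260215",
        max_tokens=8000,
        # temperature is left at its default here: Anthropic's extended thinking
        # documentation lists custom temperature values as incompatible with
        # thinking. Pass temperature=0.2 only on calls that omit `thinking`.
        system=[
            {"type": "text", "text": SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}
        ],
        messages=[{"role": "user", "content": user_message}],
        thinking={"type": "enabled", "budget_tokens": 4000}
    )
    # With thinking enabled, thinking blocks come first; the last block is the text answer.
    review = json.loads(response.content[-1].text)
    await post_review(ctx, review, token)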
Key details:
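- Prompt Caching: The system prompt is marked for caching. Cache reads are billed at a fraction of the normal input price, which cuts the cost of the cached portion by ~90% on repeat calls. Anthropic documents a minimum cacheable prompt length, so the savings only materialize once the static prefix (system prompt plus checklist or style guide) is long enough.
- Extended Thinking: Enabling it with a 4000-token budget improves detection of subtle bugs.
- Temperature: Use a low value such as 0.2 on calls without extended thinking to keep findings largely consistent between runs. Anthropic's extended thinking documentation lists custom temperature values as incompatible with thinking, so leave temperature at its default when thinking is enabled.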
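The model is asked to return only JSON, but a defensive parser is a cheap safeguard against a wrapped or malformed reply. The following is a minimal sketch: `parse_review` and its fallback behavior are assumptions for illustration, and the field names follow the system prompt above.

```python
import json
import re

ALLOWED_SEVERITIES = {"blocker", "major", "minor"}
FENCE = "`" * 3  # built indirectly so this example contains no literal fence
FENCED_JSON = re.compile(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", re.DOTALL)


def parse_review(raw: str) -> dict:
    """Parse the model reply into {"summary", "findings"}, dropping malformed findings."""
    text = raw.strip()
    match = FENCED_JSON.match(text)
    if match:
        text = match.group(1)  # unwrap a surrounding markdown fence
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict):
        return {"summary": "Model output was not valid JSON.", "findings": []}
    raw_findings = data.get("findings")
    if not isinstance(raw_findings, list):
        raw_findings = []
    findings = []
    for f in raw_findings:
        if (
            isinstance(f, dict)
            and isinstance(f.get("file"), str)
            and isinstance(f.get("line"), int)
            and f.get("severity") in ALLOWED_SEVERITIES
            and isinstance(f.get("comment"), str)
        ):
            f.setdefault("category", "general")
            findings.append(f)
    return {"summary": str(data.get("summary", "")), "findings": findings}
```

Calling `parse_review(response.content[-1].text)` in place of a bare `json.loads` means a single malformed finding does not abort the whole review.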
Step 5: Post Line-Anchored Comments
async def post_review(ctx, review, token):
    repo = ctx["pr"]["base"]["repo"]["full_name"]
    number = ctx["pr"]["number"]
    sha = ctx["pr"]["head"]["sha"]
    comments = []
    for f in review.get("findings", []):
        # On long reviews (more than 15 findings), drop minor ones to limit noise.
        if f["severity"] == "minor" and len(review["findings"]) > 15:
            continue
        body = f"**[{f['severity'].upper()}] {f['category']}**\n\n{f['comment']}"
        if f.get("suggested_fix"):
            # GitHub renders a suggestion block as a one-click applicable change.
            body += f"\n\n```suggestion\n{f['suggested_fix']}\n```"
        comments.append({
            "path": f["file"],
            "line": f["line"],
            "side": "RIGHT",
            "body": body
        })
    payload = {
        "commit_id": sha,
        "body": review.get("summary", "Review complete."),
        "event": "COMMENT",
        "comments": comments
    }
    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.post(
            f"https://api.github.com/repos/{repo}/pulls/{number}/reviews",
            headers={"Authorization": f"Bearer {token}", "Accept": "application/vnd.github+json"},
            json=payload
        )
        # GitHub rejects the whole review (422) if any comment targets a line
        # outside the diff, so surface failures instead of ignoring them.
        resp.raise_for_status()
Minor finding suppression helps prevent review fatigue. Cap total findings around 10-12, prioritizing higher severity issues.
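The cap can be sketched as a stable sort by severity followed by a slice. The names `cap_findings` and `MAX_FINDINGS` are assumptions for this example, and the limit of 12 is simply the upper end of the range suggested above.

```python
SEVERITY_RANK = {"blocker": 0, "major": 1, "minor": 2}
MAX_FINDINGS = 12  # upper end of the 10-12 range


def cap_findings(findings: list[dict], limit: int = MAX_FINDINGS) -> list[dict]:
    """Keep at most `limit` findings, highest severity first.

    sorted() is stable, so findings keep their original order within a severity.
    Unknown severities sort last.
    """
    ranked = sorted(
        findings,
        key=lambda f: SEVERITY_RANK.get(f.get("severity"), len(SEVERITY_RANK)),
    )
    return ranked[:limit]
```

Applying the cap before building the comment list means minor findings are the first to be dropped when a review runs long.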
Hardening: From Demo to Production-Grade Bot
While the minimal code above runs end-to-end on first push, production use requires additional robustness to handle false positives, scale cost-effectively, and integrate with team workflows.
False Positive Suppression
Track each finding’s file + line + category in a vector store. Use semantic similarity embeddings (Voyage-3 model at $0.06/M tokens) to detect near-duplicates before posting. If a finding triggers repeatedly across PRs without code changes, add it to a known_false_positives list injected into the system prompt as “do not flag.”
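The shape of the near-duplicate check can be illustrated without an embedding service. The sketch below deliberately swaps in plain string similarity from Python's `difflib` as a stand-in for embedding similarity, so it only catches near-identical wording; `is_near_duplicate` and the 0.85 threshold are assumptions for the example.

```python
from difflib import SequenceMatcher


def is_near_duplicate(finding: dict, previous: list[dict], threshold: float = 0.85) -> bool:
    """True if an earlier finding has the same file and category and a similar comment.

    String similarity is a simplification; a production version would compare
    embedding vectors instead.
    """
    for old in previous:
        if old["file"] != finding["file"] or old["category"] != finding["category"]:
            continue
        ratio = SequenceMatcher(
            None, old["comment"].lower(), finding["comment"].lower()
        ).ratio()
        if ratio >= threshold:
            return True
    return False
```

Swapping the `SequenceMatcher` call for a cosine similarity over stored embeddings keeps the rest of the function unchanged.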
Self-Critique Pass
For blocker findings, perform a second verification call to Claude Sonnet 4.6 asking: “Is this finding correct? Reproduce or justify it.” This reduces blocker false positives by ~50% with minimal cost overhead.
Cost Controls
- Set daily and per-PR spending caps (e.g., $0.50 average, $2 ceiling per PR, $100 daily org-wide)
- Log token usage per call to monitoring platforms (Datadog, Grafana, CloudWatch)
- Optimize context assembly to exclude irrelevant files if costs rise
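The caps above reduce to simple arithmetic on token counts. A minimal sketch, using the $3 and $15 per-million rates quoted in this guide and ignoring prompt-cache discounts; the function names are hypothetical:

```python
INPUT_USD_PER_MTOK = 3.00    # Sonnet 4.6 input price quoted in this guide
OUTPUT_USD_PER_MTOK = 15.00  # Sonnet 4.6 output price quoted in this guide
PER_PR_CEILING_USD = 2.00
DAILY_CAP_USD = 100.00


def call_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Uncached cost of one model call in USD."""
    return (
        input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    ) / 1_000_000


def within_budget(estimated_usd: float, pr_spent_usd: float, day_spent_usd: float) -> bool:
    """True if the next call fits under both the per-PR ceiling and the daily cap."""
    return (
        pr_spent_usd + estimated_usd <= PER_PR_CEILING_USD
        and day_spent_usd + estimated_usd <= DAILY_CAP_USD
    )
```

For example, a call with 100,000 input tokens and 2,000 output tokens costs $0.30 + $0.03 = $0.33 before any cache discount.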
Retry Logic
Handle API 429 (rate limit) and 529 (overload) errors with jittered exponential backoff, starting at 2 seconds and retrying at most 3 times. When the response includes a Retry-After header, wait at least that long before retrying.
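A retry wrapper along these lines might look like the sketch below. `ApiError` is a hypothetical exception type invented for the example; in real code you would catch your HTTP client's or SDK's own error classes and read the status and Retry-After value from them.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
RETRYABLE_STATUS = {429, 529}


class ApiError(Exception):
    """Hypothetical error carrying an HTTP status and optional Retry-After seconds."""

    def __init__(self, status: int, retry_after: Optional[float] = None) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status
        self.retry_after = retry_after


def call_with_retry(fn: Callable[[], T], max_retries: int = 3, base_delay: float = 2.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on 429/529 with jittered exponential backoff."""
    attempt = 0
    while True:
        try:
            return fn()
        except ApiError as err:
            if err.status not in RETRYABLE_STATUS or attempt >= max_retries:
                raise
            backoff = base_delay * (2 ** attempt)          # 2s, 4s, 8s
            delay = backoff + random.uniform(0, backoff / 2)  # add jitter
            if err.retry_after is not None:
                delay = max(delay, err.retry_after)        # never retry early
            sleep(delay)
            attempt += 1
```

The Anthropic Python SDK also has built-in retry behavior configurable through the client's `max_retries` option, so a wrapper like this is mainly useful for the GitHub calls or when you want explicit control.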
Repository Configuration
Allow teams to add a .review-bot.yml config file with options like severity_threshold, ignore_paths, focus_areas, and custom_rules. The bot reads and injects these settings into prompts, enabling tailored behavior without developer tickets.
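Once the YAML file has been parsed into a dict (for example with a YAML library of your choice), turning it into prompt text is straightforward. The sketch below assumes list-valued `ignore_paths`, `focus_areas`, and `custom_rules`, which the text does not specify; the class name and the wording of the generated prompt lines are illustrative.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ReviewConfig:
    """Per-repository settings, using the option names listed above."""

    severity_threshold: str = "minor"
    ignore_paths: list[str] = field(default_factory=list)
    focus_areas: list[str] = field(default_factory=list)
    custom_rules: list[str] = field(default_factory=list)

    @classmethod
    def from_dict(cls, raw: dict) -> "ReviewConfig":
        # Ignore unknown keys so a typo in the config file does not crash the bot.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in raw.items() if k in known})

    def prompt_section(self) -> str:
        """Render the settings as text to append to the system prompt."""
        lines = [f"Report findings at severity '{self.severity_threshold}' or higher."]
        if self.focus_areas:
            lines.append("Focus areas: " + ", ".join(self.focus_areas))
        for rule in self.custom_rules:
            lines.append(f"Team rule: {rule}")
        return "\n".join(lines)
```

`ignore_paths` is not rendered into the prompt here because it is better applied when selecting files to fetch, before any tokens are spent.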
Handling Massive PRs
For PRs exceeding 50,000 tokens of context, implement a two-pass approach:
- Use Claude Haiku 4.5 to summarize each file and identify risky changes.
- Review only high-risk files with Sonnet 4.6.
This reduces costs by ~70% for large PRs with minimal quality loss.
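The routing decision for the two-pass approach can be sketched as follows. The characters-per-token ratio is a rough heuristic rather than a real tokenizer, and `risk_scores` is a hypothetical output of the first (summarizer) pass mapping file names to a 0-1 risk value.

```python
TWO_PASS_THRESHOLD_TOKENS = 50_000
CHARS_PER_TOKEN = 4  # rough heuristic, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate the token count of a string."""
    return len(text) // CHARS_PER_TOKEN


def plan_review(files: dict[str, str], risk_scores: dict[str, float],
                risk_cutoff: float = 0.5) -> list[str]:
    """Return the file names to send to the main reviewer.

    Below the threshold, review everything. Above it, keep only files whose
    first-pass risk score meets the cutoff. Files without a score are treated
    as risky so nothing is skipped silently.
    """
    total = sum(estimate_tokens(content) for content in files.values())
    if total <= TWO_PASS_THRESHOLD_TOKENS:
        return list(files)
    return [name for name in files if risk_scores.get(name, 1.0) >= risk_cutoff]
```

Defaulting missing scores to 1.0 errs toward reviewing too much rather than too little when the summarizer pass fails for a file.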
Observability
Log metadata on each review: PR ID, model version, token counts (input/output/cache), latency, findings count by severity, and human feedback labels (“fixed”, “won’t fix”, “false positive”). Use this data to tune prompts and improve precision over time.
A Concrete Weekly Tuning Loop
- Extract the last 100 reviews sorted by finding category.
- Sample 5 findings per category, label as true positive (TP), noisy true positive (TP-noisy), or false positive (FP).
- Calculate precision per category. Demote or remove categories with precision below 60% from the prompt.
- Tighten “DO NOT COMMENT ON” list for noisy categories with concrete examples.
- Re-run the bot on 10 historical PRs with known bugs to confirm no regressions.
Following this process improves review quality faster than theoretical prompt engineering alone.
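The precision step of the loop above is a small calculation over the labeled sample. In this sketch both TP and TP-noisy count as correct findings, which is an assumption; adjust it if your team treats noisy true positives as misses.

```python
from collections import defaultdict

PRECISION_FLOOR = 0.60
CORRECT_LABELS = {"TP", "TP-noisy"}  # assumption: noisy true positives still count


def precision_by_category(labels: list[tuple[str, str]]) -> dict[str, float]:
    """labels holds (category, label) pairs; label is "TP", "TP-noisy", or "FP"."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for category, label in labels:
        totals[category] += 1
        if label in CORRECT_LABELS:
            hits[category] += 1
    return {category: hits[category] / totals[category] for category in totals}


def categories_to_demote(labels: list[tuple[str, str]]) -> list[str]:
    """Categories whose sampled precision falls below the 60% floor."""
    precision = precision_by_category(labels)
    return sorted(c for c, p in precision.items() if p < PRECISION_FLOOR)
```

With only 5 samples per category the estimate is coarse, so it is worth confirming a demotion against a second week of data before removing a category outright.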
Real-World Deployment Results After Three Months
A mid-sized SaaS engineering team (35 engineers, 4 services, 180 PRs per week) deployed this architecture starting January 2026. Their metrics after three months are illustrative:
- Cost: $327 monthly Anthropic API spend, averaging $0.41 per PR reviewed. The 95th percentile PR cost $1.20. Prompt caching achieved 84% hit rate on system prompts after the first week.
- Latency: Median time to post first bot comment was 38 seconds after PR open; 95th percentile was 94 seconds, primarily due to extended thinking on large PRs. This latency is fast enough that reviewers see bot feedback before manual review starts.
- Quality: After two tuning iterations, blocker severity precision reached 81%, major severity 68%, and minor severity 44%. Minor findings are suppressed aggressively to reduce noise. Estimated recall based on post-merge bugs caught is 35-45%.
- Developer Reception: Internal surveys showed 78% of engineers preferred to keep the bot enabled, 14% neutral, 8% wanted it disabled. Common complaints involved occasional false positives on new coding patterns and occasional missed edge cases.
Overall, the bot significantly reduced manual review effort and accelerated shipping cycles while maintaining developer trust.
Useful Links
- → Multi-Agent Prompting with Claude 4.6 Sonnet: Building AI Code Review Pipelines That Actually Work
- → Claude Opus 4.7 for Production AI Code Review in 2026
- → How to Build a Research Assistant with OpenAI Codex in 2026: Step-by-Step
- → Claude Code vs OpenAI Codex in 2026: The Definitive Comparison for Professional Developers
- → Anthropic Claude Model Pricing and Documentation
- → Anthropic Research on AI Safety and Multi-step Reasoning
Frequently Asked Questions
Why choose Claude Sonnet 4.6 over Claude Opus 4.7 for code review bots?
Claude Sonnet 4.6 costs roughly 60% of Opus 4.7 ($3/$15 versus $5/$25 per million tokens) while scoring about 4 points lower on SWE-bench Verified (77.2% versus 81.4%). For bots processing thousands of diffs weekly, that cost-to-performance ratio is more practical than chasing the highest raw capability ceiling.
How does the 1M-token context window improve automated code reviews?
It lets you load an entire microservice — source files, tests, commit history, and style guides — into a single request. This eliminates chunking heuristics and lost-in-the-middle failures that plague large PR reviews when using models with smaller context windows.
What are the five pipeline stages in the code review bot architecture?
The pipeline covers webhook ingestion with HMAC verification, diff and full-file context fetching, prompt assembly with caching, a Claude Sonnet 4.6 model call with JSON schema output constraints, and comment posting to GitHub via the API. Each stage is independently testable.
How should temperature and token settings be configured for Sonnet 4.6?
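Set temperature to 0.2 for consistent (though not strictly deterministic) review output and max_tokens to 8000. For non-trivial PRs, enable extended thinking with a 4000-token budget to allow the model to reason through complex diffs before producing structured feedback, and leave temperature at its default on those calls, since Anthropic's extended thinking documentation lists custom temperature values as incompatible with thinking.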
What prerequisites do you need before building this code review bot?
You need a GitHub organization with admin access, an Anthropic API key with Sonnet 4.6 enabled, Python 3.11+, a publicly reachable HTTPS endpoint (Cloudflare Workers, Lambda, Cloud Run, or Fly.io), and readiness to iterate on prompts throughout development.
How significant is prompt design compared to model selection for review quality?
According to the guide, prompt design and context assembly account for roughly 70% of code review quality, while model selection contributes about 30%. Investing in well-structured system prompts, reviewer personas, and review checklists outweighs switching between models.
