How to Use OpenAI Codex for Automated Code Review: Setting Up PR Analysis, Security Scanning, and Performance Optimization Agents

June 29, 2026

Author: Markos Symeonides

Automated code review agents built on Codex-class models can catch defects earlier, enforce team standards consistently, and reduce reviewer fatigue on high-volume repositories. In this tutorial, you’ll build a production-ready workflow that uses OpenAI’s code-capable models (the Codex lineage embedded in modern GPT-4-class models) to analyze pull requests (PRs), perform security and performance scans, and integrate with GitHub/GitLab CI/CD. You’ll also learn how to craft robust prompts, manage cost, handle false positives, and embed your team’s rules. Real code examples are included in Node.js and Python with CI configurations for GitHub Actions and GitLab CI.

Overview and Architecture
Setting Up a Codex-Class Review Agent
Configuring PR Analysis and Prompt Design
Security Scanning: OWASP Top 10, Secrets, and Dependencies
Performance Optimization Checks
Integrations with GitHub Actions and GitLab CI
Custom Review Rules and Team Standards
Handling False Positives and Tuning Sensitivity
End-to-End Examples: Prompts and Outputs
Cost Management for High-Volume Repositories
Best Practices for Human-in-the-Loop Reviews
Observability, Metrics, and Continuous Evaluation
Next Steps

1) Overview and Architecture

Codex-class models excel at understanding and generating code. When embedded in a PR workflow, they can:

Summarize PRs and highlight risks
Detect security issues aligned to OWASP Top 10
Spot performance anti-patterns and suggest optimizations
Enforce team-specific conventions and quality gates
Draft actionable, line-level review comments

The reference architecture below balances review accuracy, cost, and maintainability.

Core components

Diff collector: Fetches the PR/MR diff, commit messages, and file metadata.
Chunker: Splits the diff into manageable, context-rich chunks by file/language.
Prompt builder: Constructs structured prompts (system + user), with a JSON schema for outputs.
Analyzer: Calls an OpenAI code-capable model with low temperature and structured output.
Post-processor: Validates JSON, filters low-confidence findings, maps to OWASP/perf categories.
Reporter: Posts review comments and a summary check to GitHub/GitLab, optionally fails on high severity.
Cache/Cost guard: Skips unchanged chunks, deduplicates, and enforces per-run token budgets.

2) Setting Up a Codex-Class Review Agent

Prerequisites

OpenAI API key in CI secrets (e.g., OPENAI_API_KEY)
GitHub or GitLab token with repo/MR read and comment scopes
Node.js 18+ or Python 3.9+

Model selection

Use a strong, code-capable GPT-4-class model for analysis and a lightweight model for pre-screens. For example:

Primary analysis: a GPT-4-class “Codex lineage” model suited for code review
Pre-screening: a cost-efficient mini model to flag files unlikely to need deep analysis

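Set temperature to 0–0.2 so results are as repeatable as possible for auditing. A low temperature reduces run-to-run variation but does not guarantee identical output.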

Security of secrets

Store OPENAI_API_KEY and repo tokens in CI secrets, not in code.
Scope tokens to read/write PRs only. Avoid full admin scopes.
Log only non-sensitive metadata. Never log raw diffs with secrets or tokens.

Node.js reference implementation

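The script below fetches the files changed in a GitHub PR, treats each file's patch as one chunk, sends it to the model, and posts the results. It posts one summary comment, then posts each critical or high finding as a separate PR comment that names the file and line. These are ordinary PR comments, not inline review comments.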

/**
 * analyze-pr.js
 * Node.js reference implementation for a Codex-class code review agent.
 * Requires:
 *   - OPENAI_API_KEY (env)
 *   - GITHUB_TOKEN (env)
 *   - GITHUB_REPOSITORY (org/repo)
 *   - PR_NUMBER (number)
 */

import fetch from "node-fetch"; // optional on Node.js 18+, which provides a global fetch
import OpenAI from "openai";

const OPENAI_API_KEY = process.env.OPENAI_API_KEY;
const GITHUB_TOKEN = process.env.GITHUB_TOKEN;
const REPO = process.env.GITHUB_REPOSITORY; // "org/repo"
const PR_NUMBER = process.env.PR_NUMBER; // e.g., 123

if (!OPENAI_API_KEY || !GITHUB_TOKEN || !REPO || !PR_NUMBER) {
  console.error("Missing required env vars: OPENAI_API_KEY, GITHUB_TOKEN, GITHUB_REPOSITORY, PR_NUMBER");
  process.exit(1);
}

const openai = new OpenAI({ apiKey: OPENAI_API_KEY });

// Choose a code-capable model
const MODEL = "gpt-4o-mini"; // cost-efficient default; swap in a stronger model for primary analysis

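async function getPullRequestFiles() {
  // The files endpoint is paginated (30 per page by default), so walk every page.
  const files = [];
  for (let page = 1; ; page++) {
    const url = `https://api.github.com/repos/${REPO}/pulls/${PR_NUMBER}/files?per_page=100&page=${page}`;
    const res = await fetch(url, {
      headers: {
        "Authorization": `token ${GITHUB_TOKEN}`,
        "Accept": "application/vnd.github+json"
      }
    });
    if (!res.ok) {
      throw new Error(`GitHub API error: ${res.status} ${await res.text()}`);
    }
    const batch = await res.json();
    files.push(...batch);
    if (batch.length < 100) return files; // a short page is the last page
  }
}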

function languageFromFilename(filename) {
  const ext = filename.split(".").pop();
  const map = {
    js: "JavaScript", ts: "TypeScript", py: "Python", rb: "Ruby", java: "Java",
    go: "Go", cs: "C#", php: "PHP", cpp: "C++", c: "C", rs: "Rust",
    kt: "Kotlin", swift: "Swift", scala: "Scala", sh: "Shell", yaml: "YAML",
    yml: "YAML", json: "JSON", md: "Markdown", sql: "SQL"
  };
  return map[ext] || "Unknown";
}

function buildSystemPrompt(config) {
  return `
You are an autonomous senior code reviewer (Codex-class). Output MUST be valid JSON matching the schema:

{
  "summary": { "overview": string, "risk_score": number, "categories": string[] },
  "findings": [
    {
      "id": string,
      "file": string,
      "line": number | null,
      "severity": "critical" | "high" | "medium" | "low" | "info",
      "category": "security" | "performance" | "maintainability" | "style" | "dependency",
      "title": string,
      "description": string,
      "evidence": string,
      "owasp": string[] | [],
      "tags": string[],
      "suggestion": { "type": "explanation" | "patch", "content": string }
    }
  ]
}

Strict requirements:
- Report only when evidence is strong. Avoid speculative findings.
- Use OWASP tags for security (e.g., "A01:2021-Broken Access Control").
- Point to specific lines when possible; otherwise line should be null.
- Group similar issues into one finding, not duplicates.
- Suggest minimal, safe changes. If unsure, ask for clarification in description.
- Follow team rules when provided.
${config?.rulesNote ? `\nTeam rules: ${config.rulesNote}\n` : ""}
`;
}

function buildUserPrompt({ filename, language, patch, additions, deletions }, options) {
  // Trim overly large patches to a cap (token/cost guard)
  const MAX_CHARS = options?.maxChars || 12000;
  let trimmedPatch = patch;
  if (patch && patch.length > MAX_CHARS) {
    trimmedPatch = patch.slice(0, MAX_CHARS) + "\n... [truncated]";
  }

  return `
Analyze the following ${language} diff and produce JSON per schema:

File: ${filename}
Changes: +${additions} / -${deletions}

Unified diff (git patch format):
${trimmedPatch}

Review modes enabled: ${options?.modes?.join(", ") || "security, performance, maintainability"}.

Focus:
- Security: OWASP Top 10 categories, injection, broken access control, insecure deserialization, SSRF, secrets.
- Performance: Hot-path allocations, N+1 queries, blocking I/O in async contexts, inefficient loops.
- Maintainability: Null handling, error handling contracts, confusing control flow, dead code.
- Dependency: Introduced/upgraded packages with known CVEs (if provided separately).

Constraints:
- If no issues, output an empty findings array with a clear summary.
- Be concise; avoid generic advice. Reference exact code snippets in evidence.
`;
}

async function analyzeChunk(chunk, config) {
  const response = await openai.chat.completions.create({
    model: MODEL,
    temperature: 0.1,
    response_format: { type: "json_object" },
    messages: [
      { role: "system", content: buildSystemPrompt(config) },
      { role: "user", content: buildUserPrompt(chunk, { modes: config?.modes, maxChars: config?.maxChars }) }
    ],
  });
  const text = response.choices[0].message.content || "{}";
  const fallback = { summary: { overview: "Parse error", risk_score: 0, categories: [] }, findings: [] };
  try {
    return JSON.parse(text);
  } catch (e) {
    // Fallback parsing: try the outermost {...} span, which may itself be invalid JSON
    const match = text.match(/\{[\s\S]*\}/);
    if (!match) return fallback;
    try {
      return JSON.parse(match[0]);
    } catch (e2) {
      return fallback;
    }
  }
}

function dedupeFindings(findings) {
  const seen = new Set();
  return findings.filter(f => {
    const key = `${f.file}:${f.line || 0}:${(f.title || "").slice(0,50)}:${f.category}`;
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

async function postReviewSummary(body) {
  const url = `https://api.github.com/repos/${REPO}/issues/${PR_NUMBER}/comments`;
  const res = await fetch(url, {
    method: "POST",
    headers: {
      "Authorization": `token ${GITHUB_TOKEN}`,
      "Accept": "application/vnd.github+json",
      "Content-Type": "application/json"
    },
    body: JSON.stringify({ body })
  });
  if (!res.ok) {
    console.error(`Failed to post summary: ${res.status} ${await res.text()}`);
  }
}

async function postReviewComment(file, line, body) {
  // PR review comment API requires a commit_id and position for diff; for simplicity, post as issue comment with file context.
  const formatted = `File: ${file}${line ? ` (line ${line})` : ""}\n\n${body}`;
  await postReviewSummary(formatted);
}

function renderSummaryReport(allSummaries, combinedFindings) {
  const total = combinedFindings.length;
  const bySeverity = { critical: 0, high: 0, medium: 0, low: 0, info: 0 };
  combinedFindings.forEach(f => { bySeverity[f.severity] = (bySeverity[f.severity] || 0) + 1; });

  const topCategories = {};
  combinedFindings.forEach(f => { topCategories[f.category] = (topCategories[f.category] || 0) + 1; });
  const cats = Object.entries(topCategories).sort((a,b)=>b[1]-a[1]).slice(0,5).map(([k,v]) => `${k}(${v})`).join(", ");

  return [
    "Automated Codex-Class Review Summary",
    "",
    `Findings: ${total} (critical:${bySeverity.critical}, high:${bySeverity.high}, medium:${bySeverity.medium}, low:${bySeverity.low}, info:${bySeverity.info})`,
    `Top categories: ${cats || "none"}`,
    "",
    ...allSummaries.map((s, i)=>`Part ${i+1}: ${s.summary?.overview || ""}`)
  ].join("\n");
}

async function main() {
  const files = await getPullRequestFiles();
  const chunks = files
    .filter(f => f.status !== "removed")
    .filter(f => !/(\.lock|package-lock\.json|pnpm-lock\.yaml)$/.test(f.filename)) // skip lockfiles to save cost (or handle separately)
    .map(f => ({
      filename: f.filename,
      language: languageFromFilename(f.filename),
      patch: f.patch || "",
      additions: f.additions,
      deletions: f.deletions
    }))
    .filter(f => f.patch);

  const config = {
    modes: ["security", "performance", "maintainability"],
    maxChars: 24000,
    rulesNote: "No secrets in code; enforce async I/O in Node; avoid raw SQL strings outside ORM; prefer parameterized queries."
  };

  const results = [];
  for (const chunk of chunks) {
    const result = await analyzeChunk(chunk, config);
    results.push(result);
  }

  const allFindings = dedupeFindings(results.flatMap(r => r.findings || []));
  const summary = renderSummaryReport(results, allFindings);

  // Post summary
  await postReviewSummary("```\n" + summary + "\n```");

  // Optionally, post top N line comments
  const top = allFindings
    .filter(f => ["critical","high"].includes(f.severity))
    .slice(0, 10);

  for (const f of top) {
    const body = [
      `Severity: ${f.severity.toUpperCase()}`,
      `Category: ${f.category}`,
      `Title: ${f.title}`,
      `Description: ${f.description}`,
      f.owasp && f.owasp.length ? `OWASP: ${f.owasp.join(", ")}` : "",
      `Evidence:\n${f.evidence}`,
      f.suggestion ? `Suggestion (${f.suggestion.type}):\n${f.suggestion.content}` : "" // the model may omit suggestion
    ].filter(Boolean).join("\n\n");
    await postReviewComment(f.file, f.line || null, body);
  }

  // Optionally fail CI on criticals
  const hasCritical = allFindings.some(f => f.severity === "critical");
  if (hasCritical) {
    console.error("Critical issues found by automated review.");
    process.exit(2);
  }
}

main().catch(err => {
  console.error(err);
  process.exit(1);
});

Why JSON output?

Strict JSON lets you parse, deduplicate, and enforce policies deterministically. You can convert JSON findings to PR comments, dashboards, and metrics without brittle string parsing.

Python reference (GitLab/MR-friendly)

# review_mr.py
# Python example for GitLab merge requests using a Codex-class model.

import os
import json
import requests
from openai import OpenAI

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
GITLAB_TOKEN = os.getenv("GITLAB_TOKEN")
PROJECT_ID = os.getenv("CI_PROJECT_ID")
MR_IID = os.getenv("CI_MERGE_REQUEST_IID")
GITLAB_API = os.getenv("CI_API_V4_URL", "https://gitlab.com/api/v4")

if not (OPENAI_API_KEY and GITLAB_TOKEN and PROJECT_ID and MR_IID):
    raise SystemExit("Missing OPENAI_API_KEY, GITLAB_TOKEN, CI_PROJECT_ID, CI_MERGE_REQUEST_IID")

client = OpenAI(api_key=OPENAI_API_KEY)
MODEL = "gpt-4o-mini"

def get_mr_changes():
    url = f"{GITLAB_API}/projects/{PROJECT_ID}/merge_requests/{MR_IID}/changes"
    r = requests.get(url, headers={"PRIVATE-TOKEN": GITLAB_TOKEN})
    r.raise_for_status()
    return r.json()["changes"]

def language_from_filename(filename):
    ext = filename.split(".")[-1]
    return {
        "js":"JavaScript","ts":"TypeScript","py":"Python","rb":"Ruby","java":"Java","go":"Go",
        "cs":"C#","php":"PHP","cpp":"C++","c":"C","rs":"Rust","kt":"Kotlin","swift":"Swift",
        "scala":"Scala","sh":"Shell","yaml":"YAML","yml":"YAML","json":"JSON","md":"Markdown","sql":"SQL"
    }.get(ext, "Unknown")

def build_system_prompt():
    return """
You are a Codex-class autonomous reviewer. Output strict JSON:
{"summary":{"overview":string,"risk_score":number,"categories":string[]},"findings":[{"id":string,"file":string,"line":number|null,"severity":"critical"|"high"|"medium"|"low"|"info","category":"security"|"performance"|"maintainability"|"style"|"dependency","title":string,"description":string,"evidence":string,"owasp":string[],"tags":string[],"suggestion":{"type":"explanation"|"patch","content":string}}]}
Only report with strong evidence. Group duplicates. Use OWASP tags for security findings. Keep responses concise.
"""

def build_user_prompt(change):
    patch = change.get("diff") or ""
    if len(patch) > 24000:
        patch = patch[:24000] + "\n... [truncated]"
    return f"""
Analyze the following {language_from_filename(change["new_path"])} diff and produce JSON per schema:

File: {change["new_path"]}
Unified diff:
{patch}

Focus: security (OWASP), performance, maintainability. Only emit findings with evidence.
"""

def post_mr_note(body):
    url = f"{GITLAB_API}/projects/{PROJECT_ID}/merge_requests/{MR_IID}/notes"
    r = requests.post(url, headers={"PRIVATE-TOKEN": GITLAB_TOKEN}, json={"body": body})
    r.raise_for_status()

def main():
    changes = get_mr_changes()
    all_findings = []
    summaries = []
    for ch in changes:
        # Skip deleted files and entries with no textual diff (e.g., binaries or oversized diffs)
        if ch.get("deleted_file") or not ch.get("diff"):
            continue
        messages = [
            {"role": "system", "content": build_system_prompt()},
            {"role": "user", "content": build_user_prompt(ch)}
        ]
        resp = client.chat.completions.create(
            model=MODEL, temperature=0.1, response_format={"type":"json_object"}, messages=messages
        )
        text = resp.choices[0].message.content
        try:
            data = json.loads(text)
        except Exception:
            continue
        summaries.append(data.get("summary", {}))
        all_findings.extend(data.get("findings", []))

    sev_order = {"critical":5,"high":4,"medium":3,"low":2,"info":1}
    all_findings.sort(key=lambda f: sev_order.get(f.get("severity","info"),1), reverse=True)

    lines = []
    by_sev = {"critical": 0, "high": 0, "medium": 0, "low": 0, "info": 0}
    for f in all_findings:
        # Model output is untrusted: fall back instead of raising KeyError on a missing field
        sev = f.get("severity", "info")
        by_sev[sev] = by_sev.get(sev, 0) + 1
    lines.append(f"Automated Codex-Class Review: {len(all_findings)} findings")
    lines.append(f"By severity: {by_sev}")
    for f in all_findings[:15]:
        sev = str(f.get("severity", "info")).upper()
        lines.append(f"- [{sev}] {f.get('file', '?')}:{f.get('line') or '?'} - {f.get('title', '')}")
    post_mr_note("```\n" + "\n".join(lines) + "\n```")

    # Optional: fail pipeline if critical findings exist
    if any(f.get("severity") == "critical" for f in all_findings):
        raise SystemExit("Critical issues found")

if __name__ == "__main__":
    main()

3) Configuring PR Analysis and Prompt Design

What to review in a PR

Security: input validation, authz/authn boundaries, injection risks, secret exposure
Performance: algorithmic complexity, hot-path allocations, blocking I/O, N+1 queries
Correctness: null handling, error propagation, concurrency hazards
Maintainability: readability, modularity, dead code, excessive nesting
Style: lint rules not auto-fixable, team conventions
Dependencies: new/updated packages and known vulnerabilities

Prompt structure

Use a system message to set strict behavior and a user message per file/diff. Embed a JSON schema explicitly, specifying fields like severity, category, and suggestion. Keep temperature low and require evidence quotes.

System prompt template (reusable)

{
  "role": "system",
  "content": "You are a Codex-class autonomous reviewer. Output strict JSON with fields: summary (overview, risk_score, categories), findings (id, file, line, severity, category, title, description, evidence, owasp, tags, suggestion). Report only with strong evidence; avoid duplicates; use OWASP tags where applicable."
}

User prompt template (diff-focused)

{
  "role": "user",
  "content": "Analyze the following <LANGUAGE> diff for security, performance, and maintainability. Provide JSON findings with line numbers and minimal safe suggestions. Unified diff:\\n<PATCH>"
}

Chunking strategy

Chunk by file, never across languages. Maintain per-file context.
Trim patches beyond a safe size and mark as truncated. Add a second pass for large files if needed.
Skip known generated files (e.g., .snap snapshots and lockfiles such as yarn.lock or package-lock.json) unless dependency scanning is enabled.
Pass metadata: file path, language, additions/deletions, previous review history if available.
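
The reference scripts trim a patch at a fixed character cap, which can cut a hunk in half. A minimal sketch of an alternative that groups whole hunks into size-bounded chunks, assuming hunks begin with an `@@` header as in unified diffs. The names `split_patch_by_hunks` and `max_chars` are hypothetical and not part of the scripts above.

```python
def split_patch_by_hunks(patch: str, max_chars: int = 12000) -> list[str]:
    """Group whole hunks of a unified diff into chunks of at most max_chars.

    A single hunk larger than max_chars is kept whole in its own chunk
    rather than cut mid-hunk; the caller can still truncate it afterwards.
    """
    hunks: list[str] = []
    current: list[str] = []
    for line in patch.splitlines():
        # A new hunk header closes the hunk collected so far.
        if line.startswith("@@") and current:
            hunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        hunks.append("\n".join(current))

    chunks: list[str] = []
    buf = ""
    for hunk in hunks:
        candidate = hunk if not buf else buf + "\n" + hunk
        if buf and len(candidate) > max_chars:
            # Adding this hunk would exceed the cap: flush and start a new chunk.
            chunks.append(buf)
            buf = hunk
        else:
            buf = candidate
    if buf:
        chunks.append(buf)
    return chunks
```

Each returned chunk can be sent as its own user prompt for the same file, which is one way to implement the "second pass for large files" mentioned above.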

Deterministic parsing and validation

Use response_format: json_object to get syntactically valid JSON when the model supports it. This mode does not enforce your schema, so still validate every field.
Fallback with robust JSON extraction to handle occasional format drift.
Validate each finding: file exists, line is within changed range, severity is allowed.
Normalize categories and tags to a controlled vocabulary.
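
The validation checks above could be sketched as follows. This is an illustration under stated assumptions, not the article's implementation: `validate_finding`, `changed_ranges`, and the `patches_by_file` mapping (filename to unified-diff patch) are hypothetical names. The line check uses hunk headers, so it accepts context lines inside a hunk as well as changed lines.

```python
import re

ALLOWED_SEVERITIES = {"critical", "high", "medium", "low", "info"}
ALLOWED_CATEGORIES = {"security", "performance", "maintainability", "style", "dependency"}
# Matches headers such as "@@ -42,7 +42,13 @@" and captures the new-file start and count.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@", re.MULTILINE)


def changed_ranges(patch: str) -> list[tuple[int, int]]:
    """Return (first_line, last_line) in the new file for each hunk header."""
    ranges = []
    for match in HUNK_HEADER.finditer(patch):
        start = int(match.group(1))
        count = int(match.group(2)) if match.group(2) is not None else 1
        if count > 0:
            ranges.append((start, start + count - 1))
    return ranges


def validate_finding(finding: dict, patches_by_file: dict[str, str]) -> bool:
    """Keep a finding only if it names a changed file, uses an allowed severity
    and category, and has a line that is null or inside one of the file's hunks."""
    filename = finding.get("file")
    if not isinstance(filename, str) or filename not in patches_by_file:
        return False
    if finding.get("severity") not in ALLOWED_SEVERITIES:
        return False
    if finding.get("category") not in ALLOWED_CATEGORIES:
        return False
    line = finding.get("line")
    if line is None:
        return True
    if not isinstance(line, int) or isinstance(line, bool):
        return False
    return any(lo <= line <= hi for lo, hi in changed_ranges(patches_by_file[filename]))
```

Running every finding through a filter like this before posting comments keeps hallucinated file names and out-of-range line numbers out of the PR.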

4) Security Scanning: OWASP Top 10, Secrets, and Dependencies

Codex-class models augment static scanners by reasoning about context and intent. Combine them with deterministic tools for best coverage.

Coverage map (selected OWASP Top 10)

OWASP Category	What to Ask the Model to Detect	Signal Examples
A01:2021 Broken Access Control	Missing authz checks on sensitive paths, direct object references, role bypass	Routes with admin operations but no role checks; exposed IDs used directly in DB queries
A03:2021 Injection	String-concatenated SQL, command injection, template injection	Raw SQL with user input; shell execution of user-supplied args
A05:2021 Security Misconfiguration	Disabling HTTPS checks, permissive CORS, verbose error leaks	CORS “*”, stack traces returned in API
A07:2021 Identification and Authentication Failures	Weak password checks, missing MFA enforcement, session fixation	Passwords logged or compared insecurely
A08:2021 Software and Data Integrity Failures	Unpinned dependencies, unsigned plugin loading, unsafe deserialization	Dynamic import of untrusted packages, pickle.loads on user input
A09:2021 Security Logging and Monitoring Failures	Missing audit logs for critical actions	No logging on privilege changes
A10:2021 Server-Side Request Forgery (SSRF)	External fetch of user-provided URLs without allowlists	HTTP clients invoked with unchecked URLs

Secrets scanning via prompt + regex pre-pass

Pre-scan diffs with lightweight regex for API keys, tokens, credentials.
Pass matches to the model to confirm if they are real secrets or test fixtures.
Require evidence and a redaction plan in suggestions.

// secret_scan.js (pre-pass example)
const SECRET_PATTERNS = [
  /AKIA[0-9A-Z]{16}/g, // AWS Access Key ID
  /aws_secret_access_key\s*=\s*["'][A-Za-z0-9\/+=]{40}["']/gi,
  /xox[baprs]-[0-9A-Za-z-]{10,48}/g, // Slack tokens
  /-----BEGIN (RSA|DSA|EC|PGP) PRIVATE KEY-----/g
];

export function findPotentialSecrets(patch) {
  const hits = [];
  SECRET_PATTERNS.forEach(rx => {
    const m = patch.match(rx);
    if (m) hits.push(...m);
  });
  return Array.from(new Set(hits));
}
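
Because raw secrets should not reach logs or PR comments, matches from a pre-pass can be masked before they are quoted as evidence. A minimal Python sketch, assuming two of the patterns above ported to Python's `re` syntax; `redact_secrets` and its `keep` parameter are hypothetical names.

```python
import re

# Two of the pre-pass patterns above, ported to Python regular expressions.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                # AWS access key ID
    re.compile(r"xox[baprs]-[0-9A-Za-z-]{10,48}"),  # Slack tokens
]


def redact_secrets(text: str, keep: int = 4) -> str:
    """Replace each match with its first `keep` characters followed by asterisks."""

    def mask(match: re.Match) -> str:
        value = match.group(0)
        return value[:keep] + "*" * (len(value) - keep)

    for pattern in SECRET_PATTERNS:
        text = pattern.sub(mask, text)
    return text
```

Keeping a short prefix makes the finding recognizable to the author without reproducing the credential.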

Dependency vulnerability checks (OSV integration)

When the diff includes package manifests (package.json, requirements.txt, go.mod, pom.xml), gather new/updated dependencies and query a vulnerability database such as OSV. Feed any flagged CVEs into the model as evidence for a dependency-category finding.

// osv_check.js
import fetch from "node-fetch";

export async function queryOSV(packageName, ecosystem, version) {
  // https://osv.dev schema
  const res = await fetch("https://api.osv.dev/v1/query", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ package: { name: packageName, ecosystem }, version })
  });
  if (!res.ok) throw new Error(`OSV API error: ${res.status}`); // do not treat an API failure as "no vulnerabilities"
  const data = await res.json();
  return data.vulns || [];
}
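
Before querying OSV, the new or updated dependencies have to be pulled out of the manifest diff. A minimal sketch for requirements.txt only, assuming exact `name==version` pins; `added_pinned_requirements` is a hypothetical helper, and lines with extras, ranges, or URLs are deliberately ignored.

```python
import re

# An added diff line ("+") holding an exact pin such as "requests==2.31.0".
PINNED = re.compile(r"^\+\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*==\s*([A-Za-z0-9.!+*_-]+)")


def added_pinned_requirements(patch: str) -> list[tuple[str, str]]:
    """Return (name, version) pairs for exact pins added in a requirements.txt diff."""
    deps = []
    for line in patch.splitlines():
        if line.startswith("+++"):
            continue  # file header, not an added line
        match = PINNED.match(line)
        if match:
            deps.append((match.group(1), match.group(2)))
    return deps
```

Each pair can then be passed to a lookup such as queryOSV above with the ecosystem set to "PyPI".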

Feeding dependency issues to the model

function buildDependencyPrompt(depFindings) {
  return `
Evaluate the following updated dependencies. For any with known CVEs, generate 'dependency' findings with severity (high if exploitable), and suggestions (pin safe versions).

Dependencies with CVEs:
${JSON.stringify(depFindings, null, 2)}
`;
}

Security-focused prompt augmentations

Require OWASP tags for security findings.
Ask for exploitability conditions and remediation steps as evidence and suggestion.
Enforce “no speculation” by asking for exact code references and conditions.

5) Performance Optimization Checks

Codex-class models can surface performance anti-patterns and suggest fixes that take the surrounding code into account. Calibrate your prompt to target your stack’s hot paths.

Common patterns to flag

N+1 DB queries in loops
Blocking I/O on async/event loops
Excessive intermediate allocations/copies
Inefficient regex/combinatorial parsing
Quadratic loops over large collections
Unbounded concurrency/fan-out without backpressure
String concatenation in tight loops (favor builders/joins)

Prompt snippet for performance emphasis

Performance focus:
- Identify hot-path risks (loops, handlers, middleware).
- Highlight blocking calls in async contexts and propose non-blocking alternatives.
- For DB code, detect per-item queries and suggest bulk operations or joins.
- Provide minimal diff-ready patches when safe.

Example: Node.js event loop blocking

diff --git a/server.js b/server.js
index 123..456 100644
--- a/server.js
+++ b/server.js
@@ -42,7 +42,13 @@ app.get("/report", async (req, res) => {
-  const data = fs.readFileSync("./large.json", "utf8"); // blocking on event loop
+  const data = await fs.promises.readFile("./large.json", "utf8"); // non-blocking
   const report = JSON.parse(data);
   res.json(report);
});

Expected model finding (excerpt)

{
  "id": "perf-node-blocking-001",
  "file": "server.js",
  "line": 43,
  "severity": "high",
  "category": "performance",
  "title": "Blocking file read on event loop",
  "description": "fs.readFileSync blocks the Node.js event loop under load.",
  "evidence": "Found fs.readFileSync(...) in a request handler.",
  "owasp": [],
  "tags": ["nodejs","event-loop","io"],
  "suggestion": { "type": "patch", "content": "Replace readFileSync with await fs.promises.readFile(...)" }
}

Example: Python N+1 query

diff --git a/views.py b/views.py
--- a/views.py
+++ b/views.py
@@ -20,7 +20,11 @@ def list_users():
-  users = User.query.all()
-  return [{"name": u.name, "posts": [p.title for p in u.posts]} for u in users]
+  users = User.query.options(selectinload(User.posts)).all()
+  return [{"name": u.name, "posts": [p.title for p in u.posts]} for u in users]

Expected model finding (excerpt)

{
  "id": "perf-nplus1-002",
  "file": "views.py",
  "line": 21,
  "severity": "medium",
  "category": "performance",
  "title": "Potential N+1 query mitigated by eager loading",
  "description": "Original code risked N+1 queries; change introduces selectinload, which is recommended.",
  "evidence": "Accessing u.posts in a list comprehension over users.",
  "owasp": [],
  "tags": ["sqlalchemy","n+1","eager-loading"],
  "suggestion": { "type": "explanation", "content": "Retain selectinload; if list is large, paginate results." }
}

6) Integrations with GitHub Actions and GitLab CI

GitHub Actions

# .github/workflows/codex-review.yml
name: Codex-Class Automated Review

on:
  pull_request:
    types: [opened, synchronize, reopened, ready_for_review]

jobs:
  review:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
      contents: read
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: "20"

      - name: Install deps
        run: npm ci
        if: ${{ hashFiles('**/package-lock.json') != '' }}

      - name: Analyze PR
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GITHUB_REPOSITORY: ${{ github.repository }}
          PR_NUMBER: ${{ github.event.number }}
        run: node scripts/analyze-pr.js

      - name: Post status
        if: failure()
        run: echo "Critical issues identified by Codex-class reviewer."

GitLab CI

# .gitlab-ci.yml
stages: [review]

codex_review:
  stage: review
  image: python:3.11
  variables:
    PIP_DISABLE_PIP_VERSION_CHECK: "1"
  before_script:
    - pip install openai requests
  script:
    - python scripts/review_mr.py
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  artifacts:
    when: always
    reports:
      junit: report.xml  # review_mr.py as written does not produce this file; generate it or remove this block
  allow_failure: false

Mapping findings to checks

GitHub: post a summary comment and optional line comments; fail the job on criticals to block merge (optional).
GitLab: post MR notes; optionally emit a JUnit-format report so findings appear in the merge request test widget.
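
One way to produce a JUnit-format report from the findings is sketched below with Python's standard `xml.etree.ElementTree`. The mapping is an assumption made for illustration: each finding becomes a test case, and only severities in the hypothetical `BLOCKING` set are marked as failures. `findings_to_junit` is not part of the scripts above.

```python
import xml.etree.ElementTree as ET

BLOCKING = {"critical"}  # severities reported as failed test cases (assumed policy)


def findings_to_junit(findings: list[dict], suite_name: str = "codex-review") -> str:
    """Render findings as a single JUnit <testsuite> XML string."""
    failures = [f for f in findings if f.get("severity") in BLOCKING]
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(findings)),
        failures=str(len(failures)),
    )
    for f in findings:
        case = ET.SubElement(
            suite,
            "testcase",
            classname=str(f.get("file", "unknown")),
            name=str(f.get("title", "untitled")),
        )
        description = str(f.get("description", ""))
        if f.get("severity") in BLOCKING:
            failure = ET.SubElement(case, "failure", message=str(f.get("severity")))
            failure.text = description
        else:
            # Non-blocking findings pass but keep their text for reference.
            ET.SubElement(case, "system-out").text = description
    return ET.tostring(suite, encoding="unicode")
```

Writing the returned string to report.xml at the end of the review script would give the CI job an artifact to upload.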

7) Custom Review Rules and Team Standards

Codex-class agents become significantly more valuable when they encode your team’s conventions and risk appetite. Externalize rules in a versioned repo file (e.g., .codex-review.yml) and feed them into prompts.

Rules config example (.codex-review.yml)

modes:
  - security
  - performance
  - maintainability

severity_thresholds:
  block_merge: ["critical"]    # fail CI when these severities appear
  notify_only: ["low","info"]  # do not fail CI

language_policies:
  JavaScript:
    - rule: "No blocking I/O in HTTP handlers"
      evidence_hint: "fs.readFileSync, child_process.execSync"
      severity: "high"
    - rule: "Use parameterized queries"
      evidence_hint: "raw SQL string concatenation"
      severity: "critical"

  Python:
    - rule: "Avoid unsafe deserialization"
      evidence_hint: "pickle.loads on variable input"
      severity: "high"

dependency_policies:
  allowed_licenses: ["MIT","Apache-2.0","BSD-3-Clause"]
  ban_packages: ["left-pad@*"]
  require_pins: true

file_exclusions:
  - "dist/**"
  - "**/*.snap"
  - "**/*.lock"

style_conventions:
  max_function_length: 80
  max_cyclomatic_complexity: 10

Injecting rules into prompts

Summarize rules into a concise note (rulesNote) and include in your system message. Provide evidence hints to improve precision.
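
A minimal sketch of that summarizing step, assuming the rules file has already been loaded into a dictionary with the same keys as the example config (loading YAML itself would need a third-party parser and is left out). `build_rules_note` is a hypothetical helper name.

```python
def build_rules_note(config: dict, language: str) -> str:
    """Flatten the language_policies for one language into a one-line note."""
    policies = config.get("language_policies", {}).get(language, [])
    parts = []
    for policy in policies:
        hint = policy.get("evidence_hint")
        suffix = f" (look for: {hint})" if hint else ""
        parts.append(f"[{policy.get('severity', 'medium')}] {policy['rule']}{suffix}")
    return "; ".join(parts)


# Example using the JavaScript policies from the config above.
example_config = {
    "language_policies": {
        "JavaScript": [
            {"rule": "No blocking I/O in HTTP handlers",
             "evidence_hint": "fs.readFileSync, child_process.execSync",
             "severity": "high"},
            {"rule": "Use parameterized queries",
             "evidence_hint": "raw SQL string concatenation",
             "severity": "critical"},
        ]
    }
}
print(build_rules_note(example_config, "JavaScript"))
```

Building the note per language keeps the system message short, since a Python diff never needs the JavaScript rules.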

Team-specific examples

Security baseline: require authz on routes under /admin and /billing.
Performance baseline: ban synchronous FS/crypto within request lifecycles.
Maintainability baseline: fail for new functions exceeding cyclomatic complexity 10.

8) Handling False Positives and Tuning Sensitivity

Practical strategies

Evidence requirement: model must quote exact code lines and conditions. Discard findings without evidence.
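Severity gating: block merges only on the severities listed in block_merge (critical in the examples here; add high if your team wants stricter gating) and surface the rest as comments.
Cross-tool corroboration: elevate severity only if a static tool (e.g., ESLint, Semgrep, Bandit) and the model agree.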
Allowlist/waivers: support per-finding suppression IDs (e.g., codex-ignore: perf-nplus1-002) with expiry.
Temperature/Top-p: keep temperature low (≤ 0.2) and avoid creative sampling for reviews.
Feedback loop: track dismissed findings and feed patterns back into prompts (e.g., “Do not report XYZ in tests/benchmark files”).
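
The waiver idea above could be sketched as follows. The comment syntax `codex-ignore: <id> until=YYYY-MM-DD` is an assumption made for this example (the text only fixes the `codex-ignore: <id>` part), and waivers without a date are ignored here so that every suppression must carry an expiry. `active_waivers` and `apply_waivers` are hypothetical names.

```python
import re
from datetime import date

# Hypothetical waiver syntax: "codex-ignore: <finding-id> until=YYYY-MM-DD".
WAIVER = re.compile(r"codex-ignore:[ \t]*([\w-]+)[ \t]+until=(\d{4}-\d{2}-\d{2})")


def active_waivers(text: str, today: date) -> set[str]:
    """Return the finding IDs whose waiver has not yet expired."""
    waived = set()
    for finding_id, until in WAIVER.findall(text):
        try:
            expires = date.fromisoformat(until)
        except ValueError:
            continue  # malformed date: treat the waiver as invalid
        if expires >= today:
            waived.add(finding_id)
    return waived


def apply_waivers(findings: list[dict], waived: set[str]) -> list[dict]:
    """Drop findings whose id is covered by an active waiver."""
    return [f for f in findings if f.get("id") not in waived]
```

Passing `today` in explicitly, rather than reading the clock inside the function, keeps the behavior reproducible in tests and audits.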

Confidence scoring

Have the model self-assess confidence in evidence and only include findings above a threshold. You can optionally add a “confidence” field in your JSON schema and filter below a cutoff.
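
A minimal sketch of that filter, assuming the schema has been extended with a numeric `confidence` field between 0 and 1. Both the field's scale and the 0.7 default cutoff are illustrative choices, and `filter_by_confidence` is a hypothetical name.

```python
def filter_by_confidence(findings: list[dict], threshold: float = 0.7) -> list[dict]:
    """Keep findings whose self-reported confidence meets the threshold.

    Findings with a missing or non-numeric confidence are dropped, on the
    assumption that an unscored finding should not reach the PR.
    """
    kept = []
    for finding in findings:
        confidence = finding.get("confidence")
        is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
        if is_number and confidence >= threshold:
            kept.append(finding)
    return kept
```

Self-reported confidence is itself model output, so it is worth checking against your dismissal data before relying on a particular cutoff.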

Deduplication and stability

Dedupe by file:line:title:category key.
Pin model version and temperature to maintain consistent outputs.
Cache analyses of unchanged files by content hash so that new pushes to the same PR do not re-analyze them and produce noise.

9) End-to-End Examples: Prompts and Outputs

Example: Insecure SQL construction (OWASP Injection)

diff --git a/userRepo.js b/userRepo.js
--- a/userRepo.js
+++ b/userRepo.js
@@ -33,7 +33,9 @@ export async function findUser(db, username) {
-  const q = `SELECT * FROM users WHERE name = '${username}'`;
-  return db.query(q);
+  // Use parameterized query to avoid SQL injection
+  const q = "SELECT * FROM users WHERE name = $1";
+  return db.query(q, [username]);
}

Model output (JSON excerpt)

{
  "summary": {
    "overview": "Replaced string-concatenated SQL with parameterized query to mitigate injection risk.",
    "risk_score": 3,
    "categories": ["security","maintainability"]
  },
  "findings": [
    {
      "id": "sec-sqlinj-001",
      "file": "userRepo.js",
      "line": 34,
      "severity": "critical",
      "category": "security",
      "title": "SQL injection risk via string concatenation",
      "description": "User input 'username' was embedded into a SQL string without parameterization.",
      "evidence": "Original: `SELECT * FROM users WHERE name = '${username}'`",
      "owasp": ["A03:2021-Injection"],
      "tags": ["sql","node","db"],
      "suggestion": {
        "type": "patch",
        "content": "Use parameterized query: db.query('SELECT * FROM users WHERE name = $1', [username])"
      }
    }
  ]
}

Example: Leaked secret in config

diff --git a/config.py b/config.py
--- a/config.py
+++ b/config.py
@@ -3,6 +3,7 @@
-API_KEY = "sk-live-1234567890abcdef"
+API_KEY = os.getenv("API_KEY")

Model output (JSON excerpt)

{
  "findings": [
    {
      "id": "sec-secret-001",
      "file": "config.py",
      "line": 4,
      "severity": "high",
      "category": "security",
      "title": "Hard-coded secret",
      "description": "A live API key string was present in config. Moved to environment variable.",
      "evidence": "Found pattern 'sk-live-********'",
      "owasp": ["A02:2021-Cryptographic Failures"],
      "tags": ["secret","config"],
      "suggestion": {
        "type": "patch",
        "content": "Rotate the key and use environment variables or a secret manager. Ensure the old key is revoked."
      }
    }
  ]
}

Example: Maintainability (complex function)

diff --git a/utils.js b/utils.js
--- a/utils.js
+++ b/utils.js
@@ -10,7 +10,18 @@ export function process(items) {
-  // 120-line function (omitted) with nested logic
+  // Consider splitting into smaller pure functions:
+  // - parseInput
+  // - transform
+  // - validate
+  // Improves testability and reduces cyclomatic complexity.
}

Model output (JSON excerpt)

{
  "findings": [
    {
      "id": "maint-complexity-001",
      "file": "utils.js",
      "line": 12,
      "severity": "low",
      "category": "maintainability",
      "title": "High cyclomatic complexity",
      "description": "Function exceeds team complexity threshold; refactor into smaller units.",
      "evidence": "Heuristics indicate >80 LOC and multiple nested branches.",
      "owasp": [],
      "tags": ["complexity","refactor"],
      "suggestion": { "type":"explanation", "content":"Extract pure functions and add unit tests per unit." }
    }
  ]
}
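The "heuristics" mentioned in the evidence field can also be computed deterministically and handed to the model as a hint. The example above is JavaScript; the sketch below shows an analogous check for Python sources using the standard-library `ast` module. The thresholds and the set of node types counted as branches are assumptions, and the count is a rough proxy rather than a true cyclomatic complexity.

```python
import ast

# Node types counted as branch points (an approximation).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp, ast.IfExp, ast.ExceptHandler)


def complexity_report(source: str, max_loc: int = 80, max_branches: int = 10) -> list[dict]:
    """Return functions whose length or branch count exceeds the thresholds."""
    report = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            loc = node.end_lineno - node.lineno + 1
            branches = sum(isinstance(n, BRANCH_NODES) for n in ast.walk(node))
            if loc > max_loc or branches > max_branches:
                report.append(
                    {"name": node.name, "line": node.lineno, "loc": loc, "branches": branches}
                )
    return report


sample = (
    "def f(x):\n"
    "    if x:\n"
    "        return 1\n"
    "    if x > 1:\n"
    "        return 2\n"
    "    for i in range(x):\n"
    "        pass\n"
    "    return 0\n"
    "\n"
    "def g():\n"
    "    return 1\n"
)
print(complexity_report(sample, max_branches=2))
```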

10) Cost Management for High-Volume Repositories

Token budgeting

Cap per-diff characters: truncate long patches and perform targeted second passes only for risky files.
Analyze only changed files; skip renames with no content change.
Use a fast “pre-screen” model to tag likely-safe files; send risky files to a stronger model.
Cache analyses by file content hash to avoid re-analyzing unchanged chunks across pushes.
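As a rough illustration of the caching idea, the sketch below keys each stored analysis on a SHA-256 hash of the patch text plus a prompt version string, so a prompt change invalidates old entries. The `cached_analysis` helper, the `version` parameter, and the on-disk JSON layout are all assumptions; `fake_analyze` stands in for the real model call.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


def cached_analysis(
    patch: str,
    analyze: Callable[[str], dict],
    cache_dir: Path,
    version: str = "prompt-v1",  # hypothetical prompt/model version tag
) -> dict:
    """Return a stored analysis for this exact patch, or compute and store it."""
    key = hashlib.sha256(f"{version}\0{patch}".encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = analyze(patch)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result


calls = []


def fake_analyze(patch: str) -> dict:
    # Stand-in for the model call; records each invocation.
    calls.append(patch)
    return {"findings": []}


with tempfile.TemporaryDirectory() as tmp:
    first = cached_analysis("+ x = 1", fake_analyze, Path(tmp))
    second = cached_analysis("+ x = 1", fake_analyze, Path(tmp))

print(len(calls))  # 1: the second lookup is served from the cache
```

In CI, the cache directory would need to persist between runs (for example via the CI system's cache feature) for this to save anything.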

Approximate token math

Roughly 3–4 characters per token for English-like text; code can vary.
Estimate tokens = (characters in system prompt + user prompt + diff chunk) / chars-per-token.
Compute spend = tokens_in/1M * model_input_price + tokens_out/1M * model_output_price.
Set a max budget per run and degrade gracefully when it is exceeded (e.g., skip low-severity categories or stop analyzing the remaining files).
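The arithmetic above can be sketched in a few lines. The function names are hypothetical, and the prices in the usage example are placeholders, not real rates for any model; substitute your provider's current per-million-token pricing.

```python
import math


def estimate_tokens(*texts: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from total character count, rounded up."""
    return math.ceil(sum(len(t) for t in texts) / chars_per_token)


def estimate_spend(
    tokens_in: int,
    tokens_out: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Spend = tokens_in/1M * input price + tokens_out/1M * output price."""
    return tokens_in / 1_000_000 * input_price_per_m + tokens_out / 1_000_000 * output_price_per_m


# 2,000 + 1,000 + 9,000 characters at ~4 chars per token.
tokens_in = estimate_tokens("s" * 2000, "u" * 1000, "d" * 9000)
print(tokens_in)  # 3000

# Placeholder prices (USD per 1M tokens), for illustration only.
cost = estimate_spend(tokens_in, tokens_out=500, input_price_per_m=1.00, output_price_per_m=4.00)
print(round(cost, 4))
```

For tighter estimates you can swap the character heuristic for a real tokenizer, at the cost of an extra dependency.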

Practical cost controls

File filters: exclude generated assets, snapshots, lockfiles from model passes; handle lockfiles with deterministic dependency scanners.
Structured output: JSON is shorter than verbose prose; keeps output tokens low.
Throttle concurrency to avoid spikes; pipeline runs can be serialized for large PRs.
Use differential analysis: only analyze hunks that changed since the last analyzed commit.
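A file filter along these lines can run before any model call. The pattern list is an example to adapt per repository, and the `status` and `changes` parameters assume your diff collector exposes a per-file change status and changed-line count (as the GitHub pull request files API does).

```python
from fnmatch import fnmatch

# Example skip list; tune for your repository.
SKIP_PATTERNS = ["*.lock", "package-lock.json", "*.snap", "*.min.js", "dist/*"]


def should_send_to_model(path: str, status: str = "modified", changes: int = 1) -> bool:
    """Return False for files that should bypass the model pass."""
    if status == "renamed" and changes == 0:
        return False  # pure rename, no content change
    name = path.rsplit("/", 1)[-1]
    # Match against both the full path and the bare file name.
    return not any(fnmatch(path, p) or fnmatch(name, p) for p in SKIP_PATTERNS)


for candidate in ["src/app.py", "yarn.lock", "frontend/package-lock.json", "dist/bundle.js"]:
    print(candidate, should_send_to_model(candidate))
```

Lockfiles filtered out here should still go to the deterministic dependency scanner rather than being dropped entirely.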

Example: enforcing a run budget

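let tokenBudget = 200_000; // Example input-token budget per run
const PROMPT_OVERHEAD_CHARS = 3000; // rough size of system + user prompt scaffolding
for (const chunk of chunks) {
  // Rough char-to-token conversion (~4 chars per token), rounded up.
  const est = Math.ceil((chunk.patch.length + PROMPT_OVERHEAD_CHARS) / 4);
  if (est > tokenBudget) { console.warn("Budget exceeded, skipping remaining files."); break; }
  tokenBudget -= est;
  await analyzeChunk(chunk, config);
}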

11) Best Practices for Human-in-the-Loop Reviews

Workflow design

Advisory by default: post comments without blocking merges, except for critical/high issues.
Clear escalation: criticals require human validation; assign reviewers automatically (CODEOWNERS).
Patch suggestions: when safe, have the model provide minimal patches; reviewers can accept or adjust.
Decision log: post a summary comment listing accepted/rejected findings for traceability.

Reviewer ergonomics

Aggregate summary at top; line comments for top-severity findings only.
Labels/tags for quick triage: security, performance, maintainability.
Suppressions with context: allow “codex-ignore:<id>” comments with rationale and expiry.
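To make suppressions enforceable, the agent needs a parseable comment format. The sketch below assumes a hypothetical syntax, `codex-ignore:<id> until=YYYY-MM-DD reason=<text>`; only the `codex-ignore:<id>` prefix comes from this guide, and the `until` and `reason` fields are one possible way to carry the expiry and rationale.

```python
import re
from datetime import date

# Hypothetical suppression syntax: codex-ignore:<id> until=YYYY-MM-DD reason=<text>
SUPPRESS_RE = re.compile(
    r"codex-ignore:(?P<id>[\w-]+)\s+until=(?P<until>\d{4}-\d{2}-\d{2})\s+reason=(?P<reason>.+)"
)


def active_suppressions(source: str, today: date) -> dict[str, str]:
    """Map finding id -> rationale for suppressions that have not expired."""
    active = {}
    for match in SUPPRESS_RE.finditer(source):
        if date.fromisoformat(match["until"]) >= today:
            active[match["id"]] = match["reason"].strip()
    return active


source = (
    "x = eval(data)  # codex-ignore:sec-eval-001 until=2030-01-01 reason=input is sandboxed\n"
    "y = slow(data)  # codex-ignore:perf-loop-002 until=2020-01-01 reason=legacy path\n"
)
print(active_suppressions(source, today=date(2025, 1, 1)))
# {'sec-eval-001': 'input is sandboxed'}
```

Expired suppressions simply drop out of the result, so the corresponding finding resurfaces on the next run.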

Blending with static tools

Run ESLint/Flake8/Semgrep/Bandit/Snyk in parallel; combine signals to reduce noise.
Let the model “explain” static tool outputs with code context and remediation steps.
Only block on issues with corroboration or high-confidence, high-impact evidence.
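The blocking rule above can be expressed as a small gate. This is a sketch under assumptions: `static_hits` is a set of `(file, line)` pairs collected from your static tools, and `confidence` is a hypothetical extra field you would ask the model to emit (it is not part of the JSON schema in the appendix).

```python
BLOCKING_SEVERITIES = {"critical", "high"}


def should_block(
    finding: dict,
    static_hits: set[tuple[str, int]],
    min_confidence: float = 0.8,
) -> bool:
    """Block only on critical/high findings that are corroborated or high-confidence."""
    if finding["severity"] not in BLOCKING_SEVERITIES:
        return False
    corroborated = (finding["file"], finding.get("line")) in static_hits
    return corroborated or finding.get("confidence", 0.0) >= min_confidence


static_hits = {("userRepo.js", 34)}  # e.g. locations flagged by Semgrep
finding = {"file": "userRepo.js", "line": 34, "severity": "critical"}
print(should_block(finding, static_hits))  # True: a static tool flagged the same location
```

Everything that fails the gate is still posted as an advisory comment; the gate only decides whether the check fails.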

12) Observability, Metrics, and Continuous Evaluation

What to measure

Precision proxy: ratio of accepted vs. dismissed findings by severity and category. For a recall proxy, also track issues found later (by reviewers or in production) that the agent missed.
Cost per PR: tokens consumed and spend per category.
Latency: time from PR open to comments posted.
Drift: changes in model outputs across versions.
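As one way to compute the accepted-vs-dismissed ratio, the sketch below aggregates reviewer decisions by severity. The `decisions` record shape (a `severity` string plus an `accepted` boolean) is an assumption about how you log reviewer feedback.

```python
from collections import defaultdict


def acceptance_by_severity(decisions: list[dict]) -> dict[str, float]:
    """Share of findings that reviewers accepted, per severity level."""
    totals: dict[str, int] = defaultdict(int)
    accepted: dict[str, int] = defaultdict(int)
    for decision in decisions:
        totals[decision["severity"]] += 1
        accepted[decision["severity"]] += int(decision["accepted"])
    return {severity: accepted[severity] / totals[severity] for severity in totals}


decisions = [
    {"severity": "high", "accepted": True},
    {"severity": "high", "accepted": False},
    {"severity": "low", "accepted": False},
]
print(acceptance_by_severity(decisions))  # {'high': 0.5, 'low': 0.0}
```

A persistently low acceptance rate in one severity or category is a signal to tighten the prompt or raise the reporting threshold there.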

Logging and storage

Store prompt IDs, hashed diffs, model, temperature, and findings (without secrets) for audit.
Keep a small corpus of “golden PRs” to regression-test prompts and model versions.

Automated evaluation loop

Periodically re-run the agent on golden PRs; inspect deltas in findings.
Refine rules and evidence hints based on human feedback trends.
Track false-positive patterns and encode suppressions or clarifications in prompts.
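A golden-PR regression check can be as simple as diffing two sets of findings. In this sketch, findings are matched on `(file, category, title)`; that key is an assumption, chosen to exclude line numbers and wording in descriptions, which tend to shift between model versions.

```python
def finding_key(finding: dict) -> tuple[str, str, str]:
    # Line numbers and descriptions are left out on purpose: they drift between runs.
    return (finding["file"], finding["category"], finding["title"])


def diff_findings(baseline: list[dict], current: list[dict]) -> dict[str, list]:
    """Report findings that disappeared from or were added to a golden PR's results."""
    base = {finding_key(f) for f in baseline}
    cur = {finding_key(f) for f in current}
    return {"missing": sorted(base - cur), "new": sorted(cur - base)}


baseline = [
    {"file": "userRepo.js", "category": "security", "title": "SQL injection risk"},
    {"file": "config.py", "category": "security", "title": "Hard-coded secret"},
]
current = [
    {"file": "userRepo.js", "category": "security", "title": "SQL injection risk"},
    {"file": "utils.js", "category": "maintainability", "title": "High cyclomatic complexity"},
]
print(diff_findings(baseline, current))
```

Titles are model-generated and may be reworded across versions, so expect some noise; a stable `id` scheme would make a stricter key possible.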

13) Next Steps

With the reference implementation in place, iterate towards a mature review assistant:

Expand language support and specialized prompts (e.g., Terraform, Kubernetes manifests, mobile)
Add auto-fix PRs for purely mechanical changes (e.g., sync I/O to async)
Introduce a review dashboard aggregating findings across repos and sprints
Set up a lightweight approval flow where reviewers can accept model-suggested patches directly

For deeper prompt design patterns and integrating with build systems, see:
Advanced Prompt Engineering for Code Review Automation,
GitHub Actions Integration with OpenAI Models,
and OWASP Top 10 Security Checks with LLMs.

Appendix: Robust Prompt Template (Drop-In)

System:
You are an autonomous senior code reviewer (Codex-class). Output strict JSON:
{
  "summary": { "overview": string, "risk_score": number, "categories": string[] },
  "findings": [
    {
      "id": string,
      "file": string,
      "line": number | null,
      "severity": "critical" | "high" | "medium" | "low" | "info",
      "category": "security" | "performance" | "maintainability" | "style" | "dependency",
      "title": string,
      "description": string,
      "evidence": string,
      "owasp": string[],
      "tags": string[],
      "suggestion": { "type": "explanation" | "patch", "content": string }
    }
  ]
}
Requirements:
- Report only with strong evidence (quote exact code).
- Use OWASP tags for security when applicable.
- Avoid duplicates. Group similar findings.
- Minimal safe patches for straightforward fixes.
- Respect team rules: <rulesNote>.

User:
Analyze this <LANGUAGE> unified diff and produce JSON per schema.

File: <FILENAME>
Changes: +<ADDS> / -<DELS>

<PATCH>

Focus: security (OWASP), performance, maintainability, dependency risks (if provided).
Constraints: concise findings; if none, return empty findings with a summary.
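Because the model is only asked to follow this schema, it is worth checking the response before acting on it. The sketch below is a minimal, dependency-free validator for the fields listed in the template; the `validate_review` name is hypothetical, and a JSON Schema library would be a reasonable replacement.

```python
import json

SEVERITIES = {"critical", "high", "medium", "low", "info"}
CATEGORIES = {"security", "performance", "maintainability", "style", "dependency"}
REQUIRED = {
    "id", "file", "line", "severity", "category", "title",
    "description", "evidence", "owasp", "tags", "suggestion",
}


def validate_review(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the output is usable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict) or not isinstance(data.get("findings"), list):
        return ["top-level object must contain a 'findings' array"]
    errors = []
    for i, finding in enumerate(data["findings"]):
        if not isinstance(finding, dict):
            errors.append(f"findings[{i}] is not an object")
            continue
        missing = REQUIRED - finding.keys()
        if missing:
            errors.append(f"findings[{i}] missing fields: {sorted(missing)}")
        if finding.get("severity") not in SEVERITIES:
            errors.append(f"findings[{i}] has unknown severity")
        if finding.get("category") not in CATEGORIES:
            errors.append(f"findings[{i}] has unknown category")
        if finding.get("line") is not None and not isinstance(finding["line"], int):
            errors.append(f"findings[{i}] line must be an integer or null")
    return errors


empty_review = json.dumps({"summary": {"overview": "No issues.", "risk_score": 0, "categories": []}, "findings": []})
print(validate_review(empty_review))  # []
print(validate_review("not json"))
```

On a non-empty error list, a common choice is to retry once and otherwise skip posting comments for that chunk.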

Appendix: Mapping JSON to PR Comments

function findingToComment(f) {
  const header = `[${f.severity.toUpperCase()}][${f.category}] ${f.title}`;
  const owasp = f.owasp?.length ? `OWASP: ${f.owasp.join(", ")}` : "";
  const evidence = f.evidence ? `Evidence:\n${f.evidence}` : "";
  const suggestion = f.suggestion ? `Suggestion (${f.suggestion.type}):\n${f.suggestion.content}` : "";
  // Drop empty sections so the comment has no stray blank lines.
  return [header, f.description, owasp, evidence, suggestion].filter(Boolean).join("\n\n");
}

Appendix: Semgrep + Codex-class Hybrid

# Run semgrep rules; then have the model explain and prioritize findings
semgrep --config p/owasp-top-ten --json > semgrep.json

// explain_semgrep.js
import fs from "node:fs";
import OpenAI from "openai";
const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const semgrep = JSON.parse(fs.readFileSync("semgrep.json","utf8"));

const MODEL = "gpt-4o-mini";

const messages = [
  { role: "system", content:
    "You are a Codex-class reviewer. Given Semgrep findings and code snippets, group duplicates, prioritize by exploitability, and output strict JSON with 'findings' including severity, title, description, evidence, owasp, suggestion." },
  { role: "user", content: JSON.stringify(semgrep) }
];

const resp = await client.chat.completions.create({
  model: MODEL, temperature: 0.1, response_format: { type: "json_object" }, messages
});

console.log(resp.choices[0].message.content);
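The script above sends the entire Semgrep report to the model, which can be large on real repositories. A pre-processing step like the following Python sketch keeps only the fields the prompt needs. The field names (`results`, `check_id`, `path`, `start.line`, `extra.message`, `extra.severity`, `extra.lines`) reflect Semgrep's JSON output as commonly documented; verify them against the version you run. The caps on result count and snippet length are arbitrary assumptions.

```python
import json


def compact_semgrep(report: dict, max_results: int = 50, max_snippet_chars: int = 300) -> list[dict]:
    """Reduce a Semgrep JSON report to the fields needed for prioritization."""
    compact = []
    for result in report.get("results", [])[:max_results]:
        extra = result.get("extra", {})
        compact.append(
            {
                "check_id": result.get("check_id"),
                "path": result.get("path"),
                "line": result.get("start", {}).get("line"),
                "severity": extra.get("severity"),
                "message": extra.get("message"),
                "snippet": (extra.get("lines") or "")[:max_snippet_chars],
            }
        )
    return compact


# Tiny hand-written report in the assumed shape, for illustration.
report = {
    "results": [
        {
            "check_id": "example.rule.sql-injection",
            "path": "userRepo.js",
            "start": {"line": 34},
            "extra": {"severity": "ERROR", "message": "Possible SQL injection", "lines": "const q = ..."},
        }
    ],
    "errors": [],
}
print(json.dumps(compact_semgrep(report), indent=2))
```

The compacted list can then be serialized and passed as the user message in place of the raw report.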

By combining deterministic scanners with Codex-class reasoning, you can typically improve signal-to-noise and produce more actionable reviews.
