Inside A YC Startup: How They Shipped a Full-Stack App Using AI Coding Agents


⚡ The Brief

  • What it is: A detailed case study of how a three-person YC startup called Relayboard shipped a production SaaS app to paying users in 17 days using AI coding agents, primarily GPT-5 Codex and Claude Opus 4.7, as core development team members.
  • Who it’s for: Technical founders, CTOs, and developer teams at early-stage startups who want to understand how agentic AI workflows can replace or augment traditional engineering bandwidth at seed and pre-seed stages.
  • Key takeaways: Over 40,000 lines of code were AI-generated across a Next.js 15 frontend, tRPC/Node backend, Temporal workers, and Terraform infra; humans primarily wrote specs and reviewed diffs. Specialized agents with narrow interfaces outperformed monolithic AI prompting approaches.
  • Pricing/Cost: Total AI agent costs stayed below the equivalent salary cost of one senior engineer for the same period, achieved through prompt caching, tool-use APIs, and routing low-complexity boilerplate and refactoring to Gemini 3.1 Flash Lite.
  • Bottom line: For resource-constrained YC startups, orchestrating multiple AI coding agents as first-class team members, not glorified autocomplete, is a proven, repeatable path to hitting demo-day velocity expectations with fewer than three engineers.


Why AI Coding Agents Matter Inside a YC Startup in 2026

A three-person YC startup pushed a working SaaS product to paying users in 17 days, with fewer than 900 human-written lines of code. Everything else, over 40k lines across frontend, backend, infra, and tests, was generated and iterated by AI coding agents orchestrated around GPT-5 Codex and Claude Opus 4.7.

Based on community reports and YC batch surveys, that cadence is no longer an outlier. Reported figures from YC W24 and W25 founder discussions suggest that a majority of teams use AI assistance for the bulk of their first production code, and a meaningful minority describe agentic workflows as "primary developers" rather than helpers. For a subset of teams, the most senior "engineer" in the room is now an orchestration layer coordinating multiple models plus CI tools.

The constraints are familiar: two technical founders, one doing product and GTM, the other nominally the "CTO" but spending half the batch on fundraising and customer calls. Hiring is slow, equity is expensive, and burn is non-negotiable. Yet expectations around velocity are higher than they've ever been. Demo-day-ready means:

  • Polished, responsive frontend with real users and stateful auth
  • Non-trivial backend logic with integrations (Stripe, Slack, email, etc.)
  • Reasonable test coverage and basic observability
  • CI/CD that can keep up with daily or hourly pushes

The gap between what two humans can code manually and what investors expect by week 4 is wide. That is the gap AI coding agents are filling when used as first-class citizens in the stack rather than glorified autocomplete.

This article walks through how one YC startup, call it "Relayboard," built and shipped a production-grade full-stack app using AI agents as core team members. The focus is not aspirational demos, but the specific architectures, prompts, tools, and trade-offs that actually held up under real traffic and paying customers.

The Relayboard team started from a cold repository and a product spec for "a shared ops dashboard for B2B teams," integrating calendar, tickets, and lightweight automation. Four weeks later they had:

  • A Next.js 15 / React 19 frontend with Tailwind CSS
  • A tRPC + Node backend with Prisma + Postgres
  • Background workers on Temporal for long-running automations
  • Stripe billing, Slack and Google Calendar integrations
  • End-to-end tests in Playwright; API tests in Jest
  • Infra on AWS (ECS + RDS + CloudFront) provisioned via Terraform

Human engineers primarily wrote specs, reviewed diffs, and resolved ambiguous product trade-offs. GPT-5 Codex (source) and Claude Opus 4.7 (source) handled nearly all implementation. Gemini 3.1 Flash Lite filled in as a fast, low-cost agent for boilerplate and refactoring. Prompt caching and tool-use APIs kept latency manageable and costs below what a single senior engineer would have cost for the same period.

If you are trying to understand how far you can push AI agents inside your own startup, and where the sharp edges still are, this is the pattern worth dissecting.

For a closer look at the tools and patterns covered here, see our analysis in How to Use OpenAI Codex in ChatGPT for Full-Stack Development Projects, which covers the practical implementation details and trade-offs relevant to engineering teams shipping production AI systems.


Inside the Architecture: How They Shipped a Full-Stack App Using AI Agents

Relayboard's core insight was simple: treat each AI model as a specialized contributor with a narrow, well-defined interface, not as a single omniscient "coder." The system architecture looked less like one big chatbot and more like a micro-team:

  • Spec Agent – converts product requirements into technical design artifacts
  • Frontend Agent – owns React/Next.js UI implementation
  • Backend Agent – owns API, data models, and business logic
  • Infra Agent – owns Terraform, Docker, GitHub Actions
  • Test Agent – generates and maintains tests
  • Refactor/Docs Agent – handles cleanup, comments, and docs

Each of these agents used different models and temperature settings, with a central orchestrator deciding which agent to call and with what context. Structurally, the orchestrator looked closer to a workflow engine than a conventional chat UI.

Model choices and roles

The team standardized on three primary models, all available via public APIs as of 2026 (source):

  • GPT-5 Codex ($1.25/$10 per M tokens, 400k context, released 2025-09-23) – primary code-generation engine; strong on multi-file edits and tool-use
  • Claude Opus 4.7 ($5/$25 per M tokens, 1M context, released 2026-04-16) – long-context reasoning for specs, architecture, and refactors
  • Gemini 3.1 Flash Lite ($0.25/$1.50 per M tokens, 1M context) – fast, cheap agent for repetitive transformations and small diffs

Rough division of labor:

  • Spec Agent, Refactor/Docs Agent → Claude Opus 4.7
  • Frontend/Backend/Infra Agents → GPT-5 Codex
  • Test Agent + mechanical changes (rename, lint, comments) → Gemini 3.1 Flash Lite

Context-window sizes actually mattered. Claude Opus 4.7, with its 1M-token context window, could ingest:

  • Full routes map
  • Database schema (Prisma)
  • Key backend services
  • Selected frontend pages

That allowed the Spec Agent to suggest consistent architecture decisions across the stack, avoiding the traditional "agents don't know what other agents did yesterday" problem.

Repository-aware agents via tools

Instead of pasting files into prompts, the orchestrator exposed the codebase and infra as tools. A simplified tool schema for GPT-5 Codex looked like:

{
  "tools": [
    {
      "name": "read_file",
      "description": "Read file contents from the repo",
      "parameters": {
        "type": "object",
        "properties": {
          "path": { "type": "string" }
        },
        "required": ["path"]
      }
    },
    {
      "name": "write_file",
      "description": "Create or overwrite a file",
      "parameters": {
        "type": "object",
        "properties": {
          "path": { "type": "string" },
          "content": { "type": "string" }
        },
        "required": ["path", "content"]
      }
    },
    {
      "name": "list_files",
      "description": "List files under a directory",
      "parameters": {
        "type": "object",
        "properties": {
          "path": { "type": "string" }
        },
        "required": ["path"]
      }
    },
    {
      "name": "run_tests",
      "description": "Run test suite or subset and return results",
      "parameters": {
        "type": "object",
        "properties": {
          "scope": { "type": "string" }
        },
        "required": ["scope"]
      }
    }
  ]
}

Function-calling allowed GPT-5 Codex to inspect the current state of the repo, plan changes, and iteratively apply patches. The orchestrator enforced guardrails: no writes outside src/, infra/, tests/, and no tool calls that could access secrets or live AWS accounts without human confirmation.
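
A minimal sketch of how such a path guardrail could be enforced before any write_file call is executed (the allowlist, the blocked patterns, and the function name are assumptions for illustration, not Relayboard's actual code):

import path from "node:path";

// Hypothetical pre-flight check the orchestrator runs on every write_file tool call.
const ALLOWED_WRITE_PREFIXES = ["src/", "infra/", "tests/"];
const BLOCKED_PATTERNS = [/\.env/, /secrets/i, /\.pem$/];

export function validateWritePath(requestedPath: string): { ok: boolean; reason?: string } {
  // Normalize to defeat "../" escapes out of the repo root
  const normalized = path.posix.normalize(requestedPath);
  if (normalized.startsWith("..") || path.posix.isAbsolute(normalized)) {
    return { ok: false, reason: "Path escapes the repository root" };
  }
  if (!ALLOWED_WRITE_PREFIXES.some((prefix) => normalized.startsWith(prefix))) {
    return { ok: false, reason: `Writes are restricted to ${ALLOWED_WRITE_PREFIXES.join(", ")}` };
  }
  if (BLOCKED_PATTERNS.some((re) => re.test(normalized))) {
    return { ok: false, reason: "Path looks secret-related; requires human confirmation" };
  }
  return { ok: true };
}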

Prompt scaffolding: system vs developer prompts

The stability of agents came from careful layering of system and developer prompts:

  • System prompt – global, model-specific behavior: style, constraints, safety
  • Developer prompt – per-agent role, stack details, and project conventions
  • User prompt – specific task ("Implement customer billing page with these fields…")

An excerpt from the Backend Agentโ€™s developer prompt:

You are the Backend Agent for the Relayboard app.

Stack:
- Node 22, TypeScript
- tRPC for API layer
- Prisma for Postgres schema and access
- Zod for input validation
- Redis for caching

Conventions:
- All endpoints must be tRPC procedures under src/server/routers
- All DB access goes through Prisma client
- Validation in Zod schemas colocated with routers
- Prefer pure functions; avoid side effects in request handlers

Rules:
- Before writing code, call list_files and read_file to inspect existing patterns.
- Reuse existing utility functions and types where possible.
- After changes, call run_tests with scope="api" and fix any failing tests.

Output:
- Use only the provided tools to modify files.
- Do not invent new libraries without explicit instruction.

Spec Agent prompts encouraged more deliberate chain-of-thought reasoning, but kept it out of the logs founders might skim. The orchestrator requested rationale in a structured JSON field (for internal use) and a concise "plan" summary for humans. This separation meant long reasoning did not clutter git history or PRs but was still available for debugging agent decisions.
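
A sketch of what that structured output contract could look like, written as a Zod schema in keeping with the rest of the stack (field names are illustrative, not Relayboard's actual schema):

import { z } from "zod";

// Hypothetical response contract for the Spec Agent: detailed reasoning stays
// machine-readable and log-only, while `plan` is the short summary humans see.
export const specAgentOutput = z.object({
  rationale: z.array(z.string()),     // step-by-step reasoning, kept out of PRs
  plan: z.string().max(1200),         // concise summary surfaced to founders
  affectedFiles: z.array(z.string()), // paths downstream agents should touch
  openQuestions: z.array(z.string()), // ambiguities that need a human decision
});

export type SpecAgentOutput = z.infer<typeof specAgentOutput>;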

For a closer look at the tools and patterns covered here, see our analysis in The Complete Google AI Stack 2026: 50+ Tools, Cloud Next Keynote Breakdown, and How They Compare to OpenAI, Anthropic & Microsoft, which covers the practical implementation details and trade-offs relevant to engineering teams shipping production AI systems.

Prompt caching and latency

Long system and developer prompts can dominate context and costs. Relayboard used server-side prompt templates with caching:

  • System + developer prompts registered once per agent per model version
  • Only the user prompt and recent tool-call state varied per task
  • OpenAI's and Anthropic's prompt-caching features were used wherever available, cutting prompt billing by an estimated 30–40% based on the team's internal logs
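
On the Anthropic side, caching works by marking the stable prefix of the prompt with a cache_control breakpoint. A minimal sketch, assuming the agent's long developer prompt is the cacheable part (the model id and prompt constant are placeholders):

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();
const BACKEND_AGENT_DEVELOPER_PROMPT = "...long, rarely-changing role prompt from the repo...";

async function runBackendTask(userTask: string) {
  return anthropic.messages.create({
    model: "claude-opus-4-7", // placeholder model id
    max_tokens: 4096,
    system: [
      {
        type: "text",
        text: BACKEND_AGENT_DEVELOPER_PROMPT, // stable prefix shared across tasks
        cache_control: { type: "ephemeral" }, // everything up to this block is cached
      },
    ],
    // Only this short task varies per request and is billed at full input rates
    messages: [{ role: "user", content: userTask }],
  });
}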

Latency for a multi-step feature (e.g., "add recurring billing with prorations") typically ran in the 90–180 second range end-to-end based on the team's telemetry: design, code edits, tests, refactor. That was acceptable because orchestrations ran asynchronously; founders reviewed diffs after the fact, similar to PR reviews from a remote teammate.


Implementation Walkthrough: From Spec to Production Deployment


To make this concrete, consider a single feature the team shipped entirely via agents: "Add a billing settings page where admins can upgrade plans, view invoices, and manage payment methods. Use Stripe. Respect existing role-based access control. Ensure end-to-end tests pass."

Step 1: Product spec to technical design

The human founder wrote a 1.5-page Notion doc with:

  • User stories (admin, member, billing manager)
  • Wireframe screenshots
  • Stripe object fields that should appear in UI
  • Non-goals for the first iteration

The Spec Agent (Claude Opus 4.7) used a RAG layer to pull:

  • Existing RBAC policy docs
  • Database schema: users, teams, subscriptions
  • Stripe integration code already used for initial checkout

It then generated a technical design doc stored in docs/billing-design-v1.md:

  • New endpoints: team.billing.getPortalUrl, team.billing.getInvoices
  • DB schema changes: additional Stripe customer metadata
  • Required UI components in src/app/settings/billing
  • Error states and loading behaviors

Humans skimmed and lightly edited this design, then marked it approved in Notion. This approval triggered the orchestrator to create a "feature workflow" consisting of four tasks: Backend, Frontend, Tests, and Infra (a small update to the webhook URL).
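
The workflow itself can be little more than a persisted record with one entry per task; a hypothetical shape (field names are illustrative, not Relayboard's schema):

// Hypothetical feature-workflow record the orchestrator persists when a design doc is approved.
interface FeatureWorkflow {
  featureId: string;   // e.g. "billing-settings-v1"
  designDoc: string;   // e.g. "docs/billing-design-v1.md"
  tasks: Array<{
    agent: "backend" | "frontend" | "tests" | "infra";
    status: "pending" | "running" | "pr_open" | "merged" | "failed";
    dependsOn: string[]; // e.g. frontend waits for the backend API contract
    prUrl?: string;
  }>;
}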

Step 2: Backend implementation with GPT-5 Codex

The Backend Agent received:

  • Link to the approved design doc
  • Paths to relevant routers and Prisma schema files
  • Instructions to avoid new abstractions unless necessary

The agent's chain looked like:

  1. Call list_files on src/server/routers to locate existing team-related endpoints
  2. Call read_file on team.ts router and auth middleware
  3. Draft new tRPC procedures using Stripe SDK already instantiated in a shared stripe.ts
  4. Call write_file to add new procedures, keeping changes minimal and localized
  5. Call run_tests with scope "api"
  6. On failure, repeat until tests pass or the retry limit is hit (usually 2–3 attempts)
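
A compressed sketch of that write-test-fix loop as the orchestrator might drive it (helper names such as applyToolCalls and withFailureContext are assumptions layered on the pseudo-TypeScript shown later):

// Hypothetical retry loop: request patches until the api-scoped tests pass
// or the attempt budget is exhausted, then hand off to a human.
const MAX_ATTEMPTS = 3;

async function implementWithTests(task: TaskRequest): Promise<"passed" | "needs_human"> {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    const response = await callLLM("gpt-5-codex", buildMessages(backendConfig, task));
    await applyToolCalls(response);                   // executes read_file / write_file calls
    const result = await runTests({ scope: "api" });  // the run_tests tool
    if (result.passed) return "passed";
    task = withFailureContext(task, result.failures); // feed failures into the next attempt
  }
  return "needs_human";
}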

New endpoints were fully implemented without human keypresses. Human review focused on Stripe usage correctness and ensuring no sensitive data leaked to the client.

Step 3: Frontend implementation with contract-first approach

Next, the Frontend Agent (GPT-5 Codex) acted only after the Backend Agent registered its API contract in a JSON schema document automatically generated by a small utility:

{
  "team.billing.getInvoices": {
    "input": { "teamId": "string" },
    "output": [
      {
        "id": "string",
        "amount": "number",
        "currency": "string",
        "status": "string",
        "createdAt": "string"
      }
    ]
  },
  "team.billing.getPortalUrl": {
    "input": { "teamId": "string", "returnUrl": "string" },
    "output": { "url": "string" }
  }
}

The Frontend Agent's developer prompt required:

  • Use Tailwind and existing design tokens
  • Use tRPC hooks like trpc.team.billing.getInvoices.useQuery
  • Handle loading, error, and empty states explicitly
  • No inline styling; use existing component primitives

Its orchestration flow:

  1. Read src/app/settings/layout.tsx to integrate the new "Billing" tab
  2. Create src/app/settings/billing/page.tsx with basic skeleton
  3. Integrate tRPC hooks for data fetching
  4. Wire up the "Manage subscription" button to the Stripe billing portal URL
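
Under those conventions, the generated page plausibly ended up looking something like the sketch below (import paths, component primitives, and the assumption that getPortalUrl is exposed as a mutation are all illustrative):

"use client";

import { trpc } from "@/lib/trpc";                        // assumed path to the tRPC React client
import { Button, Card, Spinner } from "@/components/ui";  // assumed design-system primitives

export default function BillingPage({ teamId }: { teamId: string }) {
  const invoices = trpc.team.billing.getInvoices.useQuery({ teamId });
  // Assumes the portal endpoint is a mutation, since it creates a Stripe portal session
  const portal = trpc.team.billing.getPortalUrl.useMutation();

  if (invoices.isLoading) return <Spinner />;
  if (invoices.error) return <p>Could not load invoices. Please try again.</p>;

  const rows = invoices.data ?? [];

  return (
    <Card>
      <Button
        onClick={async () => {
          const { url } = await portal.mutateAsync({ teamId, returnUrl: window.location.href });
          window.open(url, "_blank"); // Stripe-hosted billing portal in a new tab
        }}
      >
        Manage subscription
      </Button>
      {rows.length === 0 ? (
        <p>No invoices yet.</p>
      ) : (
        <ul>
          {rows.map((inv) => (
            <li key={inv.id}>
              {inv.createdAt}: {inv.amount} {inv.currency} ({inv.status})
            </li>
          ))}
        </ul>
      )}
    </Card>
  );
}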

The first run overscoped the UI, adding plan upgrade/downgrade controls that weren't in scope. The orchestrator detected this by comparing the implementation diff against the approved design doc via a small "scope checker" agent running on Gemini 3.1 Flash Lite. That agent flagged out-of-scope elements, and the orchestrator prompted the Frontend Agent to remove them in a second pass.

Step 4: Tests and regression protection

The Test Agent used Gemini 3.1 Flash Lite for speed and cost. Its prompt emphasized:

  • Use existing test utilities; no new patterns without reason
  • Focus on RBAC, happy-path billing flows, and key regression points
  • Target ~80% route coverage for new endpoints

It generated:

  • Jest tests for team.billing.getInvoices and getPortalUrl
  • A Playwright test that:
    • Logs in as admin
    • Navigates to settings → billing
    • Checks invoice list renders
    • Asserts "Manage subscription" opens a Stripe-hosted page in a new tab
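
A trimmed-down version of that Playwright spec might read roughly as follows (selectors and the loginAsAdmin helper are assumptions):

import { test, expect } from "@playwright/test";
import { loginAsAdmin } from "./helpers/auth"; // assumed existing test utility

test("admin can view invoices and open the Stripe portal", async ({ page, context }) => {
  await loginAsAdmin(page);

  await page.goto("/settings/billing");
  await expect(page.getByRole("heading", { name: "Billing" })).toBeVisible();
  await expect(page.getByTestId("invoice-list")).toBeVisible(); // invoice list renders

  // "Manage subscription" should open the Stripe-hosted portal in a new tab
  const [portalPage] = await Promise.all([
    context.waitForEvent("page"),
    page.getByRole("button", { name: "Manage subscription" }).click(),
  ]);
  await expect(portalPage).toHaveURL(/billing\.stripe\.com/);
});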

Human review mostly addressed test flakiness around Stripe's sandbox behavior. Over time, the team added heuristics to the Test Agent to avoid relying on external network calls where mocks already existed.

Step 5: Infra and deployment

The Infra Agent used GPT-5 Codex with a Terraform-focused prompt and a narrower toolset that only accessed infra/ and GitHub Actions definitions. For this feature, it:

  • Updated environment variable definitions for new Stripe webhook URLs
  • Modified ECS task definitions to include extra secrets
  • Updated staging and production deployment workflows in GitHub Actions

Every infra change required human approval before merge, enforced by a protected-branch rule and a GitHub label needs-human-infra-review added automatically by the orchestrator whenever the Infra Agent touched infra/.
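
The labeling half of that rule is only a few lines with Octokit; a sketch assuming the orchestrator already knows which files a PR touches (the org, repo, and function name are placeholders):

import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

// Hypothetical hook the orchestrator runs after the Infra Agent opens a PR.
async function flagInfraChanges(prNumber: number, changedFiles: string[]) {
  if (!changedFiles.some((file) => file.startsWith("infra/"))) return;

  await octokit.issues.addLabels({
    owner: "relayboard",    // placeholder org
    repo: "app",            // placeholder repo
    issue_number: prNumber, // PRs share the issues numbering in the GitHub API
    labels: ["needs-human-infra-review"],
  });
}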

Step 6: Human review and production rollout

Founders reviewed the agent-generated PRs like they would review contributions from a junior engineer:

  • Scan design → backend → frontend → tests for coherence
  • Spot-check edge cases (RBAC, error handling, observability)
  • Trigger canary deployment to 10% of workspaces for 24 hours

Error rates and latency were monitored via Datadog dashboards that the Infra Agent had initially scaffolded and humans later refined. Once metrics stayed stable under real usage, the feature rolled out to 100% and became part of the standard product.

End-to-end calendar time: 2.5 days from initial spec to full production rollout. Net human time: roughly 4 hours of review and small changes.

For a closer look at the tools and patterns covered here, see our analysis in Case Study: How a SaaS Startup Cut Development Time by 60% Using OpenAI Codex, which covers the practical implementation details and trade-offs relevant to engineering teams shipping production AI systems.

Benchmarks, Costs, and Trade-offs vs Traditional Teams

The obvious question is whether this is actually better than hiring one or two more engineers. Relayboard tracked detailed metrics across their four-week build to compare:

  • Agent-assisted workflow (their real approach)
  • Counterfactual: a manual implementation trajectory based on founder historical output

Velocity and scope delivered

Over 28 days, the team logged:

  • Approx. 40k lines of code added (excluding generated type files)
  • ~600 commits, 70% initiated by agents
  • 96 merged PRs, 68 of which originated entirely from agents

For comparison, the founder-CTO's past output on a similar stack was ~400–600 lines of production code per day under optimal conditions. Accounting for context-switching, investor meetings, and customer calls, realistic manual output would have been closer to 10–15k lines in the same period, with a narrower feature set.

The effective throughput increase was roughly 3x, but with caveats: more time on review, more time debugging subtle issues, and a heavy up-front investment in the agent orchestration layer. Those 40k lines also included more churn: agents refactoring their own output, removing dead code, and iterating on tests.

Cost model: API vs headcount

API costs for the month, simplified using verified 2026 pricing (source):

Category | Model | Tokens (approx.) | Cost per 1M tokens (input/output) | Total cost (USD)
Spec + architecture | Claude Opus 4.7 | 80M | $5 / $25 | ~$1,000
Code generation | GPT-5 Codex | 220M | $1.25 / $10 | ~$900
Tests + refactors | Gemini 3.1 Flash Lite | 150M | $0.25 / $1.50 | ~$80
Prompt caching savings | Mixed | -100M (avoided) | – | ~-$500
Total | – | ~350M net | – | ~$1,500

All-in, API bills landed in the low single-digit thousands for the month. Add one-time engineering time to build the orchestrator (roughly two human-weeks) and ongoing maintenance (a few hours per week).

By contrast, hiring a single senior full-stack engineer in SF would have run $18k–$25k per month in cash comp during YC, plus equity. Contracting out the build at market rates would have been north of $50k–$80k for a comparable scope and polish.

Quality and bug profile

Quality was not "automatically handled." Bugs fell into three main classes:

  • Misaligned business logic – agents interpreted ambiguous specs too literally
  • Integration edge cases – especially around third-party APIs and webhook retries
  • Type drift – TypeScript types slowly diverged from reality when agents refactored code in pieces

Relayboard tracked defect density during the first production month:

Source | Bugs per 1k LOC (first 30 days) | Notes
Agent-authored code | ~0.9 | Higher share of minor UX/API mismatch issues
Human-authored code | ~0.6 | More complex but fewer cosmetic issues

The gap closed over time as the team hardened prompts, especially around schema changes and TypeScript types. A "schema guardian" agent (Claude Sonnet 4.6) was added later, whose only job was to compare any proposed schema diff against existing usage and suggest migration/test updates before merge.

When agents failed badly

There were concrete failure modes:

  • Overfitting to local patterns – agents copied early suboptimal decisions, making later refactors painful
  • Non-idempotent infra changes – Terraform edits that broke terraform plan until humans intervened
  • Hidden coupling – agents leaked assumptions across boundaries (e.g., relying on particular error message strings for control flow)

Agent workflows were explicitly disabled for:

  • Security-sensitive flows (auth, encryption, key management)
  • Data migrations that could destroy or corrupt production data
  • Anything with regulatory impact (GDPR deletion, audit logging)

In those areas, the team used agents only as pair programmers, suggesting code in an IDE or reviewing human-written drafts, but never with direct write access to the repo.

Latency vs. human pairing

Compared to a human junior engineer, agent round-trips were:

  • Slower on a single change (minutes vs. seconds) due to tool-calls and tests
  • Faster on bulk edits (e.g., rename a core type across 120 files)
  • Much faster on boilerplate-heavy tasks (forms, DTOs, simple CRUD)

Actual developer experience looked like this:

  • Founder writes a spec at 11pm
  • Orchestrator kicks off multi-agent workflow overnight
  • By morning, 1โ€“3 PRs exist, passing tests, waiting for review

Instead of "live" human pairing, Relayboard leaned into asynchronous collaboration with the agents, very similar to collaborating across time zones.

What This Means for Early-Stage Product Strategy


The Relayboard story is not a one-off curiosity; by 2026, YC's internal tooling already assumes teams will be AI-heavy by default. The question for a new startup is not whether to use AI coding agents, but how aggressively to treat agents as core team members versus glorified autocomplete.

When this approach makes sense

Agent-centric full-stack development is especially viable when:

  • Your product is CRUD-heavy SaaS with clear domain models and workflows
  • Your stack is conventional – React, Node, Rails, Django, Go REST, etc.
  • You can articulate UX and behavior clearly in text and simple diagrams
  • You're willing to treat prompts and orchestration as first-class infra

It is less attractive when:

  • You're pushing the boundary on systems-level performance (custom databases, zero-copy networking)
  • Your product surface area is small but correctness requirements are extreme (e.g., medical, financial trading engines)
  • Your senior engineers already ship at a very high velocity and resist additional abstraction layers

For many YC startups building internal tools, dashboards, and SaaS workflows, the upside dominates. For teams building a new kernel or on-chain protocol, agents are better kept in an assistive role.

Organizational implications

Treating models as first-class contributors forces changes to how you run engineering:

  • Specs over tickets – you write fewer JIRA tickets and more rich product docs with examples
  • Prompts as code – agent prompts live in the repo, versioned, reviewed, and tested
  • Git hygiene – agents can drown you in PRs unless you design batching and scoping carefully
  • Metrics on agents – track agent success rates, revert rates, and bug attribution explicitly

Relayboard instrumented their orchestrator to emit metrics to Datadog:

  • Success vs. failure per agent type
  • Average number of tool-calls and retries per task
  • Time from spec creation to PR ready
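
Emitting those metrics is a thin layer over a DogStatsD client; a sketch using hot-shots (the metric names and tags are illustrative):

import StatsD from "hot-shots"; // DogStatsD-compatible client that Datadog agents ingest

const metrics = new StatsD({ prefix: "orchestrator." });

// Hypothetical instrumentation points inside the orchestrator.
function recordTaskOutcome(agent: string, ok: boolean, toolCalls: number, retries: number) {
  metrics.increment(ok ? "task.success" : "task.failure", 1, { agent });
  metrics.histogram("task.tool_calls", toolCalls, { agent });
  metrics.histogram("task.retries", retries, { agent });
}

function recordSpecToPrLatency(agent: string, seconds: number) {
  metrics.histogram("task.spec_to_pr_seconds", seconds, { agent });
}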

This made it possible to debug not only the app but also the "engineering team" made of agents. They iterated on prompts the same way they tuned database indices or cache policies.

Designing your own agent stack

A minimal viable agent stack for a new YC team in 2026 might look like:

  1. Start with one orchestrator service that:
    • Knows how to call GPT-5 Codex (or the newer GPT-5.1-Codex / GPT-5.1-Codex-Max), Claude Opus 4.7, and Gemini 3.1 Flash Lite
    • Implements repo tools (read_file, write_file, list_files, run_tests)
    • Persists task state and logs to a Postgres table
  2. Define 2–3 agents to start:
    • One for backend, one for frontend, one for tests
    • Each with a clear developer prompt and stack conventions
  3. Wire into GitHub:
    • Agents open PRs under a bot account
    • Require one human review before merge
  4. Scope your first features tightly:
    • CRUD page, simple form, or dashboard with read-only data
    • Avoid multi-tenant auth or billing as first agent tasks
  5. Iterate on metrics:
    • Track how often humans have to rewrite agent code
    • Adjust prompts, temperatures, and model choices accordingly

A simple orchestrator loop in pseudo-TypeScript:

type AgentName = "frontend" | "backend" | "tests";

async function runTask(agent: AgentName, request: TaskRequest) {
  // Per-agent model, system/developer prompts, and allowed tools
  const config = getAgentConfig(agent);

  // Assemble system + developer + user messages for this specific task
  const messages = buildMessages(config, request);
  const response = await callLLM(config.model, {
    messages,
    tools: config.tools,
    tool_choice: "auto"
  });

  // Execute any tool calls (read_file, write_file, run_tests) and iterate until done
  await handleToolCallsAndIterations(response, config, request);
  await persistTaskResult(request.id, response);
}

Founders do not need a full "agent platform" to benefit. A 300–500 line orchestrator plus a handful of prompts is enough to turn a good LLM into a reliable teammate on the repo.

Risks, governance, and future direction

Several risks deserve explicit handling:

  • Data leakage – avoid sending secrets, production data, or PII to external APIs; use anonymization and test data
  • Model drift – when new model versions ship (GPT-5.1, GPT-5.2, GPT-5.3-Codex, Claude Opus 4.7, etc.), re-run a regression suite on your prompts
  • Vendor risk – avoid hard-coding everything around one model; keep interfaces thin and swappable

Relayboard mitigated model drift by pinning model versions in config and running nightly synthetic tasks as health checks. When a provider announced a deprecation or new default, the team tested new versions behind a feature flag on the orchestrator before rolling out.
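
The pinning itself is just configuration; a sketch of per-agent pins with a candidate version behind a flag (the structure and model ids are assumptions):

// Hypothetical per-agent model pinning. Nightly synthetic tasks run against the
// candidate when its flag is on; the pin only moves after those runs stay green.
export const modelConfig = {
  backend: {
    pinned: "gpt-5-codex-2025-09-23",
    candidate: "gpt-5.1-codex",
    useCandidate: process.env.ORCH_BACKEND_CANDIDATE === "1",
  },
  spec: {
    pinned: "claude-opus-4-7",
    candidate: null,
    useCandidate: false,
  },
} as const;

export function resolveModel(agent: keyof typeof modelConfig): string {
  const cfg = modelConfig[agent];
  return cfg.useCandidate && cfg.candidate ? cfg.candidate : cfg.pinned;
}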

Looking forward, the likely direction is tighter integration between:

  • Agent orchestration and CI/CD pipelines
  • Internal code search / RAG against your repo and design docs
  • IDE plugins that let humans โ€œhand offโ€ chunks of work to the orchestrator mid-flow

The YC batch after Relayboard already saw teams where the "default" way they shipped a full-stack app using AI was: spec in Notion → agent workflow → daily PR review. Human engineers focused on system design, product discovery, and the 20% of code where correctness and safety requirements exceed what current models can guarantee.

Frequently Asked Questions

Which AI coding agents did Relayboard use to ship their product?

Relayboard primarily used GPT-5 Codex and Claude Opus 4.7 for core implementation tasks. Gemini 3.1 Flash Lite served as a fast, low-cost agent for boilerplate generation and refactoring. Each model was treated as a specialized contributor with a narrow, well-defined interface rather than a single all-purpose coder.

How many lines of code did AI agents generate versus human engineers?

AI coding agents generated over 40,000 lines of code spanning frontend, backend, infrastructure, and tests. Human engineers wrote fewer than 900 lines directly. The human team focused on writing product specs, reviewing diffs, and resolving ambiguous product trade-offs rather than implementation.

What tech stack did the Relayboard team ship using AI agents?

The stack included Next.js 15 with React 19 and Tailwind CSS on the frontend, a tRPC and Node backend with Prisma and Postgres, Temporal for background workers, Stripe and Slack integrations, Playwright and Jest for testing, and AWS infrastructure provisioned via Terraform.

How did Relayboard structure their AI agents to avoid poor output quality?

They divided work across specialized agents: a Spec Agent for technical design, a Frontend Agent for React/Next.js, and a Backend Agent for APIs and data. This micro-team model with narrow interfaces significantly outperformed monolithic single-prompt approaches and kept outputs focused and reviewable.

What share of YC startups now use agentic AI workflows as primary developers?

Based on community reports and YC batch discussions, the majority of recent YC teams use AI assistance for the bulk of their first production app's code, and a meaningful minority describe agentic workflows as their primary developers rather than helpers, reflecting a real shift in how early-stage teams are structured.

How did the team keep AI agent costs below one senior engineer's salary?

Cost efficiency came from three practices: using prompt caching to avoid redundant token usage, leveraging tool-use APIs to reduce round-trips, and routing low-complexity tasks like boilerplate and refactoring to Gemini 3.1 Flash Lite ($0.25/$1.50 per M tokens) instead of more expensive frontier models like GPT-5 Codex ($1.25/$10) or Claude Opus 4.7 ($5/$25).
