Part 07 of 13

The AI Agent Writes Code

Why test-first is structural, not just good practice. The shallow assertion problem and how to detect it. Three commands, three fresh sessions, one machine per spec.

Why Test-First Matters More with AI Agents

Test-driven development is a well-known practice. But with AI agents, it is not just good practice. It is a structural necessity.

When a human writes code and then writes tests, the human has an independent mental model of what the code should do. The tests verify the code against that mental model. When an AI agent writes code and then writes tests, it has no independent mental model. It derives the tests from the code it just wrote. The tests verify that the code does what the code does: a tautology.

When tests are written first, in a separate session, from a spec that existed before any code did, they represent the intended behavior. The implementation must satisfy the tests, not the other way around.

Principle: Tests are the spec's voice during execution. The spec describes what should happen. The tests encode that description as executable assertions. The implementation is judged against the tests, which were derived from the spec, which was derived from the phase plan, which was derived from the PRD. The chain of custody from product intent to code verification is unbroken.

The Shallow Assertion Problem

Even with test-first development, AI agents tend to generate shallow assertions. A shallow assertion passes regardless of whether the code is correct.

// SHALLOW: Passes even if createIssue returns garbage
it('creates an issue', async () => {
  const result = await createIssue(db, { title: 'Bug fix' });
  expect(result).toBeDefined();        // Anything except undefined passes, even null
  expect(result.id).toBeTruthy();      // Any truthy passes
});

// DEEP: Fails if createIssue returns wrong data
it('inserts issue with correct fields and workspace isolation', async () => {
  expect.assertions(5);
  const result = await createIssue(db, {
    title: 'Bug fix',
    teamId: team.id,
    priority: 2,
  });
  expect(result.title).toBe('Bug fix');
  expect(result.teamId).toBe(team.id);
  expect(result.priority).toBe(2);
  expect(result.workspaceId).toBe(workspace.id);
  // Verify DB state independently (not just return value)
  const [row] = await db.select().from(issues)
    .where(eq(issues.id, result.id));
  expect(row.title).toBe('Bug fix');
});

Research note [R1]: Dakhel et al. (2024) introduced MuTAP, showing that LLM-generated tests improved to 93.57% mutation score when augmented with surviving mutant feedback, but required post-processing. Meta's ACH system (Foster et al., FSE 2025) deployed mutation-guided LLM test generation at scale: privacy engineers accepted 73% of generated tests, and the LLM equivalence detector achieved 0.95 precision / 0.96 recall with preprocessing. Both studies confirm: LLM-generated tests need structural guardrails to achieve production quality. See Research: R1.

Three-Layer Defense

The system addresses shallow assertions at three levels:

| Layer | When | Mechanism |
| --- | --- | --- |
| Prevention | Spec generation | Self-validation bans toBeDefined(), toBeTruthy(), toBeFalsy() as standalone assertions. Every async test requires expect.assertions(N). Every DB test verifies state independently. |
| Detection | ex-test execution | ESLint rules flag shallow matchers. Auto-fix replaces them with specific assertions. Agent cannot commit until ESLint passes. |
| Verification | ex-verify execution | StrykerJS mutates the implementation and re-runs tests. If a test still passes after a code mutation, the assertion is shallow. Target: >= 70% mutation score per spec. |
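To make the detection layer concrete, here is a minimal standalone sketch of what flagging shallow matchers amounts to. It is not the ESLint rule set the system uses: `findShallowAssertions` and `ShallowFinding` are hypothetical names, the scan is a plain regular expression rather than an AST rule, and it flags every use of the three matchers, which is stricter than banning them only as standalone assertions.

```typescript
// Hypothetical sketch: a text scan standing in for the ESLint rules.
// The matcher list comes from the Prevention row above.
const SHALLOW_MATCHERS = ["toBeDefined", "toBeTruthy", "toBeFalsy"] as const;

interface ShallowFinding {
  line: number;    // 1-based line number in the test source
  matcher: string; // the shallow matcher found on that line
}

// Report every call to a banned matcher in the given test source text.
function findShallowAssertions(source: string): ShallowFinding[] {
  const findings: ShallowFinding[] = [];
  const lines = source.split("\n");
  for (let i = 0; i < lines.length; i++) {
    // A fresh regex per line keeps the global-flag lastIndex state local.
    const pattern = new RegExp(
      "\\.(" + SHALLOW_MATCHERS.join("|") + ")\\s*\\(",
      "g"
    );
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(lines[i])) !== null) {
      findings.push({ line: i + 1, matcher: match[1] });
    }
  }
  return findings;
}

// Usage: the shallow test from earlier in this part yields two findings.
const shallowTestSource = [
  "const result = await createIssue(db, { title: 'Bug fix' });",
  "expect(result).toBeDefined();",
  "expect(result.id).toBeTruthy();",
].join("\n");
console.log(findShallowAssertions(shallowTestSource));
```

A real lint rule works on the syntax tree, so it can tell a standalone `toBeDefined()` from one that is followed by specific assertions; the text scan above cannot.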

The Three-Command Model

Execution is split into three commands, each running in a fresh AI coding agent session. This solves two problems simultaneously:

Memory leaks. AI coding agent processes (Claude Code, Cursor) grow from 200MB to 12-23GB over extended sessions. Fresh sessions reset to baseline.

Context degradation. After 3-4 auto-compactions, the agent loses nuance: forgets decisions, reimplements solved problems, makes inconsistent choices. Each command gets a clean context window.

Session 1: ex-test (RED)
  Model: Sonnet (fast, cheap)
  Reads: Block A only
  Output: Test files (all failing)
  Git: Commit locally

(fresh session)

Session 2: ex-impl (GREEN)
  Model: Sonnet (fast, cheap)
  Reads: Block B + test IDs
  Output: Implementation (green)
  Git: Squash, commit locally

(fresh session)

Session 3: ex-verify (VERIFY + PUSH)
  Model: Opus (deep reasoning)
  Reads: Spec + Phase + PRD
  Output: Verified code
  Git: Push to main
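The three sessions can also be written down as data, which makes the "only ex-verify pushes" rule checkable. This is a sketch under assumptions: the `CommandSpec` shape, the `pushes` flag, and the `runInFreshSession` callback are hypothetical names invented for illustration, not part of the actual tooling.

```typescript
type ModelTier = "fast" | "deep";

interface CommandSpec {
  name: "ex-test" | "ex-impl" | "ex-verify";
  phase: "RED" | "GREEN" | "VERIFY + PUSH";
  model: ModelTier;
  reads: string[];   // context the session is given
  output: string;
  pushes: boolean;   // hypothetical flag: does this command push to main?
}

const PIPELINE: CommandSpec[] = [
  { name: "ex-test", phase: "RED", model: "fast",
    reads: ["Block A"], output: "Test files (all failing)", pushes: false },
  { name: "ex-impl", phase: "GREEN", model: "fast",
    reads: ["Block B", "test IDs"], output: "Implementation (green)", pushes: false },
  { name: "ex-verify", phase: "VERIFY + PUSH", model: "deep",
    reads: ["Spec", "Phase", "PRD"], output: "Verified code", pushes: true },
];

// Names of the commands that are allowed to push to the shared codebase.
function pushingCommands(pipeline: CommandSpec[]): string[] {
  return pipeline.filter((c) => c.pushes).map((c) => c.name);
}

// Run each command through a caller-supplied callback that is assumed to
// start a brand-new agent session. Stops at the first failing command and
// returns the names of the commands that completed.
function runPipeline(
  pipeline: CommandSpec[],
  runInFreshSession: (command: CommandSpec) => boolean
): string[] {
  const completed: string[] = [];
  for (const command of pipeline) {
    if (!runInFreshSession(command)) break;
    completed.push(command.name);
  }
  return completed;
}
```

Because later commands never run when an earlier one fails, a failing red or green phase cannot reach the push in `ex-verify`.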

ex-test: Write Tests (Red Phase)

Reads Block A of the spec. Generates all test files. Runs shallow assertion linting. Verifies all new tests fail (expected) and all existing tests still pass.

What it does step by step:

  1. Rebase onto latest main.
  2. Create a throwaway database branch for test infrastructure validation.
  3. Push schema to the branch (verify tables compile).
  4. Generate all test files from Block A.
  5. Run ESLint shallow assertion check. Auto-fix violations.
  6. Run full test suite: new tests fail (expected), existing tests pass.
  7. Run code review on generated test files.
  8. Drop the database branch.
  9. Commit locally. Update tracking files.
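Step 6 is the check that defines the red phase: every new test must fail, and every existing test must still pass. A minimal sketch of that check follows, assuming test results are available as a list of IDs with pass/fail flags. `TestResult`, `RedPhaseReport`, and `checkRedPhase` are hypothetical names, and the test IDs are illustrative.

```typescript
interface TestResult {
  id: string;
  passed: boolean;
}

interface RedPhaseReport {
  ok: boolean;
  newTestsPassing: string[];      // new tests that pass before any implementation exists
  existingTestsFailing: string[]; // regressions in the existing suite
}

// Validate a full test run against the red-phase expectation.
function checkRedPhase(results: TestResult[], newTestIds: string[]): RedPhaseReport {
  const isNew = (id: string) => newTestIds.indexOf(id) !== -1;

  const newTestsPassing = results
    .filter((r) => isNew(r.id) && r.passed)
    .map((r) => r.id);
  const existingTestsFailing = results
    .filter((r) => !isNew(r.id) && !r.passed)
    .map((r) => r.id);

  return {
    ok: newTestsPassing.length === 0 && existingTestsFailing.length === 0,
    newTestsPassing,
    existingTestsFailing,
  };
}
```

A new test that passes before the implementation exists is a warning sign in its own right: it is either asserting nothing meaningful or testing behavior that already exists.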

ex-impl: Write Implementation (Green Phase)

Reads Block B of the spec. Executes tasks sequentially. After each task, runs the mapped tests. Fixes failures immediately. Never skips or weakens a test.
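The task loop described above can be sketched as follows. Everything here is an illustration under assumptions: `Task`, `AgentHooks`, and `runGreenPhase` are hypothetical names, the hooks stand in for whatever the agent actually does, and the `maxFixAttempts` cap is an added assumption so the loop terminates. The one rule taken directly from the text is that a failing test is never skipped: the loop raises an error instead of moving on.

```typescript
interface Task {
  id: string;
  testIds: string[]; // tests mapped to this task in the spec
}

interface AgentHooks {
  implement: (task: Task) => void;
  runTests: (testIds: string[]) => string[]; // returns IDs of failing tests
  fix: (task: Task, failing: string[]) => void;
}

// Execute tasks in order; after each one, run its mapped tests and fix
// failures before moving to the next task.
function runGreenPhase(tasks: Task[], hooks: AgentHooks, maxFixAttempts: number): string[] {
  const log: string[] = [];
  for (const task of tasks) {
    hooks.implement(task);
    let failing = hooks.runTests(task.testIds);
    let attempts = 0;
    while (failing.length > 0 && attempts < maxFixAttempts) {
      log.push(task.id + ": fixing " + failing.join(", "));
      hooks.fix(task, failing);
      failing = hooks.runTests(task.testIds);
      attempts++;
    }
    if (failing.length > 0) {
      // Never skip or weaken a test: stop instead of continuing.
      throw new Error(task.id + ": still failing: " + failing.join(", "));
    }
    log.push(task.id + ": green");
  }
  return log;
}
```

Running mapped tests after every task, rather than once at the end, keeps each failure attributable to the task that just ran.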

Runs Tier 1-2 verification before committing, then squashes all work into a single commit. The commit stays local; ex-impl does not push.

ex-verify: Quality Gate (Verify + Push)

This is the quality gate. It reads MORE context: the spec, the phase plan, AND the PRD. The agent approaches the code as a reviewer with no memory of implementation decisions.

What it does:

1. Mutation testing. StrykerJS mutates the implementation and re-runs tests. Mutations include: changing === to !==, removing conditional branches, swapping return values, deleting function calls. If a test still passes after a mutation, the assertion is shallow. If the score is below 70%, the agent analyzes surviving mutations and deepens assertions.

2. Exploratory testing (frontend/full-stack). The agent uses extended reasoning to generate 5-15 targeted adversarial scenarios before launching a headed browser. These are not generic tests ("click random things") but specific scenarios derived from the feature's data model and edge cases. While exploring, the agent monitors console output, network responses, and database state simultaneously.

3. Code review. With fresh context (no memory of implementation struggles), the agent reviews: spec compliance, test quality, pattern consistency, security (tenant isolation, input validation), performance (missing indexes, N+1 queries), and visual match to mock.

4. Migration generation. Generates canonical migration files from the schema changes (during execution, only schema push was used, not migration generation). Handles migration number conflicts from parallel specs.

5. Push to main. Rebases onto latest main, resolves conflicts, re-runs all tests, pushes. This is the ONLY command that pushes to the shared codebase.
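The mutation testing step is easier to see in miniature. StrykerJS generates and runs mutants automatically; the sketch below hand-writes four mutants of a toy function to show how a mutation score separates a shallow suite from a deep one. The function `isValidPriority`, its 0 to 4 range, and the helper `mutationScore` are all hypothetical and exist only for this illustration.

```typescript
type Impl = (priority: number) => boolean; // function under test
type TestFn = (impl: Impl) => boolean;     // returns true when the test passes

// Original implementation (hypothetical rule: priorities 0..4 are valid).
const isValidPriority: Impl = (p) => p >= 0 && p <= 4;

// Hand-written mutants, mimicking typical mutation operators.
const priorityMutants: Impl[] = [
  (p) => p > 0 && p <= 4,  // >= changed to >
  (p) => p >= 0 && p < 4,  // <= changed to <
  (p) => p >= 0 || p <= 4, // && changed to ||
  () => true,              // body replaced with a constant
];

// Shallow suite: the equivalent of expect(result).toBeDefined().
const shallowSuite: TestFn[] = [(impl) => impl(2) !== undefined];

// Deep suite: exact expected values at and beyond both boundaries.
const deepSuite: TestFn[] = [
  (impl) => impl(0) === true,
  (impl) => impl(4) === true,
  (impl) => impl(-1) === false,
  (impl) => impl(5) === false,
];

// A mutant is killed when at least one test in the suite fails against it.
function mutationScore(suite: TestFn[], candidates: Impl[]): number {
  const killed = candidates.filter((m) => suite.some((t) => !t(m))).length;
  return killed / candidates.length;
}

console.log(mutationScore(shallowSuite, priorityMutants)); // 0
console.log(mutationScore(deepSuite, priorityMutants));    // 1
```

Both suites pass against the original implementation, so a green test run cannot tell them apart. Only the mutation score shows that the shallow suite would fall far below the 70% target.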

Nothing reaches the shared codebase until ex-verify passes. Other instances never pull incomplete or unverified work. The push is the seal of approval.

Model Routing for Cost Optimization

Not every command needs the most capable model. Route by purpose:

| Command | Model Tier | Rationale |
| --- | --- | --- |
| ex-test, ex-impl, generate-spec, verify-spec | Fast + cheap (e.g., Sonnet) | Well-defined pattern translation. Speed matters. 90%+ of Opus capability at lower cost. |
| ex-verify, generate-phase, verify-prd, apply-learnings | Deep reasoning (e.g., Opus) | Cross-referencing, adversarial exploration, trend analysis. Quality over speed. |

This reduces costs by 60-80% compared to using the most capable model for everything, while concentrating reasoning power where it has the highest impact.
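How much routing saves depends on the price gap between tiers and on how token usage splits across commands. The sketch below encodes the routing table as a lookup and computes the saving relative to sending everything to the deep tier. The prices and token counts in the usage example are illustrative placeholders, not measured figures or real pricing, and `routedSaving` is a hypothetical helper.

```typescript
type Tier = "fast" | "deep";

// Command-to-tier routing, taken from the table above.
const ROUTING: { [command: string]: Tier } = {
  "ex-test": "fast",
  "ex-impl": "fast",
  "generate-spec": "fast",
  "verify-spec": "fast",
  "ex-verify": "deep",
  "generate-phase": "deep",
  "verify-prd": "deep",
  "apply-learnings": "deep",
};

interface Usage {
  command: string;
  tokens: number;
}

// Fractional saving of routed cost versus running every command on the deep tier.
function routedSaving(usage: Usage[], pricePerToken: { fast: number; deep: number }): number {
  let routed = 0;
  let allDeep = 0;
  for (const u of usage) {
    const tier = ROUTING[u.command];
    if (tier === undefined) throw new Error("Unknown command: " + u.command);
    routed += u.tokens * pricePerToken[tier];
    allDeep += u.tokens * pricePerToken.deep;
  }
  return allDeep === 0 ? 0 : 1 - routed / allDeep;
}

// Illustrative only: assume the deep tier costs 5x the fast tier and
// 80% of tokens are spent in ex-test and ex-impl.
const exampleSaving = routedSaving(
  [
    { command: "ex-test", tokens: 40 },
    { command: "ex-impl", tokens: 40 },
    { command: "ex-verify", tokens: 20 },
  ],
  { fast: 1, deep: 5 }
);
console.log(exampleSaving);
```

Under those assumed numbers the routed cost is 180 units against 500 for all-deep, a saving of 64%. A smaller price gap or a heavier verify phase would lower that figure.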

Linear Example: Execution Flow for issue.create

ex-test: Generates 6 tests for issue.create: valid creation with all fields, creation with minimal fields (title only), validation failure (title too long), validation failure (invalid teamId), workspace isolation (create in workspace A, verify invisible from workspace B), event emission (issue.created event dispatched after insert).

ex-impl: Creates the router procedure. After each sub-task, runs the relevant tests. Fix cycle: initial implementation misses the event emission. Test A1-T6 fails. Agent adds event dispatch. Test passes. Tier 1-2 clean.

ex-verify: Mutation testing: changes eq(issues.workspaceId, ctx.workspaceId) to eq(issues.workspaceId, 'fixed-id'). Test A1-T5 (isolation) catches the mutation. Score: 83%. Exploratory: creates an issue with Unicode title (emoji, RTL text, zero-width characters). All render correctly. Pushes to main.