The AI Agent Writes Code
Why test-first is structural, not just good practice. The shallow assertion problem and how to detect it. Three commands, three fresh sessions, one machine per spec.
Why Test-First Matters More with AI Agents
Test-driven development is a well-known practice. With AI agents, though, it is more than good practice: it is a structural necessity.
When a human writes code and then writes tests, the human has an independent mental model of what the code should do. The tests verify the code against that mental model. When an AI agent writes code and then writes tests, it has no independent mental model. It derives the tests from the code it just wrote. The tests verify that the code does what the code does: a tautology.
When tests are written first, in a separate session, from a spec that was written before any code existed, they represent the intended behavior. The implementation must satisfy the tests, not the other way around.
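The tautology can be made concrete with a small sketch. Everything in it is hypothetical and chosen only for illustration: the `applyDiscount` function, its bug, and the one-line spec it is checked against.

```typescript
// Hypothetical spec line: "applyDiscount(price, percent) reduces price by percent%."
// The implementation below has a bug: it subtracts the percent as an absolute amount.
function applyDiscount(price: number, percent: number): number {
  return price - percent; // BUG: should be price * (1 - percent / 100)
}

// Code-derived check: the expected value is copied from what the code does.
function codeDerivedCheck(): boolean {
  const expected = 50 - 20; // mirrors the implementation, bug included
  return applyDiscount(50, 20) === expected;
}

// Spec-derived check: the expected value comes from the spec, written before the code.
function specDerivedCheck(): boolean {
  const expected = 40; // 20% off 50, per the spec
  return applyDiscount(50, 20) === expected;
}

console.log("code-derived check passes:", codeDerivedCheck()); // true
console.log("spec-derived check passes:", specDerivedCheck()); // false
```

The code-derived check passes because it encodes the same mistake as the implementation. Only the check whose expected value comes from outside the code can fail.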
The Shallow Assertion Problem
Even with test-first development, AI agents tend to generate shallow assertions. A shallow assertion passes regardless of whether the code is correct.
```typescript
// Assumes `db`, `issues`, `eq`, `createIssue`, and the `team` / `workspace`
// fixtures are already in scope.

// SHALLOW: Passes even if createIssue returns garbage
it('creates an issue', async () => {
  const result = await createIssue(db, { title: 'Bug fix' });
  expect(result).toBeDefined(); // Anything except undefined passes (even null)
  expect(result.id).toBeTruthy(); // Any truthy value passes
});

// DEEP: Fails if createIssue returns wrong data
it('inserts issue with correct fields and workspace isolation', async () => {
  expect.assertions(5);
  const result = await createIssue(db, {
    title: 'Bug fix',
    teamId: team.id,
    priority: 2,
  });
  expect(result.title).toBe('Bug fix');
  expect(result.teamId).toBe(team.id);
  expect(result.priority).toBe(2);
  expect(result.workspaceId).toBe(workspace.id);

  // Verify DB state independently (not just the return value)
  const [row] = await db.select().from(issues)
    .where(eq(issues.id, result.id));
  expect(row.title).toBe('Bug fix');
});
```
Three-Layer Defense
The system addresses shallow assertions at three levels:
| Layer | When | Mechanism |
|---|---|---|
| Prevention | Spec generation | Self-validation bans toBeDefined(), toBeTruthy(), toBeFalsy() as standalone assertions. Every async test requires expect.assertions(N). Every DB test verifies state independently. |
| Detection | ex-test execution | ESLint rules flag shallow matchers. Auto-fix replaces them with specific assertions. Agent cannot commit until ESLint passes. |
| Verification | ex-verify execution | StrykerJS mutates the implementation and re-runs tests. If every test still passes after a code mutation (the mutant survives), the assertions covering that code are shallow. Target: >= 70% mutation score per spec. |
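The detection layer can be approximated with a few lines of code. The sketch below is a simplified stand-in for the ESLint rules, not the rules themselves: it scans test source line by line with a regular expression, and it flags every use of a banned matcher rather than only standalone ones. A real lint rule would inspect the AST. The names `findShallowAssertions` and `ShallowHit` are hypothetical.

```typescript
// Matchers the table above lists as banned when used as standalone assertions.
const SHALLOW_MATCHERS = ["toBeDefined", "toBeTruthy", "toBeFalsy"] as const;

interface ShallowHit {
  line: number;    // 1-based line number in the test source
  matcher: string; // which banned matcher was found
}

// Simplification: reports any call to a banned matcher, even when the test
// also contains specific assertions.
function findShallowAssertions(source: string): ShallowHit[] {
  const pattern = new RegExp(`\\.(${SHALLOW_MATCHERS.join("|")})\\(\\s*\\)`);
  const hits: ShallowHit[] = [];
  source.split("\n").forEach((text, index) => {
    const name = pattern.exec(text)?.[1];
    if (name) hits.push({ line: index + 1, matcher: name });
  });
  return hits;
}

const sample = [
  "it('creates an issue', async () => {",
  "  const result = await createIssue(db, { title: 'Bug fix' });",
  "  expect(result).toBeDefined();",
  "  expect(result.id).toBeTruthy();",
  "  expect(result.title).toBe('Bug fix');",
  "});",
].join("\n");

// Reports toBeDefined on line 3 and toBeTruthy on line 4; toBe(...) is not flagged.
console.log(findShallowAssertions(sample));
```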
The Three-Command Model
Execution is split into three commands, each running in a fresh AI coding agent session. This solves two problems simultaneously:
- Memory leaks. AI coding agent processes (Claude Code, Cursor) grow from 200 MB to 12-23 GB over extended sessions. Fresh sessions reset to baseline.
- Context degradation. After 3-4 auto-compactions, the agent loses nuance: forgets decisions, reimplements solved problems, makes inconsistent choices. Each command gets a clean context window.
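One way to picture the model is a driver that runs the three commands (ex-test, ex-impl, ex-verify) in order, each in its own session, and stops at the first failure. This is only a sketch: `SessionRunner` is a hypothetical callback, how a fresh agent session is actually launched is not shown, and a real runner would almost certainly be asynchronous.

```typescript
type Command = "ex-test" | "ex-impl" | "ex-verify";

// Hypothetical: starts a brand-new agent process for one command and reports
// whether it succeeded. The launch mechanism depends on the agent tooling.
type SessionRunner = (command: Command, specId: string) => boolean;

const COMMAND_ORDER: readonly Command[] = ["ex-test", "ex-impl", "ex-verify"];

// Runs the commands in order, one fresh session each, stopping at the first failure.
// Returns the commands that completed successfully.
function runSpec(specId: string, runInFreshSession: SessionRunner): Command[] {
  const completed: Command[] = [];
  for (const command of COMMAND_ORDER) {
    if (!runInFreshSession(command, specId)) break;
    completed.push(command);
  }
  return completed;
}

// Usage with a stub runner that fails during ex-impl: ex-verify never starts.
const completed = runSpec("spec-001", (command) => command !== "ex-impl");
console.log(completed); // [ 'ex-test' ]
```

Because no state is shared between sessions except what is committed or written to tracking files, each command starts from baseline memory and a clean context window.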
ex-test: Write Tests (Red Phase)
Reads Block A of the spec. Generates all test files. Runs shallow assertion linting. Verifies all new tests fail (expected) and all existing tests still pass.
What it does step by step:
- Rebase onto latest main.
- Create a throwaway database branch for test infrastructure validation.
- Push schema to the branch (verify tables compile).
- Generate all test files from Block A.
- Run ESLint shallow assertion check. Auto-fix violations.
- Run full test suite: new tests fail (expected), existing tests pass.
- Run code review on generated test files.
- Drop the database branch.
- Commit locally. Update tracking files.
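The red-phase check in the steps above (new tests fail, existing tests pass) can be sketched as a small classifier over test results. The result shape and the `checkRedPhase` name are assumptions made for illustration; a real run would read these results from the test runner's report.

```typescript
interface TestResult {
  id: string;      // hypothetical test identifier, e.g. "A1-T1"
  passed: boolean;
}

interface RedPhaseReport {
  ok: boolean;
  unexpectedPasses: string[]; // new tests that pass before any implementation exists
  regressions: string[];      // existing tests that now fail
}

function checkRedPhase(results: TestResult[], newTestIds: Set<string>): RedPhaseReport {
  const unexpectedPasses: string[] = [];
  const regressions: string[] = [];
  for (const { id, passed } of results) {
    const isNew = newTestIds.has(id);
    if (isNew && passed) unexpectedPasses.push(id);
    if (!isNew && !passed) regressions.push(id);
  }
  const ok = unexpectedPasses.length === 0 && regressions.length === 0;
  return { ok, unexpectedPasses, regressions };
}

const report = checkRedPhase(
  [
    { id: "A1-T1", passed: false },    // new, fails as expected
    { id: "A1-T2", passed: true },     // new, but passes with no implementation
    { id: "legacy-1", passed: true },  // existing, still passes
    { id: "legacy-2", passed: false }, // existing, broken by the new files
  ],
  new Set(["A1-T1", "A1-T2"]),
);
console.log(report.ok, report.unexpectedPasses, report.regressions);
```

A new test that passes before any implementation exists is worth a closer look: it is often a sign that its assertions are shallow.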
ex-impl: Write Implementation (Green Phase)
Reads Block B of the spec. Executes tasks sequentially. After each task, runs the mapped tests. Fixes failures immediately. Never skips or weakens a test.
Tier 1-2 verification before committing:
- Tier 1 (automated): Full test suite, typecheck, lint, security scan, dependency allow-list, hardcoded values scan, console log check, backend log check, database state check.
- Tier 2 (self-review): All tasks completed? All files match spec? Patterns consistent with referenced files? Visual match to mock (frontend)?
Squashes all work into a single commit. Stays local. Does not push.
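The task loop described above (implement, run the mapped tests, fix, repeat) might look roughly like this. The `Hooks` interface, the `maxFixAttempts` cap, and the task shape are all assumptions for the sake of the sketch; the source does not say how many fix attempts the agent makes.

```typescript
interface Task {
  name: string;
  mappedTests: string[]; // test ids the spec maps to this task
}

// Hypothetical hooks standing in for the agent's actions.
interface Hooks {
  implement: (task: Task) => void;
  runTests: (testIds: string[]) => string[]; // returns the ids of failing tests
  fix: (task: Task, failing: string[]) => void;
}

// Executes tasks sequentially. After each task, runs only the mapped tests and
// attempts fixes until they pass. Tests are never skipped or edited here.
function executeTasks(tasks: Task[], hooks: Hooks, maxFixAttempts = 3): void {
  for (const task of tasks) {
    hooks.implement(task);
    let failing = hooks.runTests(task.mappedTests);
    for (let attempt = 0; failing.length > 0 && attempt < maxFixAttempts; attempt++) {
      hooks.fix(task, failing);
      failing = hooks.runTests(task.mappedTests);
    }
    if (failing.length > 0) {
      throw new Error(`Task "${task.name}" still fails: ${failing.join(", ")}`);
    }
  }
}

// Usage with stubs: the first implementation misses something, one fix resolves it.
let fixed = false;
const log: string[] = [];
executeTasks([{ name: "create procedure", mappedTests: ["A1-T6"] }], {
  implement: (task) => { log.push(`implement ${task.name}`); },
  runTests: (ids) => (fixed ? [] : ids),
  fix: (_task, failing) => { fixed = true; log.push(`fix ${failing.join(",")}`); },
});
console.log(log);
```

The loop only ever changes the implementation side. If the mapped tests cannot be made to pass, it stops with an error instead of weakening a test.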
ex-verify: Quality Gate (Verify + Push)
This is the quality gate. It reads MORE context: the spec, the phase plan, AND the PRD. The agent approaches the code as a reviewer with no memory of implementation decisions.
What it does:
1. Mutation testing. StrykerJS mutates the implementation and re-runs tests. Mutations include: changing === to !==, removing conditional branches, swapping return values, deleting function calls. If every test still passes after a mutation, the assertions covering that code are shallow. If the mutation score is below 70%, the agent analyzes the surviving mutants and deepens the assertions.
2. Exploratory testing (frontend/full-stack). The agent uses extended reasoning to generate 5-15 targeted adversarial scenarios before launching a headed browser. These are not generic tests ("click random things") but specific scenarios derived from the feature's data model and edge cases. While exploring, the agent monitors console output, network responses, and database state simultaneously.
3. Code review. With fresh context (no memory of implementation struggles), the agent reviews: spec compliance, test quality, pattern consistency, security (tenant isolation, input validation), performance (missing indexes, N+1 queries), and visual match to mock.
4. Migration generation. Generates canonical migration files from the schema changes (during execution, only schema push was used, not migration generation). Handles migration number conflicts from parallel specs.
5. Push to main. Rebases onto latest main, resolves conflicts, re-runs all tests, pushes. This is the ONLY command that pushes to the shared codebase.
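The idea behind the mutation score in step 1 can be shown without any tooling. The sketch below is not how StrykerJS works internally: the mutants are written by hand, the "suites" are plain functions, and `isGreater` semantics are invented for the example. It only illustrates why a shallow suite scores low and a specific one scores high.

```typescript
type Impl = (a: number, b: number) => boolean;
type Suite = (impl: Impl) => boolean; // true when every assertion passes

// Implementation under test: is `a` strictly greater than `b`?
const original: Impl = (a, b) => a > b;

// Hand-written mutants standing in for what a mutation tool would generate.
const mutants: Impl[] = [
  (a, b) => a >= b, // boundary changed
  (a, b) => a < b,  // operator flipped
  () => true,       // return value replaced
  () => false,      // return value replaced
];

// Shallow suite: only checks that a boolean comes back.
const shallowSuite: Suite = (impl) => typeof impl(2, 1) === "boolean";

// Deeper suite: pins down the behavior, including the boundary case.
const deepSuite: Suite = (impl) =>
  impl(2, 1) === true && impl(1, 2) === false && impl(1, 1) === false;

// A mutant is "killed" when the suite fails against it.
function mutationScore(suite: Suite, candidates: Impl[]): number {
  const killed = candidates.filter((mutant) => !suite(mutant)).length;
  return (killed / candidates.length) * 100;
}

const THRESHOLD = 70; // the target stated in the text

console.log("both suites pass on the original:", shallowSuite(original), deepSuite(original));
console.log("shallow score:", mutationScore(shallowSuite, mutants)); // 0
console.log("deep score:", mutationScore(deepSuite, mutants));       // 100
console.log("shallow meets threshold:", mutationScore(shallowSuite, mutants) >= THRESHOLD); // false
```

Both suites are green against the original code, so a plain test run cannot tell them apart. Only the mutants expose the difference.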
Model Routing for Cost Optimization
Not every command needs the most capable model. Route by purpose:
| Command | Model Tier | Rationale |
|---|---|---|
| ex-test, ex-impl, generate-spec, verify-spec | Fast + cheap (e.g., Sonnet) | Well-defined pattern translation. Speed matters. 90%+ of Opus capability at lower cost. |
| ex-verify, generate-phase, verify-prd, apply-learnings | Deep reasoning (e.g., Opus) | Cross-referencing, adversarial exploration, trend analysis. Quality over speed. |
This reduces costs by 60-80% compared to using the most capable model for everything, while concentrating reasoning power where it has the highest impact.
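The routing table translates directly into a lookup. This is a minimal sketch: the tier names `fast` and `deep` are labels chosen here, and mapping a tier to a concrete model is left out because it depends on the provider and configuration.

```typescript
type ModelTier = "fast" | "deep";

type Command =
  | "ex-test" | "ex-impl" | "generate-spec" | "verify-spec"
  | "ex-verify" | "generate-phase" | "verify-prd" | "apply-learnings";

// One entry per command. Record<Command, ModelTier> makes the compiler reject
// a command that has no route.
const MODEL_ROUTES: Record<Command, ModelTier> = {
  "ex-test": "fast",
  "ex-impl": "fast",
  "generate-spec": "fast",
  "verify-spec": "fast",
  "ex-verify": "deep",
  "generate-phase": "deep",
  "verify-prd": "deep",
  "apply-learnings": "deep",
};

function tierFor(command: Command): ModelTier {
  return MODEL_ROUTES[command];
}

console.log(tierFor("ex-impl"));   // fast
console.log(tierFor("ex-verify")); // deep
```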
Example: issue.create Through the Three Commands
ex-test: Generates 6 tests for issue.create: valid creation with all fields, creation with minimal fields (title only), validation failure (title too long), validation failure (invalid teamId), workspace isolation (create in workspace A, verify invisible from workspace B), event emission (issue.created event dispatched after insert).
ex-impl: Creates the router procedure. After each sub-task, runs the relevant tests. Fix cycle: initial implementation misses the event emission. Test A1-T6 fails. Agent adds event dispatch. Test passes. Tier 1-2 clean.
ex-verify: Mutation testing: changes eq(issues.workspaceId, ctx.workspaceId) to eq(issues.workspaceId, 'fixed-id'). Test A1-T5 (isolation) catches the mutation. Score: 83%. Exploratory: creates an issue with Unicode title (emoji, RTL text, zero-width characters). All render correctly. Pushes to main.