Part 11 of 13

Preventing Drift

Architectural fitness functions as tests. Pattern enforcement that scales. Consolidation budget per milestone. Comprehension checkpoints. Metrics that drive process improvement.

Architectural Fitness Functions

Code rules in a markdown file are suggestions. Code rules as executable tests are enforced constraints. Architectural fitness functions turn structural rules into tests that run on every commit.

What Fitness Functions Enforce

// Example: ArchUnitTS-style fitness tests

describe('Architecture rules', () => {

  // Apps never import from other apps
  it('apps do not cross-import', () => {
    projectFiles()
      .inFolder('apps/web')
      .shouldNot()
      .dependOnFiles()
      .inFolder('apps/admin');
  });

  // Only the DB package talks to the ORM
  it('only packages/db imports the ORM', () => {
    projectFiles()
      .notInFolder('packages/db')
      .shouldNot()
      .dependOn('drizzle-orm');
  });

  // Only the AI package talks to LLM providers
  it('only packages/ai imports LLM SDKs', () => {
    projectFiles()
      .notInFolder('packages/ai')
      .shouldNot()
      .dependOn('openai')
      .and()
      .shouldNot()
      .dependOn('@anthropic-ai/sdk');
  });

  // UI components only from the shared package
  it('apps do not directly import headless UI', () => {
    projectFiles()
      .inFolder('apps/')
      .matching('*.tsx')
      .shouldNot()
      .dependOn('@radix-ui');
  });
});

Research note [R6]: ArchUnitTS (and ArchUnit for Java) provides a key safety property: if a rule matches zero files (typo in path, package renamed), the test fails instead of silently passing. This prevents false confidence. A rule that matches nothing is as dangerous as no rule at all. See Research: R6.
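
The zero-match property can be illustrated without the library. What follows is a minimal sketch under stated assumptions, not ArchUnitTS's implementation: `SourceFile`, `RuleResult`, and `checkNoImport` are hypothetical names, and the import lists are hard-coded where a real tool would parse the source tree.

```typescript
// Hypothetical, simplified model of a source file: its path and the modules it imports.
interface SourceFile {
  path: string;
  imports: string[];
}

interface RuleResult {
  passed: boolean;
  violations: string[];
}

// Checks that no file under `folder` imports `forbidden` (or one of its subpaths).
// The rule fails when the folder matches zero files, so a typo cannot pass silently.
function checkNoImport(files: SourceFile[], folder: string, forbidden: string): RuleResult {
  const matched = files.filter((f) => f.path.startsWith(folder));
  if (matched.length === 0) {
    return { passed: false, violations: [`rule matched zero files in "${folder}"`] };
  }
  const violations = matched
    .filter((f) => f.imports.some((i) => i === forbidden || i.startsWith(forbidden + '/')))
    .map((f) => `${f.path} imports ${forbidden}`);
  return { passed: violations.length === 0, violations };
}

// Illustrative data only.
const files: SourceFile[] = [
  { path: 'apps/web/router.ts', imports: ['drizzle-orm', 'zod'] },
  { path: 'packages/db/client.ts', imports: ['drizzle-orm'] },
];

const realViolation = checkNoImport(files, 'apps/web/', 'drizzle-orm'); // fails: real import
const typoInPath = checkNoImport(files, 'apps/wbe/', 'drizzle-orm');    // fails: matches nothing
const cleanRule = checkNoImport(files, 'apps/web/', 'openai');          // passes
```

Without the `matched.length === 0` branch, the misspelled `apps/wbe/` rule would report success while checking nothing.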

These tests run in CI alongside unit tests. An AI agent that accidentally imports the ORM directly from a router is caught automatically the next time the suite runs, with no human review needed.

Consolidation Budget

Every milestone allocates 15-20% of its time for consolidation. This is not optional. It is a line item in the build plan.

What Consolidation Covers

Deduplication scan. Find duplicate code across all units. Common culprits: utility functions, type definitions, validation logic, error messages.
Naming consistency pass. The same concept must use the same name everywhere. If three routers call it "workspace_id" but one calls it "org_id," standardize.
Pattern consistency pass. Similar operations (CRUD, error handling, validation) follow the same pattern. Pick the best implementation and standardize.
Complexity check. Run complexity metrics per package. Flag files above threshold. Refactor.
Documentation alignment. Update foundation docs (code rules, architecture, component plan) to reflect what was actually built.
ADR review. Read every decision in phase plans from this milestone. Verify each still makes sense given what was actually built.
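
As one illustration of the deduplication scan, the sketch below groups code snippets whose bodies are identical once whitespace is collapsed. It is a deliberately naive assumption about how such a scan might start: `Snippet`, `normalize`, and `findDuplicates` are hypothetical names, the snippets are hard-coded, and renamed variables or reordered statements are not detected.

```typescript
// Hypothetical input: one snippet (for example a function body) and where it lives.
interface Snippet {
  file: string;
  name: string;
  body: string;
}

// Collapse runs of whitespace so formatting differences do not hide duplicates.
function normalize(body: string): string {
  return body.replace(/\s+/g, ' ').trim();
}

// Group snippets by normalized body and return only groups with two or more members.
function findDuplicates(snippets: Snippet[]): Snippet[][] {
  const groups = new Map<string, Snippet[]>();
  for (const s of snippets) {
    const key = normalize(s.body);
    const group = groups.get(key) ?? [];
    group.push(s);
    groups.set(key, group);
  }
  return Array.from(groups.values()).filter((g) => g.length > 1);
}

// Illustrative data only: the first two bodies differ in whitespace alone.
const snippets: Snippet[] = [
  { file: 'apps/web/utils.ts', name: 'slugify', body: 'const t = s.trim();\nreturn t.toLowerCase();' },
  { file: 'apps/admin/helpers.ts', name: 'toSlug', body: 'const t = s.trim();\n    return   t.toLowerCase();' },
  { file: 'packages/db/dates.ts', name: 'toIso', body: 'return new Date(d).toISOString();' },
];

const duplicates = findDuplicates(snippets);
```

Each returned group is a candidate for a single shared utility, which would then become a specific refactoring task in the consolidation specs.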

Consolidation follows the same pipeline: it gets a phase plan (generated by scanning the milestone's code) and specs (specific refactoring tasks). Tests run after every change. This is not ad hoc cleanup. It is structured improvement.

Comprehension Checkpoints

The checkpoint comes after consolidation and before the integration test. This is a human exercise. No AI involved.

Write a document answering these questions:

What can the user do now that they could not before this milestone? (3-5 sentences, in your own words)
What are the main components built and how do they connect? (No copying from PRDs. Explain it.)
If you had to explain the system to a new CTO, what would you say?
What parts of the codebase are you least confident about? (Honest assessment.)
What would you change if starting this milestone over? (Hindsight learnings.)

If you cannot answer these without looking at code, you have comprehension debt. Slow down and rebuild your mental model before adding more complexity. AI agents build fast. Understanding what they built takes deliberate effort. A founder who cannot explain their own system cannot make good product decisions about it.

The Feedback Loop: apply-learnings

Learnings are applied, not accumulated. There is no LEARNINGS.md that grows forever. After every spec execution, a dedicated command reads the execution findings and proposes specific changes to actual documents.

Six Categories of Changes

Category	What Changes	Example
Phase plan updates	Add missing indexes, update schemas, refine API contracts	"Unread count query needed a composite index not specified in the phase plan. Add to Section 6."
Spec updates	Fix test cases, add assertions, update file paths	"Test A1-T3 was missing a negative assertion for empty input. Add."
Template updates	If the same issue recurs, update the template to prevent it	"Three specs in a row missed tenant isolation tests. Add to template checklist."
Command updates	If a command consistently misses something	"ex-test not catching shallow assertions in array.length checks. Add ESLint rule."
New specs	Missing functionality discovered during execution	"Notification preferences need their own API. Create spec M3-U06-P1-S3."
Mock updates	Implementation correctly deviates from mock	"Loading skeleton timing looks better at 75ms stagger, not 50ms. Update mock."

Every change is proposed as a diff: old text, new text, reason. The human approves each one. Specs that were already written but not yet executed are flagged for re-verification.
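
A proposal of this shape (old text, new text, reason, plus human approval) could be modeled as below. This is a sketch under assumptions, not the actual apply-learnings command: the type names, the `approved` flag, the file path, and the index names are all hypothetical, and creating brand-new files (the "New specs" category) is out of scope here.

```typescript
type ChangeCategory = 'phase-plan' | 'spec' | 'template' | 'command' | 'new-spec' | 'mock';

interface ProposedChange {
  category: ChangeCategory;
  target: string;    // hypothetical path of the document to edit
  oldText: string;
  newText: string;
  reason: string;
  approved: boolean; // set by the human reviewer, never by the agent
}

// Applies a change only if it was approved and `oldText` occurs exactly once,
// so a stale or ambiguous proposal is rejected instead of editing the wrong place.
function applyChange(doc: string, change: ProposedChange): string {
  if (!change.approved) return doc;
  const first = doc.indexOf(change.oldText);
  if (first === -1) throw new Error(`old text not found in ${change.target}`);
  if (doc.indexOf(change.oldText, first + 1) !== -1) {
    throw new Error(`old text is ambiguous in ${change.target}`);
  }
  return doc.slice(0, first) + change.newText + doc.slice(first + change.oldText.length);
}

// Illustrative data only.
const phasePlan = 'Section 6: Indexes\n- notifications(user_id)\n';
const proposal: ProposedChange = {
  category: 'phase-plan',
  target: 'docs/phase-plan.md',
  oldText: '- notifications(user_id)\n',
  newText: '- notifications(user_id)\n- notifications(user_id, read_at)\n',
  reason: 'Unread count query needed a composite index not specified in the phase plan.',
  approved: false,
};

const beforeApproval = applyChange(phasePlan, proposal);                      // unchanged
const afterApproval = applyChange(phasePlan, { ...proposal, approved: true }); // index added
```

Requiring an exact, unique match on the old text means a proposal written against an outdated document fails loudly rather than being applied in the wrong place.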

Metrics-Driven Process Improvement

The system tracks execution metrics in JSON format with two layers: detailed per-spec logs and aggregated per-milestone summaries.

What to Track Per Spec

Duration per command (ex-test, ex-impl, ex-verify).
Fix-and-retest cycles during ex-impl.
Mutation score (initial and final) from ex-verify.
Exploratory testing issues found (critical, major, minor).
Deviations from the spec (things done differently than planned).
Human interventions (times the agent stopped for human input).
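
One possible shape for a per-spec log entry covering the items above is sketched here. The field names, units, and all values are assumptions for illustration; the source does not specify the actual JSON schema.

```typescript
type CommandName = 'ex-test' | 'ex-impl' | 'ex-verify';

// Hypothetical per-spec log entry; one field per tracked item.
interface SpecExecutionLog {
  specId: string;
  durationSeconds: Record<CommandName, number>;
  fixRetestCycles: number;                            // during ex-impl
  mutationScore: { initial: number; final: number };  // percent, from ex-verify
  exploratoryIssues: { critical: number; major: number; minor: number };
  deviations: string[];                               // things done differently than planned
  humanInterventions: number;                         // times the agent stopped for input
}

// Illustrative values only.
const exampleLog: SpecExecutionLog = {
  specId: 'M3-U06-P1-S3',
  durationSeconds: { 'ex-test': 420, 'ex-impl': 1310, 'ex-verify': 275 },
  fixRetestCycles: 2,
  mutationScore: { initial: 71, final: 84 },
  exploratoryIssues: { critical: 0, major: 1, minor: 3 },
  deviations: ['Used a composite index instead of two single-column indexes'],
  humanInterventions: 1,
};

// Sum of the three command durations for one spec.
function totalDurationSeconds(log: SpecExecutionLog): number {
  const d = log.durationSeconds;
  return d['ex-test'] + d['ex-impl'] + d['ex-verify'];
}

const serialized = JSON.stringify(exampleLog, null, 2); // what would be written to disk
```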

Milestone Review Signals

Signal	What It Means	Action
Duration trending up	Phase plans may need more detail	Review generate-phase output quality
Deviations climbing	Planning quality slipping	Review spec accuracy, add missing patterns
Rework rate above 10%	Specs missing critical details	Review generate-spec validation
Mutation score dropping	Test quality degrading	Tighten ESLint rules, review StrykerJS thresholds
Human interventions increasing	Specs not self-sufficient	Review spec completeness, reduce deferred decisions
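
The signals in this table could be derived from per-spec summaries roughly as follows. This is a sketch under several assumptions: `SpecSummary` and its fields are hypothetical, a "trend" is naively taken as the later half of the milestone averaging higher than the earlier half, and whether a spec counts as reworked is left to the caller because the source does not define rework rate.

```typescript
// Hypothetical per-spec summary, listed in execution order.
interface SpecSummary {
  durationSeconds: number;
  deviations: number;
  reworked: boolean;          // definition of "rework" is left to the team
  finalMutationScore: number; // percent
  humanInterventions: number;
}

function mean(xs: number[]): number {
  return xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Positive when the later half averages higher than the earlier half; 0 for fewer than 2 points.
function trend(xs: number[]): number {
  if (xs.length < 2) return 0;
  const mid = Math.floor(xs.length / 2);
  return mean(xs.slice(mid)) - mean(xs.slice(0, mid));
}

function reviewSignals(specs: SpecSummary[]): string[] {
  const signals: string[] = [];
  if (trend(specs.map((s) => s.durationSeconds)) > 0) signals.push('Duration trending up');
  if (trend(specs.map((s) => s.deviations)) > 0) signals.push('Deviations climbing');
  const reworkRate = specs.length === 0 ? 0 : specs.filter((s) => s.reworked).length / specs.length;
  if (reworkRate > 0.1) signals.push('Rework rate above 10%');
  if (trend(specs.map((s) => s.finalMutationScore)) < 0) signals.push('Mutation score dropping');
  if (trend(specs.map((s) => s.humanInterventions)) > 0) signals.push('Human interventions increasing');
  return signals;
}

// Illustrative data only.
const milestone: SpecSummary[] = [
  { durationSeconds: 600, deviations: 1, reworked: false, finalMutationScore: 82, humanInterventions: 0 },
  { durationSeconds: 640, deviations: 0, reworked: false, finalMutationScore: 80, humanInterventions: 1 },
  { durationSeconds: 900, deviations: 0, reworked: true, finalMutationScore: 74, humanInterventions: 0 },
  { durationSeconds: 950, deviations: 1, reworked: false, finalMutationScore: 71, humanInterventions: 1 },
];

const signals = reviewSignals(milestone);
```

A half-versus-half comparison flags any increase, however small, so a real review would likely add a minimum change threshold before acting on a signal.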

At the end of every milestone, run the apply-learnings command in milestone mode. It uses deep reasoning to analyze ALL exec logs and metrics for the milestone, identifying systemic patterns not visible at the per-spec level.
