Part 02 of 13

Decisions That Cannot Be Undone

The architecture blueprint locks in infrastructure decisions before the first feature is built. Changing your database, auth provider, or multi-tenancy model after 50 features are shipped is a rewrite, not a refactor.

Why Architecture Before Features

Technical founders often pick tools intuitively. "I have used Postgres before, so Postgres." "Auth0 looks fine." "We will figure out multi-tenancy later." This works for prototypes. It fails catastrophically for products built with AI agents, because every agent session inherits those decisions. A bad early decision does not just live in one file. It propagates across every spec, every phase plan, and every AI-generated implementation.

The architecture blueprint is a single document that answers every infrastructure question before any code is written. It has two parts: the research process (how decisions are made) and the decision table (what was decided and why).

Principle: Research before committing. The architecture blueprint is the most expensive document to change later. Invest 1-2 weeks in research; that front-loaded investment can save months of rework. As a rule of thumb, every hour spent researching a decision saves about 10 hours of migration later.

The Research Question Method

Before making any decision, identify every open question. Number them. Research them systematically. Document the findings. Then decide.

How It Works

Create a numbered list of research questions. Each question has a standard structure:

RQ-01: [Question title]
Context:    Why this matters for the product
Options:    The alternatives being considered
Criteria:   How we will evaluate (cost, complexity, scalability, DX, community)
Research:   Findings per option (documentation, benchmarks, community reports)
Decision:   What was chosen and why
Revisit:    Under what conditions this decision should be reconsidered
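
The template above can also be kept as structured data, so open questions are easy to list. The following is a minimal sketch, not a prescribed format: the type names, field names, and the `openQuestions` helper are assumptions chosen to mirror the template.

```typescript
// Hypothetical shape for one research question record. Field names mirror
// the template; none of these identifiers come from the original guide.
type ResearchOption = {
  name: string;
  findings: string; // notes from documentation, benchmarks, community reports
};

type ResearchQuestion = {
  id: string; // e.g. "RQ-01"
  title: string;
  context: string; // why this matters for the product
  options: ResearchOption[];
  criteria: string[]; // e.g. cost, complexity, scalability, DX, community
  decision?: { chosen: string; rationale: string }; // absent until decided
  revisitWhen?: string; // condition that should reopen the decision
};

// Returns the ids of questions that have no recorded decision yet.
function openQuestions(questions: ResearchQuestion[]): string[] {
  return questions.filter((q) => q.decision === undefined).map((q) => q.id);
}

const exampleQuestions: ResearchQuestion[] = [
  {
    id: "RQ-01",
    title: "Real-Time Sync Strategy",
    context: "Board changes must appear instantly for every viewer.",
    options: [
      { name: "Polling", findings: "Simple, but stale data between polls." },
      { name: "WebSocket", findings: "Bidirectional, more infrastructure." },
    ],
    criteria: ["latency", "complexity"],
    decision: { chosen: "WebSocket", rationale: "Speed is a feature." },
    revisitWhen: "Concurrency exceeds 1000 editors per workspace.",
  },
  {
    id: "RQ-07",
    title: "Multi-Tenancy Model",
    context: "Workspace data must be strictly isolated.",
    options: [{ name: "RLS", findings: "Postgres-native enforcement." }],
    criteria: ["isolation", "migration cost"],
  },
];

const stillOpen = openQuestions(exampleQuestions);
```

Here `stillOpen` contains only `RQ-07`, because it has no `decision` yet. A list like this makes it obvious when the research phase is actually finished.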

The research process is not "ask the AI which database to use." It is a structured investigation where the AI helps you gather and synthesize information, but you make the decision based on your product's specific needs.

Typical Research Questions

For a SaaS product, expect 15-30 research questions across these categories:

Category	Example Questions	Typical Count
Auth	Managed vs self-hosted? Which provider? Social login scope?	2-3
Database	Serverless vs provisioned? Connection pooling strategy? Branching model?	3-4
Multi-tenancy	RLS vs schema-per-tenant vs DB-per-tenant? Isolation testing approach?	2-3
Real-time	SSE vs WebSocket vs polling? Offline strategy? Conflict resolution?	2-4
File storage	S3 vs R2 vs managed? Presigned URLs? Virus scanning?	1-2
Background jobs	Framework comparison? Long-running task handling? Retry strategy?	2-3
AI/LLM	Abstraction layer? Provider routing? Cost tracking? Fallback strategy?	3-5
Observability	Error tracking, logging, analytics, APM. Free tier viability?	2-3
Deployment	Monorepo strategy? CI/CD pipeline? Environment management?	2-4

Linear Example: Research Questions (Selected)

RQ-01: Real-Time Sync Strategy

Context: Linear's core experience is speed. When one user moves an issue to "In Progress," every other user viewing that board must see the change instantly. This is not a "nice to have." It is fundamental to the product positioning.

Options considered:

Polling (5s interval): Simple. No infrastructure. But 5 seconds of stale data breaks the "instant" feel. Polling 50 users at 5s intervals = 600 requests/minute per board.
Server-Sent Events (SSE): One-directional. Server pushes updates. Simple to implement. Works through most proxies. But there is no client-to-server channel, so optimistic updates need separate HTTP requests.
WebSocket: Bidirectional. Low latency. But requires sticky sessions or a WebSocket gateway. More complex infrastructure. Connection management overhead.
Custom sync protocol (CRDT-based): Offline-first. Conflict-free merges. Works without network. But extreme complexity. 6-12 months of engineering for the sync engine alone.
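
The polling figure in the list above is simple arithmetic, sketched here so it can be re-run with other numbers. The function name is an assumption introduced for illustration.

```typescript
// Requests per minute generated by `users` clients that each poll once
// every `intervalSeconds` seconds.
function pollingRequestsPerMinute(users: number, intervalSeconds: number): number {
  if (users < 0 || intervalSeconds <= 0) {
    throw new RangeError("users must be >= 0 and intervalSeconds must be > 0");
  }
  return users * (60 / intervalSeconds);
}

// 50 users polling every 5 seconds: 50 * 12 = 600 requests per minute.
const perBoard = pollingRequestsPerMinute(50, 5);

// Tightening the interval to 1 second multiplies the load by five.
const perBoardAtOneSecond = pollingRequestsPerMinute(50, 1);
```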

Decision: WebSocket with a custom sync layer. Not full CRDT (too complex for v1), but a sync protocol that handles optimistic updates, conflict detection, and reconnection. The investment is high but aligns with the core positioning: speed is a feature.
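
One common way to implement the conflict detection mentioned in this decision is a per-record version counter. The sketch below assumes that approach; it is not a description of Linear's actual protocol, and every type and function name in it is hypothetical.

```typescript
// Hypothetical types for illustration only.
type Issue = { id: string; status: string; version: number };
type Update = { issueId: string; baseVersion: number; status: string };
type ApplyResult =
  | { kind: "applied"; issue: Issue }
  | { kind: "conflict"; current: Issue };

// Server side: accept an update only if the client edited the latest version.
function applyUpdate(current: Issue, update: Update): ApplyResult {
  if (update.issueId !== current.id) {
    throw new Error("update targets a different issue");
  }
  if (update.baseVersion !== current.version) {
    // The client's optimistic change was based on stale data;
    // it must roll back or rebase onto `current`.
    return { kind: "conflict", current };
  }
  return {
    kind: "applied",
    issue: { ...current, status: update.status, version: current.version + 1 },
  };
}

const serverIssue: Issue = { id: "ISS-1", status: "Todo", version: 1 };

// First client edits version 1: accepted, server moves to version 2.
const accepted = applyUpdate(serverIssue, {
  issueId: "ISS-1",
  baseVersion: 1,
  status: "In Progress",
});

// Second client still holds version 1: rejected as a conflict.
const latest = accepted.kind === "applied" ? accepted.issue : serverIssue;
const stale = applyUpdate(latest, {
  issueId: "ISS-1",
  baseVersion: 1,
  status: "Done",
});
```

The client applies its change locally right away (the optimistic part) and only reverts if the server answers with `conflict`.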

Revisit when: User concurrency exceeds 1,000 simultaneous editors per workspace. At that point, evaluate CRDT for conflict-free merging.

RQ-07: Multi-Tenancy Model

Context: Linear is a workspace-based product. Each company gets a workspace. Data must be strictly isolated. A bug that leaks one company's issues to another is a company-ending event.

Options considered:

Schema-per-tenant: Strong isolation. Easy to reason about. But every schema change must be migrated once per tenant: 10,000 schemas = 10,000 migration runs. In practice this becomes painful beyond roughly 100 tenants.
Database-per-tenant: Strongest isolation. Easy backups per tenant. But connection pooling nightmare. Cannot do cross-tenant analytics. Expensive.
Row-Level Security (RLS): Single schema. Postgres-native enforcement. Every query automatically filtered by workspace_id. Scales to millions of tenants. But requires discipline: every table needs the workspace_id column, every query must go through the RLS-enabled connection.

Decision: Row-Level Security. Single schema, single database, RLS policies on every table. A tenant-scoped connection wrapper ensures no query bypasses RLS. Integration tests explicitly verify isolation (create data in tenant A, attempt to read from tenant B, assert empty result).

Revisit when: A single tenant exceeds 10M rows per table. At that point, evaluate a partition-by-tenant strategy within the same RLS model.

Research note [R2]: Multi-tenancy isolation failures are the most dangerous class of bug in SaaS products. RLS provides database-level enforcement: a query that forgets its tenant filter still returns only the current tenant's rows, provided the connection runs as a role that is subject to RLS and the tenant context is set correctly. RLS policies must still be tested explicitly: create a test that inserts data for Tenant A, switches context to Tenant B, and asserts zero results. This test pattern is built into the spec template's test strategy section. See Research: R2.
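
For reference, the Postgres statements behind "RLS policies on every table" look roughly like the ones generated below. This is a sketch under stated assumptions: the `workspace_id` column is taken to be a `uuid`, and the session setting name `app.workspace_id`, the policy naming scheme, and the `rlsStatements` helper are all illustrative.

```typescript
// Builds the Postgres statements that put one table under tenant isolation.
// Assumptions: a `workspace_id uuid` column and a session setting named
// `app.workspace_id`. Both names are illustrative.
function rlsStatements(table: string): string[] {
  // Table names are interpolated into SQL, so accept plain identifiers only.
  if (!/^[a-z_][a-z0-9_]*$/.test(table)) {
    throw new Error(`unsafe table name: ${table}`);
  }
  const condition = "workspace_id = current_setting('app.workspace_id')::uuid";
  return [
    `ALTER TABLE ${table} ENABLE ROW LEVEL SECURITY;`,
    // FORCE makes the policies apply to the table owner as well.
    `ALTER TABLE ${table} FORCE ROW LEVEL SECURITY;`,
    // USING filters rows that are read; WITH CHECK validates rows that are written.
    `CREATE POLICY ${table}_tenant_isolation ON ${table} USING (${condition}) WITH CHECK (${condition});`,
  ];
}

const issuesPolicy = rlsStatements("issues");
```

Note that superusers and roles with the `BYPASSRLS` attribute skip these policies entirely, so the application should connect as an ordinary role.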

The Decision Table

After research is complete, consolidate every decision into a single table. This table is the reference that every downstream document checks against. If a PRD says "store files in S3" but the decision table says "Cloudflare R2," the PRD is wrong.

Structure

Every row answers: for this concern, what did we choose, what else was considered, why this choice, and when to revisit? (The partial example below omits the revisit column.)

Linear Example: Decision Table (Partial)

Concern	Decision	Alternatives Considered	Rationale
Auth	Clerk	Auth0, Supabase Auth, NextAuth, custom	Best React/Next.js integration. Pre-built components. Org/workspace support. Webhook events for sync.
Database	Postgres (Neon serverless)	PlanetScale, Supabase, CockroachDB, Aurora	Serverless branching for dev/test isolation. Auto-scaling. Postgres ecosystem (RLS, pg_trgm, tsvector).
ORM	Drizzle	Prisma, Kysely, raw SQL	Type-safe. SQL-like syntax. No query engine overhead. Schema-as-code for branching workflows.
Multi-tenancy	RLS	Schema-per-tenant, DB-per-tenant	Single schema scales to millions. DB-enforced isolation. Simpler migrations. Lower cost.
Real-time	WebSocket + custom sync	SSE, polling, CRDT (Yjs/Automerge)	Bidirectional for optimistic updates. Custom sync for conflict detection. CRDT too complex for v1.
File storage	Cloudflare R2	AWS S3, Supabase Storage	S3-compatible API. Zero egress fees. Global edge distribution. Lower cost at scale.
Email	Postmark	SendGrid, Resend, SES	Best deliverability reputation. Separate streams (transactional vs broadcast). Inbound parsing for email-to-issue.
Background jobs	Inngest	BullMQ, Quirrel, Trigger.dev	Serverless. Event-driven. Step functions for multi-step workflows. Built-in retry, throttling, concurrency control.
Heavy AI tasks	Trigger.dev	AWS Lambda, custom workers	Long-running (up to 5 min). Checkpointing. Separate from lightweight Inngest jobs. Good DX.
API layer	tRPC	REST, GraphQL	End-to-end type safety. No code generation. Works with TanStack Query. Co-located with Next.js.
Frontend state	TanStack Query + Zustand	Redux, Jotai, SWR	TanStack for server state (cache, refetch, optimistic). Zustand for client state (UI state, filters, selections).
LLM abstraction	Vercel AI SDK	LangChain, LlamaIndex, direct API	Streaming. Provider-agnostic. Works with Next.js. Structured output. Tool calling.
Observability	Sentry + PostHog + Axiom	Datadog, New Relic, custom	Free tiers cover launch. Sentry for errors. PostHog for analytics + session replay. Axiom for logs.
Deployment	Vercel (per-app projects)	AWS, Railway, Fly.io	Native Next.js support. Preview deployments. Zero-config CI/CD. Per-app projects in monorepo.

This table is not a suggestion list. It is a binding constraint. Every phase plan, every spec, and every AI agent must use these tools. If a research question arises during implementation ("should we use Redis for caching?"), the answer is either in this table or the table gets formally updated with the same research process.
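
Treating the table as a binding constraint is easier when it is machine-checkable. The sketch below shows one possible check, using two rows from the example table; the `Decision` type and `checkToolChoice` function are assumptions, not part of any existing tool.

```typescript
type Decision = { concern: string; decision: string; alternatives: string[] };

// Two rows copied from the example decision table above.
const decisionTable: Decision[] = [
  {
    concern: "File storage",
    decision: "Cloudflare R2",
    alternatives: ["AWS S3", "Supabase Storage"],
  },
  {
    concern: "Background jobs",
    decision: "Inngest",
    alternatives: ["BullMQ", "Quirrel", "Trigger.dev"],
  },
];

type CheckResult = { ok: true } | { ok: false; reason: string };

// Compares a tool proposed in a PRD, spec, or agent plan against the table.
function checkToolChoice(
  table: Decision[],
  concern: string,
  proposed: string,
): CheckResult {
  const row = table.find((d) => d.concern === concern);
  if (row === undefined) {
    return {
      ok: false,
      reason: `No decision recorded for "${concern}"; open a research question.`,
    };
  }
  if (row.decision !== proposed) {
    return {
      ok: false,
      reason: `"${concern}" is decided as ${row.decision}, not ${proposed}.`,
    };
  }
  return { ok: true };
}

const prdSaysS3 = checkToolChoice(decisionTable, "File storage", "AWS S3");
const redisQuestion = checkToolChoice(decisionTable, "Caching", "Redis");
```

Both example calls fail: the first because the table says Cloudflare R2, the second because caching has no row yet and therefore needs a research question before anyone adopts Redis.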

Multi-Tenancy: The Decision That Touches Everything

Multi-tenancy deserves special attention because it affects every table, every query, every API route, and every test. Get it wrong and you either leak data between tenants (catastrophic) or create a migration nightmare (expensive).

The Three Models

Figure: Three multi-tenancy models (schema-per-tenant, database-per-tenant, RLS). RLS is the recommended default for SaaS products.

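RLS is the recommended default for most SaaS products. It scales, it is database-enforced (meaning a query that forgets its tenant filter still cannot return another tenant's rows), and it has a single migration path. The tradeoff is discipline: every table needs a tenant_id column and an RLS policy, and every query must go through a tenant-scoped connection wrapper, because a connection that skips the wrapper or runs as a role exempt from RLS gets no protection.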

The Tenant Connection Pattern

The key implementation detail: create a wrapper function that sets the tenant context on every database connection. Every query in the application goes through this wrapper. Never use the raw database client directly.

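// Generic pattern (adapt to your ORM). `db` is the shared database client
// and `issues` is the table definition, both exported from packages/db.
function tenantDb(tenantId: string) {
  // Sets the Postgres session variable that the RLS policies read.
  // `$withAuth` stands in for whatever tenant-scoping hook your ORM or
  // driver provides; the exact call and its arguments differ by ORM.
  // Every query through the returned client is then filtered by a policy
  // equivalent to: tenant_id = current_setting('app.tenant_id')
  return db.$withAuth({ tenantId });
}

// In a tRPC router (named listIssues so it does not shadow the issues table)
const listIssues = protectedProcedure.query(async ({ ctx }) => {
  const scopedDb = tenantDb(ctx.tenantId);
  return scopedDb.select().from(issues); // RLS filters automatically
});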

Critical: Test tenant isolation explicitly. RLS policies can have subtle bugs (wrong column reference, missing policy on a new table). Every spec that creates a new table must include a test that: (1) inserts data for Tenant A, (2) switches to Tenant B's context, (3) asserts zero results returned. This test pattern is enforced in the spec template.
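
The three-step test can be sketched as follows. To keep the example self-contained, it runs against an in-memory stand-in that mimics tenant filtering; it does not exercise real RLS. In a real suite the same three steps would run against a Postgres test database through the tenant wrapper. All names here are hypothetical.

```typescript
type Row = { tenantId: string; title: string };

// In-memory stand-in for a table protected by a tenant policy.
class FakeIssuesTable {
  private rows: Row[] = [];

  // Returns a handle that can only write and read rows for one tenant,
  // mimicking what the tenant-scoped connection wrapper provides.
  forTenant(tenantId: string) {
    return {
      insert: (title: string): void => {
        this.rows.push({ tenantId, title });
      },
      selectAll: (): Row[] => this.rows.filter((r) => r.tenantId === tenantId),
    };
  }
}

// Throws if data written by one tenant is visible to another.
function assertTenantIsolation(table: FakeIssuesTable): void {
  const tenantA = table.forTenant("tenant-a");
  const tenantB = table.forTenant("tenant-b");

  tenantA.insert("Secret roadmap"); // (1) insert data for Tenant A
  const leaked = tenantB.selectAll(); // (2) switch to Tenant B's context
  if (leaked.length !== 0) {
    // (3) assert zero results
    throw new Error(`isolation failure: ${leaked.length} row(s) leaked`);
  }
  // Sanity check: Tenant A must still see its own row, otherwise the
  // test would also pass against a table that returns nothing at all.
  if (tenantA.selectAll().length !== 1) {
    throw new Error("tenant A cannot read its own row");
  }
}

assertTenantIsolation(new FakeIssuesTable());
```

The sanity check at the end matters: without it, a broken policy that hides every row from everyone would pass the isolation test.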

Monorepo Structure

For a multi-app product, a monorepo is the natural choice. All apps share types, UI components, database schema, and API contracts. The package structure enforces the centralization principle: apps are consumers, packages are owners.

apps/
  web/        Product app (customer-facing)
  admin/      Internal admin console
  docs/       Documentation site
packages/
  ui/         All shared UI components (sole owner)
  db/         Database schema, migrations, tenant wrapper (sole owner)
  api/        tRPC routers, API contracts (sole bridge to infrastructure)
  types/      Shared TypeScript types and Zod schemas
  auth/       Auth utilities, middleware, session management
  ai/         LLM gateway: metering, gating, cost tracking, provider routing
  email/      Email templates and delivery
  events/     Event bus, background job definitions
  billing/    Stripe integration, plan definitions
  metering/   Usage tracking, tier enforcement
  search/     Full-text search utilities
  testing/    Test utilities, factories, fixtures
  config/     Shared configuration (environment, feature flags)

The key constraints that make this work:

Apps import from packages, never from other apps. If apps/web needs something that lives in apps/admin, that code is in the wrong place: shared logic belongs in a package.
packages/api is the sole bridge to infrastructure. Apps never import packages/db directly. They go through tRPC routers in packages/api.
packages/ui is the sole owner of UI components. No component is defined in an app. All shared UI lives in one place.
packages/ai is the sole gateway for LLM calls. No app or package calls OpenAI or Anthropic directly. All calls route through the AI gateway for metering, cost tracking, and provider routing.

These constraints are not guidelines. They are enforced by linter rules, import restrictions, and architectural fitness tests (covered in Part 11: Preventing Drift).
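
As an illustration of what such a fitness test might check, the sketch below encodes three of the constraints as a pure function over workspace-level import pairs. It is an assumption about how a check could look, not the rule set from Part 11; a real setup would more likely express this in linter configuration. The SDK names listed are the public npm package names for the OpenAI and Anthropic clients.

```typescript
// npm packages that only packages/ai may import.
const LLM_SDKS = ["openai", "@anthropic-ai/sdk"];

// Returns a violation message, or null when the import is allowed.
// `importer` is a workspace such as "apps/web" or "packages/api";
// `imported` is a workspace or an npm package name.
function checkImport(importer: string, imported: string): string | null {
  if (imported.startsWith("apps/") && imported !== importer) {
    return "nothing may import from an app; move shared logic into a package";
  }
  if (importer.startsWith("apps/") && imported === "packages/db") {
    return "apps must reach the database through packages/api";
  }
  if (LLM_SDKS.includes(imported) && importer !== "packages/ai") {
    return "LLM SDKs may only be imported by packages/ai";
  }
  return null;
}

const violations = [
  checkImport("apps/web", "apps/admin"),
  checkImport("apps/web", "packages/db"),
  checkImport("packages/billing", "openai"),
].filter((v): v is string => v !== null);
```

All three sample imports are violations, so `violations` holds three messages. Imports such as `apps/web` to `packages/api` or `packages/ai` to `openai` return `null`.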

Linear Example: Package Boundaries

Linear's monorepo (based on public information and engineering blog posts) follows a similar structure. The sync engine is a separate package. The UI component library is shared. The API layer is centralized. The key insight: every team at Linear imports from the same UI library. There is no "team A's button" and "team B's button." There is one button, in one package, used everywhere.

For a solo founder, this matters even more. AI agents working on different features will independently create components if there is no central library. Within two milestones, you have three slightly different table components, two modal patterns, and five loading states. Centralizing UI into one package prevents this.

AI Pipeline Architecture (If AI-Native Product)

If the product uses AI/LLM capabilities (document extraction, classification, summarization, chat), the AI pipeline needs its own architectural decisions. This is not "call OpenAI." It is a system that meters usage, tracks cost per tenant, routes between providers, handles failures gracefully, and maintains accuracy targets.

The Centralized AI Gateway

All AI calls route through a single package. This gateway handles:

Provider routing: Choose between OpenAI, Anthropic, or local models based on task type, cost, and availability.
Metering: Track token usage per tenant per operation. Enforce tier limits.
Cost tracking: Calculate cost per call. Aggregate per tenant per billing period.
Fallback: If the primary provider is down, route to the secondary. If all providers are down, queue the request for retry.
Structured output: Parse LLM responses into typed objects. Validate against schemas. Handle parsing failures.
Confidence scoring: For extraction tasks, score each field's confidence. Flag low-confidence fields for human review.

Figure: The centralized AI gateway. All LLM calls route through one package for metering, cost tracking, and provider routing.
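
Two of the gateway responsibilities, fallback routing and per-tenant metering with cost tracking, can be sketched without any network calls. Everything below is hypothetical: the provider names, prices, and limits are placeholders, and a real gateway would wrap asynchronous SDK calls around this logic.

```typescript
type Provider = { name: string; healthy: boolean; usdPerMillionTokens: number };
type Route = { kind: "send"; provider: Provider } | { kind: "queue" };

// Fallback: the first healthy provider in priority order, else queue for retry.
function routeRequest(providers: Provider[]): Route {
  const provider = providers.find((p) => p.healthy);
  return provider ? { kind: "send", provider } : { kind: "queue" };
}

// Tracks token usage and cost per tenant and enforces a tier limit.
class UsageMeter {
  private tokensByTenant = new Map<string, number>();
  private costByTenant = new Map<string, number>();
  private tokenLimitPerTenant: number;

  constructor(tokenLimitPerTenant: number) {
    this.tokenLimitPerTenant = tokenLimitPerTenant;
  }

  // Throws before recording if the call would exceed the tenant's limit.
  record(tenantId: string, tokens: number, provider: Provider): void {
    const used = this.tokensByTenant.get(tenantId) ?? 0;
    if (used + tokens > this.tokenLimitPerTenant) {
      throw new Error(`tier limit exceeded for ${tenantId}`);
    }
    const cost = (tokens / 1000000) * provider.usdPerMillionTokens;
    this.tokensByTenant.set(tenantId, used + tokens);
    this.costByTenant.set(tenantId, (this.costByTenant.get(tenantId) ?? 0) + cost);
  }

  costFor(tenantId: string): number {
    return this.costByTenant.get(tenantId) ?? 0;
  }
}

// Placeholder providers: the primary is down, so traffic falls back.
const providers: Provider[] = [
  { name: "primary", healthy: false, usdPerMillionTokens: 4 },
  { name: "secondary", healthy: true, usdPerMillionTokens: 2 },
];

const meter = new UsageMeter(1000000);
const chosen = routeRequest(providers);
if (chosen.kind === "send") {
  // 500,000 tokens at 2 USD per million tokens costs 1 USD.
  meter.record("tenant-a", 500000, chosen.provider);
}
```

Because every call passes through `record`, the billing period total per tenant is a lookup rather than a log-scraping exercise.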

Accuracy Targets

For any AI feature that extracts or classifies data, define accuracy targets before implementation. These targets become test assertions in the spec.

Linear Example: AI in a Project Management Tool

Linear uses AI for: auto-triage (classify incoming issues by team and priority), duplicate detection (flag issues that are similar to existing ones), and writing assistance (generate issue descriptions from titles).

For auto-triage, the accuracy target might be: "Correctly assign team with 85%+ accuracy on issues with clear team signals in the title or description. For ambiguous issues, flag for manual assignment rather than guessing." This target becomes a test: run auto-triage on 100 labeled test issues, assert >= 85 correct assignments, assert zero catastrophic misassignments (e.g., billing issue routed to infrastructure team).
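
The accuracy assertion described above might look like the sketch below. It assumes that an issue flagged for manual assignment counts as not correct but never as catastrophic, and it uses synthetic data in place of a real labeled set. The type, the catastrophic pair list, and the function name are all hypothetical.

```typescript
// predictedTeam is null when the model flags the issue for manual assignment.
type TriageCase = { expectedTeam: string; predictedTeam: string | null };

// "expected->predicted" pairs treated as catastrophic misassignments.
const catastrophicPairs = new Set(["billing->infrastructure"]);

function evaluateTriage(cases: TriageCase[]) {
  let correct = 0;
  let catastrophic = 0;
  for (const c of cases) {
    if (c.predictedTeam === c.expectedTeam) {
      correct++;
    } else if (
      c.predictedTeam !== null &&
      catastrophicPairs.has(`${c.expectedTeam}->${c.predictedTeam}`)
    ) {
      catastrophic++;
    }
  }
  const accuracy = cases.length === 0 ? 0 : correct / cases.length;
  return { correct, catastrophic, accuracy };
}

// Synthetic stand-in for 100 labeled issues: 88 correct, 12 flagged.
const labeledIssues: TriageCase[] = Array.from(
  { length: 100 },
  (_, i): TriageCase => ({
    expectedTeam: "billing",
    predictedTeam: i < 88 ? "billing" : null,
  }),
);

const triageReport = evaluateTriage(labeledIssues);
if (triageReport.correct < 85) throw new Error("accuracy target missed");
if (triageReport.catastrophic !== 0) throw new Error("catastrophic misassignment");
```

In a real spec the synthetic array would be replaced by a fixed, versioned set of labeled issues so the target is measured against the same data on every run.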

What You Leave With

After completing this step, you have one document:

Document	Size	Key Output
Architecture Blueprint	15-30 pages	Decision table (15-30 decisions), research question findings, multi-tenancy strategy, monorepo structure, AI pipeline design, CI/CD pipeline outline

This document is the technical constitution. Every downstream command references it. When a phase plan needs to decide which job framework to use, it checks the architecture blueprint. When a spec needs to know how to structure an API route, it checks the blueprint. When an AI agent is unsure about a tool choice, it checks the blueprint.

No implementation command should ever re-derive an infrastructure decision. If the decision is not in the blueprint, add it through the research question process. If the decision is in the blueprint, follow it without question. This is what makes AI agents reliable: they do not make judgment calls about infrastructure. They follow the blueprint.

Next, the product needs a face. Part 3 covers the brand system, UI component inventory, and UX coherence guidelines that ensure every screen, built by any agent, looks and feels like the same product.
