Part 02 of 13

Decisions That Cannot Be Undone

The architecture blueprint locks in infrastructure decisions before the first feature is built. Changing your database, auth provider, or multi-tenancy model after 50 features are shipped is a rewrite, not a refactor.

~12 minute read

Why Architecture Before Features

Technical founders often pick tools intuitively. "I have used Postgres before, so Postgres." "Auth0 looks fine." "We will figure out multi-tenancy later." This works for prototypes. It fails catastrophically for products built with AI agents, because every agent session inherits those decisions. A bad early decision does not just live in one file. It propagates across every spec, every phase plan, and every AI-generated implementation.

The architecture blueprint is a single document that answers every infrastructure question before any code is written. It has two parts: the research process (how decisions are made) and the decision table (what was decided and why).

Principle: Research before committing. The architecture blueprint is the most expensive document to change later, so invest 1-2 weeks in research up front. That investment can save months of rework. As a rule of thumb, an hour spent researching a decision saves about 10 hours of migration later.

The Research Question Method

Before making any decision, identify every open question. Number them. Research them systematically. Document the findings. Then decide.

How It Works

Create a numbered list of research questions. Each question has a standard structure:

RQ-01: [Question title]
Context:    Why this matters for the product
Options:    The alternatives being considered
Criteria:   How we will evaluate (cost, complexity, scalability, DX, community)
Research:   Findings per option (documentation, benchmarks, community reports)
Decision:   What was chosen and why
Revisit:    Under what conditions this decision should be reconsidered

The research process is not "ask the AI which database to use." It is a structured investigation where the AI helps you gather and synthesize information, but you make the decision based on your product's specific needs.
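The template above can also be kept as structured data so that incomplete questions are easy to spot. The following is a minimal sketch; the interface and the `missingBeforeDecision` helper are hypothetical names chosen for illustration, not part of any tool the guide prescribes.

```typescript
// Hypothetical shape mirroring the RQ template; field names are assumptions.
interface ResearchQuestion {
  id: string; // e.g. "RQ-01"
  title: string;
  context: string;
  options: string[];
  criteria: string[];
  research: Record<string, string>; // findings keyed by option name
  decision?: string; // left empty until research is complete
  revisit?: string;
}

// Lists what is still missing before a question can be marked as decided.
function missingBeforeDecision(rq: ResearchQuestion): string[] {
  const problems: string[] = [];
  if (rq.options.length < 2) problems.push("needs at least two options");
  if (rq.criteria.length === 0) problems.push("no evaluation criteria");
  for (const option of rq.options) {
    if (!rq.research[option]?.trim()) problems.push(`no findings for ${option}`);
  }
  if (!rq.revisit?.trim()) problems.push("no revisit condition");
  return problems;
}
```

A check like this could run before a question is moved into the decision table, so that no row is recorded without findings for every option it rejected.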

Typical Research Questions

For a SaaS product, expect 15-30 research questions across these categories:

| Category | Example Questions | Count |
|---|---|---|
| Auth | Managed vs self-hosted? Which provider? Social login scope? | 2-3 |
| Database | Serverless vs provisioned? Connection pooling strategy? Branching model? | 3-4 |
| Multi-tenancy | RLS vs schema-per-tenant vs DB-per-tenant? Isolation testing approach? | 2-3 |
| Real-time | SSE vs WebSocket vs polling? Offline strategy? Conflict resolution? | 2-4 |
| File storage | S3 vs R2 vs managed? Presigned URLs? Virus scanning? | 1-2 |
| Background jobs | Framework comparison? Long-running task handling? Retry strategy? | 2-3 |
| AI/LLM | Abstraction layer? Provider routing? Cost tracking? Fallback strategy? | 3-5 |
| Observability | Error tracking, logging, analytics, APM. Free tier viability? | 2-3 |
| Deployment | Monorepo strategy? CI/CD pipeline? Environment management? | 2-4 |

Linear Example: Research Questions (Selected)

RQ-01: Real-Time Sync Strategy

Context: Linear's core experience is speed. When one user moves an issue to "In Progress," every other user viewing that board must see the change instantly. This is not a "nice to have." It is fundamental to the product positioning.

Options considered:

  • Polling (5s interval): Simple. No infrastructure. But 5 seconds of stale data breaks the "instant" feel. Polling 50 users at 5s intervals = 600 requests/minute per board.
  • Server-Sent Events (SSE): One-directional: the server pushes updates. Simple to implement. Works through most proxies. But there is no client-to-server channel on the same connection, so optimistic updates need separate HTTP requests.
  • WebSocket: Bidirectional. Low latency. But requires sticky sessions or a WebSocket gateway. More complex infrastructure. Connection management overhead.
  • Custom sync protocol (CRDT-based): Offline-first. Conflict-free merges. Works without network. But extreme complexity. 6-12 months of engineering for the sync engine alone.
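The polling figure in the first option follows from simple arithmetic, sketched here as a small helper. The function name is an illustrative assumption.

```typescript
// Requests per minute generated by clients polling a single board.
function pollingRequestsPerMinute(users: number, intervalSeconds: number): number {
  if (users < 0 || intervalSeconds <= 0) {
    throw new RangeError("users must be >= 0 and intervalSeconds must be > 0");
  }
  // Each user sends 60 / intervalSeconds requests per minute.
  return users * (60 / intervalSeconds);
}

// 50 users at a 5 second interval: 50 * 12 = 600 requests per minute.
const perBoard = pollingRequestsPerMinute(50, 5);
```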

Decision: WebSocket with a custom sync layer. Not full CRDT (too complex for v1), but a sync protocol that handles optimistic updates, conflict detection, and reconnection. The investment is high but aligns with the core positioning: speed is a feature.

Revisit when: User concurrency exceeds 1,000 simultaneous editors per workspace. At that point, evaluate CRDTs for conflict-free merging.
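To make "optimistic updates with conflict detection" concrete, here is a minimal in-memory sketch that uses a per-issue version counter. It is an illustration under stated assumptions, not Linear's actual protocol: the `IssueServer` class, the `optimisticUpdate` function, and the version-check scheme are all hypothetical, and a real sync layer would run over a WebSocket and handle reconnection.

```typescript
type Status = "Todo" | "In Progress" | "Done";
interface Issue { id: string; status: Status; version: number }
interface Mutation { issueId: string; status: Status; baseVersion: number }
type Result =
  | { ok: true; issue: Issue }
  | { ok: false; reason: "conflict" | "not_found"; current?: Issue };

// Stand-in for the server side of the sync layer.
class IssueServer {
  private issues = new Map<string, Issue>();

  seed(issue: Issue): void {
    this.issues.set(issue.id, { ...issue });
  }

  apply(m: Mutation): Result {
    const current = this.issues.get(m.issueId);
    if (!current) return { ok: false, reason: "not_found" };
    // Conflict detection: the client edited a copy that is no longer current.
    if (current.version !== m.baseVersion) {
      return { ok: false, reason: "conflict", current: { ...current } };
    }
    const next: Issue = { ...current, status: m.status, version: current.version + 1 };
    this.issues.set(next.id, next);
    return { ok: true, issue: { ...next } };
  }
}

// Client side: render the change immediately, then confirm or roll back.
function optimisticUpdate(
  local: Issue,
  status: Status,
  server: IssueServer,
  render: (issue: Issue) => void,
): Issue {
  render({ ...local, status }); // the user sees the change before the server answers
  const result = server.apply({ issueId: local.id, status, baseVersion: local.version });
  const settled = result.ok ? result.issue : result.current ?? local;
  render(settled); // confirmed state, or the server's state after a conflict
  return settled;
}
```

If two clients edit the same stale copy, the first write wins and the second client is rolled back to the server's state. Whether to roll back, retry, or merge is a product decision this sketch leaves open.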

RQ-07: Multi-Tenancy Model

Context: Linear is a workspace-based product. Each company gets a workspace. Data must be strictly isolated. A bug that leaks one company's issues to another is a company-ending event.

Options considered:

  • Schema-per-tenant: Strong isolation. Easy to reason about. But 10,000 schemas = 10,000 migration runs on every schema change. Does not scale past ~100 tenants.
  • Database-per-tenant: Strongest isolation. Easy backups per tenant. But connection pooling nightmare. Cannot do cross-tenant analytics. Expensive.
  • Row-Level Security (RLS): Single schema. Postgres-native enforcement. Every query automatically filtered by workspace_id. Scales to millions of tenants. But requires discipline: every table needs the workspace_id column, every query must go through the RLS-enabled connection.

Decision: Row-Level Security. Single schema, single database, RLS policies on every table. A tenant-scoped connection wrapper ensures no query bypasses RLS. Integration tests explicitly verify isolation (create data in tenant A, attempt to read from tenant B, assert empty result).

Revisit when: A single tenant exceeds 10M rows per table. At that point, evaluate a partition-by-tenant strategy within the same RLS model.

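Research note [R2]: Multi-tenancy isolation failures are the most dangerous class of bug in SaaS products. RLS provides database-level enforcement, so an application query that forgets its tenant filter still returns only the current tenant's rows. The protection holds only for roles that are subject to the policies: in Postgres, superusers and roles with BYPASSRLS skip RLS, and so does the table owner unless the table uses FORCE ROW LEVEL SECURITY. RLS policies must therefore be tested explicitly: create a test that inserts data for Tenant A, switches context to Tenant B, and asserts zero results. This test pattern is built into the spec template's test strategy section. See Research: R2.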
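The three-step isolation test described above can be written once and reused for every table. The sketch below assumes a hypothetical `TenantScopedStore` interface; the in-memory class only mimics what an RLS policy does, and a real test would run against Postgres through the tenant-scoped connection.

```typescript
// Hypothetical interface a tenant-scoped data layer is assumed to expose.
interface TenantScopedStore {
  insert(tenantId: string, table: string, row: Record<string, unknown>): void;
  selectAll(tenantId: string, table: string): Record<string, unknown>[];
}

// In-memory stand-in: rows are visible only to the tenant that owns them.
class InMemoryStore implements TenantScopedStore {
  private rows: { tenantId: string; table: string; row: Record<string, unknown> }[] = [];

  insert(tenantId: string, table: string, row: Record<string, unknown>): void {
    this.rows.push({ tenantId, table, row });
  }

  selectAll(tenantId: string, table: string): Record<string, unknown>[] {
    return this.rows
      .filter((r) => r.tenantId === tenantId && r.table === table)
      .map((r) => r.row);
  }
}

// Throws if tenant B can see anything that tenant A wrote to `table`.
function checkTenantIsolation(store: TenantScopedStore, table: string): void {
  store.insert("tenant-a", table, { title: "Tenant A only" }); // 1. insert as tenant A
  const leaked = store.selectAll("tenant-b", table); // 2. read as tenant B
  if (leaked.length !== 0) { // 3. assert zero results
    throw new Error(`${table}: tenant B can read ${leaked.length} row(s) owned by tenant A`);
  }
}
```

Calling `checkTenantIsolation(store, "issues")` once per table gives each spec the same isolation assertion without rewriting it.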

The Decision Table

After research is complete, consolidate every decision into a single table. This table is the reference that every downstream document checks against. If a PRD says "store files in S3" but the decision table says "Cloudflare R2," the PRD is wrong.

Structure

Every row answers: for this concern, what did we choose, what else was considered, why this choice, and when to revisit?

Linear Example: Decision Table (Partial)

| Concern | Decision | Alternatives Considered | Rationale |
|---|---|---|---|
| Auth | Clerk | Auth0, Supabase Auth, NextAuth, custom | Best React/Next.js integration. Pre-built components. Org/workspace support. Webhook events for sync. |
| Database | Postgres (Neon serverless) | PlanetScale, Supabase, CockroachDB, Aurora | Serverless branching for dev/test isolation. Auto-scaling. Postgres ecosystem (RLS, pg_trgm, tsvector). |
| ORM | Drizzle | Prisma, Kysely, raw SQL | Type-safe. SQL-like syntax. No query engine overhead. Schema-as-code for branching workflows. |
| Multi-tenancy | RLS | Schema-per-tenant, DB-per-tenant | Single schema scales to millions. DB-enforced isolation. Simpler migrations. Lower cost. |
| Real-time | WebSocket + custom sync | SSE, polling, CRDT (Yjs/Automerge) | Bidirectional for optimistic updates. Custom sync for conflict detection. CRDT too complex for v1. |
| File storage | Cloudflare R2 | AWS S3, Supabase Storage | S3-compatible API. Zero egress fees. Global edge distribution. Lower cost at scale. |
| Email | Postmark | SendGrid, Resend, SES | Best deliverability reputation. Separate streams (transactional vs broadcast). Inbound parsing for email-to-issue. |
| Background jobs | Inngest | BullMQ, Quirrel, Trigger.dev | Serverless. Event-driven. Step functions for multi-step workflows. Built-in retry, throttling, concurrency control. |
| Heavy AI tasks | Trigger.dev | AWS Lambda, custom workers | Long-running (up to 5 min). Checkpointing. Separate from lightweight Inngest jobs. Good DX. |
| API layer | tRPC | REST, GraphQL | End-to-end type safety. No code generation. Works with TanStack Query. Co-located with Next.js. |
| Frontend state | TanStack Query + Zustand | Redux, Jotai, SWR | TanStack for server state (cache, refetch, optimistic). Zustand for client state (UI state, filters, selections). |
| LLM abstraction | Vercel AI SDK | LangChain, LlamaIndex, direct API | Streaming. Provider-agnostic. Works with Next.js. Structured output. Tool calling. |
| Observability | Sentry + PostHog + Axiom | Datadog, New Relic, custom | Free tiers cover launch. Sentry for errors. PostHog for analytics + session replay. Axiom for logs. |
| Deployment | Vercel (per-app projects) | AWS, Railway, Fly.io | Native Next.js support. Preview deployments. Zero-config CI/CD. Per-app projects in monorepo. |

This table is not a suggestion list. It is a binding constraint. Every phase plan, every spec, and every AI agent must use these tools. If a research question arises during implementation ("should we use Redis for caching?"), the answer is either in this table or the table gets formally updated with the same research process.
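Because the table is binding, downstream documents can be checked against it mechanically. The following is a rough sketch, assuming the decisions are available as data. `Decision` and `findViolations` are hypothetical names, and whole-word text matching is a deliberately simple heuristic that will miss paraphrases.

```typescript
interface Decision {
  concern: string;
  chosen: string;
  rejected: string[]; // alternatives that were considered and not chosen
}

// Escapes characters that have a special meaning in a regular expression.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Flags every rejected alternative that a document mentions as a whole word.
function findViolations(decisions: Decision[], docText: string): string[] {
  const violations: string[] = [];
  for (const d of decisions) {
    for (const alt of d.rejected) {
      const pattern = new RegExp(`\\b${escapeRegExp(alt)}\\b`, "i");
      if (pattern.test(docText)) {
        violations.push(`${d.concern}: document mentions "${alt}", blueprint says "${d.chosen}"`);
      }
    }
  }
  return violations;
}

const decisions: Decision[] = [
  { concern: "File storage", chosen: "Cloudflare R2", rejected: ["S3", "Supabase Storage"] },
  { concern: "Email", chosen: "Postmark", rejected: ["SendGrid", "Resend", "SES"] },
];
```

Running `findViolations(decisions, prdText)` over a PRD that says "store files in S3" reports the file storage conflict. A flagged mention is a prompt for review, since a document may legitimately name an alternative when explaining why it was rejected.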

Multi-Tenancy: The Decision That Touches Everything

Multi-tenancy deserves special attention because it affects every table, every query, every API route, and every test. Get it wrong and you either leak data between tenants (catastrophic) or create a migration nightmare (expensive).

The Three Models

  • Database per tenant (Tenant A DB, Tenant B DB, Tenant C DB): Strongest isolation. Highest cost. Migration per database. Does not scale past ~50 tenants.
  • Schema per tenant (one DB, N schemas): Good isolation. Does not scale past ~100 tenants.
  • Row-Level Security (one DB, one schema): Every row has tenant_id. The database enforces filtering. Scales to millions of tenants. Single migration path.

Figure: Three multi-tenancy models. RLS is the recommended default for SaaS products.

RLS is the recommended default for most SaaS products. It scales, it is database-enforced (meaning an application query that forgets its tenant filter still cannot return another tenant's rows), and it has a single migration path. The tradeoff is discipline: every table needs a tenant_id column, and every query must go through a tenant-scoped connection wrapper.
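Since every table needs the same policy, the DDL can be generated rather than handwritten. This is a minimal sketch under stated assumptions: a `uuid` column named `tenant_id`, a session setting named `app.tenant_id`, and a policy called `tenant_isolation`. The `rlsStatements` helper itself is hypothetical.

```typescript
// Builds the Postgres statements that put one table under tenant-scoped RLS.
function rlsStatements(table: string): string[] {
  // Identifiers cannot be passed as query parameters, so validate before interpolating.
  if (!/^[a-z_][a-z0-9_]*$/.test(table)) {
    throw new Error(`unsafe table name: ${table}`);
  }
  const predicate = "tenant_id = current_setting('app.tenant_id')::uuid";
  return [
    `ALTER TABLE ${table} ENABLE ROW LEVEL SECURITY`,
    // FORCE makes the policy apply to the table owner as well.
    `ALTER TABLE ${table} FORCE ROW LEVEL SECURITY`,
    // USING filters reads; WITH CHECK rejects writes for another tenant.
    `CREATE POLICY tenant_isolation ON ${table} USING (${predicate}) WITH CHECK (${predicate})`,
  ];
}
```

Generating the statements from one function keeps a new table from shipping with a subtly different policy, which is one of the failure modes the isolation tests are meant to catch.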

The Tenant Connection Pattern

The key implementation detail: create a wrapper function that sets the tenant context on every database connection. Every query in the application goes through this wrapper. Never use the raw database client directly.

// Generic pattern (adapt to your ORM). `db.$withAuth` is a placeholder for
// whatever your client uses to bind the tenant to a connection.
function tenantDb(tenantId: string) {
  // Set the Postgres session variable that RLS policies read.
  // Every query through the returned connection is then filtered
  // by: WHERE tenant_id = current_setting('app.tenant_id')
  return db.$withAuth({ tenantId });
}

// In a tRPC router. `issues` is the table object from the shared schema,
// so the procedure gets its own name to avoid shadowing it.
const listIssues = protectedProcedure.query(async ({ ctx }) => {
  const scopedDb = tenantDb(ctx.tenantId);
  return scopedDb.select().from(issues); // RLS filters automatically
});

Critical: Test tenant isolation explicitly. RLS policies can have subtle bugs (wrong column reference, missing policy on a new table). Every spec that creates a new table must include a test that: (1) inserts data for Tenant A, (2) switches to Tenant B's context, (3) asserts zero results returned. This test pattern is enforced in the spec template.
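One way to implement the wrapper without relying on an ORM-specific helper is to set the variable with Postgres's `set_config` inside a transaction. The sketch below assumes two hypothetical interfaces, `SqlClient` and `Pool`, that would be mapped onto whatever driver is in use. Only `set_config` and its `is_local` argument are Postgres itself.

```typescript
// Hypothetical minimal client interfaces; map them onto your driver or ORM.
interface SqlClient {
  query(text: string, params?: unknown[]): Promise<unknown[]>;
}
interface Pool {
  // Runs `fn` inside one transaction on one connection.
  transaction<T>(fn: (tx: SqlClient) => Promise<T>): Promise<T>;
}

// Runs `work` with app.tenant_id set for the duration of a single transaction.
async function withTenant<T>(
  pool: Pool,
  tenantId: string,
  work: (tx: SqlClient) => Promise<T>,
): Promise<T> {
  return pool.transaction(async (tx) => {
    // is_local = true scopes the setting to this transaction, so a pooled
    // connection does not carry one tenant's context into the next request.
    await tx.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);
    return work(tx);
  });
}
```

Scoping the setting to the transaction matters with connection pooling: a session-level setting could outlive the request that set it and be inherited by a different tenant's query.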

Monorepo Structure

For a multi-app product, a monorepo is the natural choice. All apps share types, UI components, database schema, and API contracts. The package structure enforces the centralization principle: apps are consumers, packages are owners.

apps/
  web/         Product app (customer-facing)
  admin/       Internal admin console
  docs/        Documentation site
packages/
  ui/          All shared UI components (sole owner)
  db/          Database schema, migrations, tenant wrapper (sole owner)
  api/         tRPC routers, API contracts (sole bridge to infrastructure)
  types/       Shared TypeScript types and Zod schemas
  auth/        Auth utilities, middleware, session management
  ai/          LLM gateway: metering, gating, cost tracking, provider routing
  email/       Email templates and delivery
  events/      Event bus, background job definitions
  billing/     Stripe integration, plan definitions
  metering/    Usage tracking, tier enforcement
  search/      Full-text search utilities
  testing/     Test utilities, factories, fixtures
  config/      Shared configuration (environment, feature flags)

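The key constraints that make this work, as annotated in the structure above:

  • packages/ui is the sole owner of shared UI components. Apps consume them and do not define their own.
  • packages/db is the sole owner of the database schema, migrations, and the tenant wrapper.
  • packages/api is the sole bridge to infrastructure.
  • packages/ai is the single gateway for LLM calls.

These constraints are not guidelines. They are enforced by linter rules, import restrictions, and architectural fitness tests (covered in Part 11: Preventing Drift).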
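An import restriction of the kind mentioned here can be expressed as a small rule table that a fitness test runs over every import in the repository. This is a sketch under stated assumptions: the specific rules and the `importViolations` helper are illustrative, and the npm package names (`pg`, `drizzle-orm`, `openai`, `@anthropic-ai/sdk`) are examples of what a rule might match rather than a prescribed list.

```typescript
interface ImportRule {
  name: string;
  forbidden: RegExp; // import specifiers the rule restricts
  allowedIn: RegExp; // file paths that are exempt from the restriction
}

// Hypothetical rules derived from the package layout; adjust to your repo.
const importRules: ImportRule[] = [
  {
    name: "only packages/db may import the database driver or ORM",
    forbidden: /^(pg|drizzle-orm)(\/|$)/,
    allowedIn: /^packages\/db\//,
  },
  {
    name: "only packages/ai may import LLM SDKs",
    forbidden: /^(openai|@anthropic-ai\/sdk)(\/|$)/,
    allowedIn: /^packages\/ai\//,
  },
  {
    name: "packages never import from apps",
    forbidden: /^(\.\.\/)+apps\//,
    allowedIn: /^apps\//,
  },
];

// Returns the names of all rules that an import in `file` breaks.
function importViolations(file: string, specifier: string): string[] {
  return importRules
    .filter((rule) => rule.forbidden.test(specifier) && !rule.allowedIn.test(file))
    .map((rule) => rule.name);
}
```

A fitness test would walk the source tree, extract each file's import specifiers, and fail the build when `importViolations` returns anything. The extraction step is omitted here because it depends on the parser or linter in use.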

Linear Example: Package Boundaries

Linear's monorepo (based on public information and engineering blog posts) follows a similar structure. The sync engine is a separate package. The UI component library is shared. The API layer is centralized. The key insight: every team at Linear imports from the same UI library. There is no "team A's button" and "team B's button." There is one button, in one package, used everywhere.

For a solo founder, this matters even more. AI agents working on different features will independently create components if there is no central library. Within two milestones, you have three slightly different table components, two modal patterns, and five loading states. Centralizing UI into one package prevents this.

AI Pipeline Architecture (If AI-Native Product)

If the product uses AI/LLM capabilities (document extraction, classification, summarization, chat), the AI pipeline needs its own architectural decisions. This is not "call OpenAI." It is a system that meters usage, tracks cost per tenant, routes between providers, handles failures gracefully, and maintains accuracy targets.

The Centralized AI Gateway

All AI calls route through a single package. This gateway handles:

  • Metering: tokens per tenant per operation, recorded in a metering database
  • Cost tracking
  • Provider routing: Anthropic (Claude) for complex extraction, OpenAI (GPT-4) for classification, local or fine-tuned models for high-volume, low-cost work
  • Fallback
  • Confidence

Figure: The centralized AI gateway. App calls (Extract, Classify, Summarize) route through one package (packages/ai) for metering, cost tracking, and provider routing.
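A stripped-down version of such a gateway might look like the sketch below. The `Provider` interface, `AiGateway` class, and route table are hypothetical and no real SDK is called. In practice each provider would wrap the LLM abstraction named in the decision table, and usage records would go to a database rather than an array.

```typescript
// Hypothetical provider interface; real SDK calls would sit behind it.
interface Provider {
  name: string;
  complete(prompt: string): Promise<{ text: string; tokens: number }>;
}
type Operation = "extract" | "classify" | "summarize";
interface UsageRecord { tenantId: string; operation: Operation; provider: string; tokens: number }

class AiGateway {
  readonly usage: UsageRecord[] = [];
  private routes: Record<Operation, Provider[]>;

  // For each operation: the primary provider first, then fallbacks in order.
  constructor(routes: Record<Operation, Provider[]>) {
    this.routes = routes;
  }

  async run(tenantId: string, operation: Operation, prompt: string): Promise<string> {
    let lastError: unknown;
    for (const provider of this.routes[operation]) {
      try {
        const { text, tokens } = await provider.complete(prompt);
        // Metering: record tokens per tenant per operation.
        this.usage.push({ tenantId, operation, provider: provider.name, tokens });
        return text;
      } catch (err) {
        lastError = err; // fall through to the next provider
      }
    }
    throw new Error(`all providers failed for ${operation}: ${String(lastError)}`);
  }

  tokensFor(tenantId: string): number {
    return this.usage
      .filter((u) => u.tenantId === tenantId)
      .reduce((sum, u) => sum + u.tokens, 0);
  }
}
```

Because apps only ever call `run`, swapping a provider or changing the fallback order is a change to the route table in one package, not to every feature that uses AI.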

Accuracy Targets

For any AI feature that extracts or classifies data, define accuracy targets before implementation. These targets become test assertions in the spec.

Linear Example: AI in a Project Management Tool

Linear uses AI for: auto-triage (classify incoming issues by team and priority), duplicate detection (flag issues that are similar to existing ones), and writing assistance (generate issue descriptions from titles).

For auto-triage, the accuracy target might be: "Correctly assign team with 85%+ accuracy on issues with clear team signals in the title or description. For ambiguous issues, flag for manual assignment rather than guessing." This target becomes a test: run auto-triage on 100 labeled test issues, assert >= 85 correct assignments, assert zero catastrophic misassignments (e.g., billing issue routed to infrastructure team).
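The test described above could be sketched as an evaluation function plus a pass/fail check. The names are hypothetical, and the sketch assumes accuracy is measured as correct assignments over all labeled samples, with issues flagged for manual assignment counted as neither correct nor catastrophic.

```typescript
interface LabeledIssue { title: string; expectedTeam: string }
interface TriageReport {
  total: number;
  correct: number;
  flagged: number; // classifier declined and asked for manual assignment
  catastrophic: number; // wrong assignments that must never happen
  accuracy: number; // correct / total
}

// `classify` returns a team name, or null to flag the issue for manual assignment.
// `isCatastrophic` encodes forbidden pairs, e.g. billing routed to infrastructure.
function evaluateTriage(
  samples: LabeledIssue[],
  classify: (title: string) => string | null,
  isCatastrophic: (expected: string, actual: string) => boolean,
): TriageReport {
  let correct = 0;
  let flagged = 0;
  let catastrophic = 0;
  for (const sample of samples) {
    const team = classify(sample.title);
    if (team === null) flagged++;
    else if (team === sample.expectedTeam) correct++;
    else if (isCatastrophic(sample.expectedTeam, team)) catastrophic++;
  }
  const accuracy = samples.length === 0 ? 0 : correct / samples.length;
  return { total: samples.length, correct, flagged, catastrophic, accuracy };
}

// The spec-level assertion: accuracy target met and no catastrophic misassignments.
function meetsTarget(report: TriageReport, minAccuracy = 0.85): boolean {
  return report.accuracy >= minAccuracy && report.catastrophic === 0;
}
```

Keeping the catastrophic count separate from the accuracy figure means a classifier cannot pass by being right on average while still making the one kind of mistake the product cannot tolerate.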

What You Leave With

After completing this step, you have one document:

| Document | Size | Key Output |
|---|---|---|
| Architecture Blueprint | 15-30 pages | Decision table (15-30 decisions), research question findings, multi-tenancy strategy, monorepo structure, AI pipeline design, CI/CD pipeline outline |

This document is the technical constitution. Every downstream command references it. When a phase plan needs to decide which job framework to use, it checks the architecture blueprint. When a spec needs to know how to structure an API route, it checks the blueprint. When an AI agent is unsure about a tool choice, it checks the blueprint.

No implementation command should ever re-derive an infrastructure decision. If the decision is not in the blueprint, add it through the research question process. If the decision is in the blueprint, follow it without question. This is what makes AI agents reliable: they do not make judgment calls about infrastructure. They follow the blueprint.

Next, the product needs a face. Part 3 covers the brand system, UI component inventory, and UX coherence guidelines that ensure every screen, built by any agent, looks and feels like the same product.