Part 10 of 13

Keeping It Running

Observability stack, SLO monitoring, and an autonomous ops agent that diagnoses problems the moment they appear, not when you notice them.


The Observability Stack

Observability is not optional, even for a solo founder. Without it, you learn about production issues from customer complaints. With it, you learn before customers notice.

Five Pillars

Pillar	What It Monitors	Tool Category
Error tracking	Unhandled exceptions, error rates, error trends	Sentry or equivalent
Logging	Structured logs from all services, searchable, queryable	Axiom, Datadog, or equivalent
Analytics	User behavior, feature adoption, session replay	PostHog, Amplitude, or equivalent
Uptime	Health endpoint monitoring, SSL expiry, DNS resolution	UptimeRobot, Better Uptime
APM	Response times (p50/p95/p99), database query times, external API latency	New Relic, Datadog APM

Start with free tiers. Most observability tools have generous free tiers that cover early-stage products: for example, Sentry (5K errors/month), PostHog (1M events/month), Axiom (500GB/month), and UptimeRobot (50 monitors). Free-tier limits change, so confirm the current numbers before relying on them. You can run a full observability stack for $0/month through launch and early growth.

What to Instrument

Every API route: Response time, status code, tenant ID. Structured log per request.
Every database query: Query time, rows returned. Flag queries over 100ms.
Every AI/LLM call: Provider, model, tokens in/out, latency, cost. Per tenant.
Every background job: Start time, end time, success/failure, retry count.
Every file operation: Upload size, download latency, storage provider response time.
Client-side: Page load time, Largest Contentful Paint (LCP), Interaction to Next Paint (INP), JavaScript errors.
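As a minimal sketch of the "structured log per request" and "flag queries over 100ms" items above, the Python below emits one JSON line per API request and marks slow queries. The names log_event, timed_request, and query_record are hypothetical helpers invented for illustration, and printing to stdout stands in for whatever logging tool you ship logs to.

```python
import json
import time
from contextlib import contextmanager

SLOW_QUERY_MS = 100  # threshold taken from the list above


def log_event(event: str, **fields) -> str:
    """Serialize one structured log line as JSON, print it, and return it."""
    line = json.dumps({"event": event, "ts": time.time(), **fields}, sort_keys=True)
    print(line)
    return line


@contextmanager
def timed_request(route: str, tenant_id: str):
    """Time a request handler and emit one log line when it finishes."""
    ctx = {"status": 200}  # the handler may overwrite this
    start = time.perf_counter()
    try:
        yield ctx
    except Exception:
        ctx["status"] = 500  # assumption: unhandled errors map to HTTP 500
        raise
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        log_event("api_request", route=route, tenant_id=tenant_id,
                  status=ctx["status"], duration_ms=round(duration_ms, 2))


def query_record(query_name: str, duration_ms: float, rows: int) -> dict:
    """Build the log fields for one database query, flagging slow ones."""
    return {"query": query_name, "duration_ms": duration_ms, "rows": rows,
            "slow": duration_ms > SLOW_QUERY_MS}


with timed_request("/api/invoices", tenant_id="tenant_123") as ctx:
    log_event("db_query", **query_record("list_invoices", 140.0, 25))
    ctx["status"] = 200
```

Keeping tenant ID as a field on every line is what makes per-tenant queries possible later in the logging tool.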

SLO Definition

Service Level Objectives define "how good is good enough." Without SLOs, everything feels urgent. With SLOs, you know which degradations matter and which are within tolerance.

Example SLO Table

Operation	Target (p95)	Critical (p99)	Measurement
Page load (initial)	<1.5s	<3s	Client-side timing via analytics
API response (read)	<200ms	<500ms	Server-side timing via APM
API response (write)	<500ms	<1s	Server-side timing via APM
Search query	<300ms	<800ms	Server-side timing
AI extraction	<30s	<60s	Background job timing
Uptime	99.9%	99.5%	Health endpoint monitoring
Error rate	<0.1%	<1%	Error tracker event rate
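To make the uptime row concrete, the arithmetic below converts an availability percentage into allowed downtime. The 30-day window is an assumption for illustration; the table does not state a measurement window, and downtime_budget_minutes is a hypothetical helper name.

```python
def downtime_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed in the window at a given availability."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


print(round(downtime_budget_minutes(99.9), 1))  # 43.2 minutes per 30 days
print(round(downtime_budget_minutes(99.5), 1))  # 216.0 minutes per 30 days
```

Under that assumption, the 99.9% target leaves roughly 43 minutes of downtime per month, and the 99.5% critical threshold is crossed at about 3.6 hours.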

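SLO breaches generate alerts. A warning-level breach (the p95 target is exceeded) is logged and reported at medium severity. A critical breach (the p99 threshold is exceeded) is escalated at high severity and enters the autonomous diagnosis flow. The notification channel for each severity level is defined under Active Alerting. For the uptime and error-rate rows, the two columns are warning and critical thresholds rather than percentiles.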
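The breach logic for a latency row can be sketched as follows. This is one possible reading, not a definitive implementation: LatencySLO, percentile, and evaluate are hypothetical names, the nearest-rank percentile method is an assumption (APM tools may interpolate differently), and a value exactly equal to a limit is treated as within tolerance.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySLO:
    name: str
    target_p95_ms: float
    critical_p99_ms: float


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate(slo: LatencySLO, samples_ms: list[float]) -> str:
    """Return 'critical', 'warning', or 'ok' for one measurement window."""
    if percentile(samples_ms, 99) > slo.critical_p99_ms:
        return "critical"
    if percentile(samples_ms, 95) > slo.target_p95_ms:
        return "warning"
    return "ok"


api_read = LatencySLO("API response (read)", target_p95_ms=200, critical_p99_ms=500)
window = [100.0] * 90 + [250.0] * 10  # 10% of requests took 250ms
print(evaluate(api_read, window))  # warning
```

The critical check runs first so that a window breaching both limits is reported once, at the higher level.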

The Autonomous Ops Agent

For a solo founder, an always-on monitoring agent is not a luxury. It is a necessity. You cannot watch dashboards 24/7. The ops agent watches for you and acts when something goes wrong.

Architecture

The ops agent runs on a separate, always-on machine (dedicated server or cloud instance). It has: full code checkout access (for diagnosis), API access to all observability tools, database read access (for health queries), and the ability to create issues and PRs.

Integration Points

The agent pulls from every service in the stack:

Hosting platform: Deployment status, build logs, runtime logs, function invocations.
Database: Connection count, query performance, replication lag, storage usage.
Error tracking: New errors, error rate trends, release correlation.
Logging: Structured log queries, anomaly detection.
Analytics: Feature flag state, session replay links for error contexts.
Background jobs: Queue depth, failure rates, retry counts, latency trends.
Auth: Active sessions, failed auth attempts, MFA adoption.
Email: Delivery rates, bounce rates, spam reports.
Payments: Failed charges, MRR, churn indicators.
AI/LLM providers: Token usage, cost per provider, error rates, latency trends.

Three Modes of Operation

Mode 1: Passive Monitoring

Continuous polling of all integration points. Threshold tracking. Trend visualization on a dashboard. No action taken. This is the baseline: "I can see what is happening."
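A minimal sketch of passive monitoring, assuming each integration point can be wrapped in a function that returns a number. Metric and PassiveMonitor are hypothetical names, and the lambdas stand in for real API calls to your hosting platform, database, or job queue.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Metric:
    name: str
    read: Callable[[], float]  # hypothetical reader for one integration point
    threshold: float


@dataclass
class PassiveMonitor:
    metrics: list[Metric]
    history: dict[str, list[float]] = field(default_factory=dict)

    def poll_once(self) -> dict[str, bool]:
        """Read every metric, store the value, and report which exceed their threshold."""
        over = {}
        for m in self.metrics:
            value = m.read()
            self.history.setdefault(m.name, []).append(value)  # kept for trend charts
            over[m.name] = value > m.threshold
        return over


monitor = PassiveMonitor([
    Metric("queue_depth", lambda: 12, threshold=100),
    Metric("db_connections", lambda: 95, threshold=80),
])
print(monitor.poll_once())  # {'queue_depth': False, 'db_connections': True}
```

In this mode the result is only recorded and displayed. A real agent would call poll_once on a timer, and Mode 2 would act on the values that come back True.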

Mode 2: Active Alerting

When thresholds breach, the agent sends notifications through the appropriate channel based on severity:

Severity	Channel	Examples
Low	Dashboard log	Bundle size increase, minor performance regression
Medium	Email	New error type, p95 SLO warning breach, job latency increase
High	Email + issue	p99 SLO critical breach, error rate spike
Critical	PagerDuty + email + issue	Service down, data integrity issue, payment processing failure
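The routing table above maps directly onto a lookup. In this sketch the channel names and the route_alert function are hypothetical, and the senders argument is a dictionary of callables you would supply (for example, thin wrappers around your email provider and paging tool); no real vendor API is shown.

```python
from enum import Enum
from typing import Callable


class Severity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


# Mirrors the severity table: each level lists every channel it notifies.
CHANNELS = {
    Severity.LOW: ["dashboard_log"],
    Severity.MEDIUM: ["email"],
    Severity.HIGH: ["email", "issue"],
    Severity.CRITICAL: ["pagerduty", "email", "issue"],
}


def route_alert(severity: Severity, message: str,
                senders: dict[str, Callable[[str], None]]) -> list[str]:
    """Send the message through each channel for this severity; return the channels used."""
    used = []
    for channel in CHANNELS[severity]:
        senders[channel](message)
        used.append(channel)
    return used


# Stand-in senders that just print; replace with real integrations.
senders = {name: (lambda msg, n=name: print(f"[{n}] {msg}"))
           for name in ["dashboard_log", "email", "issue", "pagerduty"]}
route_alert(Severity.HIGH, "error rate spike on /api/invoices", senders)
```

Keeping the mapping in one dictionary means a change to the escalation policy is a one-line edit rather than a change to alerting logic.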

Mode 3: Autonomous Remediation

The most advanced mode. When the agent detects a critical issue, it does not just notify. It starts working on a fix. This is the same three-tier auto-fix system from the deployment pipeline (Part 9), but triggered by runtime issues rather than deployment failures.

The agent checks out the code, reads the error context, proposes a fix, and either applies it automatically (Tier 1), creates a PR (Tier 2), or creates a detailed issue (Tier 3). The key difference from deployment diagnosis: runtime issues have live user impact, so the agent prioritizes speed for Tier 1 fixes and includes rollback recommendations for Tier 2-3.

Autonomous mode requires guardrails. The agent should never auto-commit database schema changes, security-related fixes, or anything that modifies payment processing. These are always Tier 2 or Tier 3 (human review required). The tier classification is configurable and should be conservative at launch, expanding as confidence grows.
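One way to enforce these guardrails is to clamp whatever tier the agent proposes. This is a sketch under assumptions: the category labels, classify_tier, and action_for are hypothetical, and how a change gets labeled (for example, by the file paths it touches) is left out.

```python
# Categories that must never be auto-applied (Tier 1), per the guardrails above.
PROTECTED = {"schema", "security", "payments"}


def classify_tier(categories: set[str], proposed_tier: int) -> int:
    """Clamp a proposed tier so protected changes always get human review."""
    if proposed_tier not in (1, 2, 3):
        raise ValueError("tier must be 1, 2, or 3")
    if categories & PROTECTED:
        return max(proposed_tier, 2)  # Tier 2 (PR) or Tier 3 (issue) only
    return proposed_tier


def action_for(tier: int) -> str:
    """Map a tier to the agent's action: auto-apply, open a PR, or open an issue."""
    return {1: "apply_fix", 2: "open_pull_request", 3: "open_issue"}[tier]


print(action_for(classify_tier({"schema"}, proposed_tier=1)))   # open_pull_request
print(action_for(classify_tier({"logging"}, proposed_tier=1)))  # apply_fix
```

Starting conservative then means adding more categories to PROTECTED at launch and removing them as confidence grows.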
