Part 10 of 13

Keeping It Running

Observability stack, SLO monitoring, and an autonomous ops agent that diagnoses problems the moment they appear, not when you notice them.

~8 minute read

The Observability Stack

Observability is not optional, even for a solo founder. Without it, you learn about production issues from customer complaints. With it, you learn before customers notice.

Five Pillars

| Pillar | What It Monitors | Tool Category |
| --- | --- | --- |
| Error tracking | Unhandled exceptions, error rates, error trends | Sentry or equivalent |
| Logging | Structured logs from all services, searchable, queryable | Axiom, Datadog, or equivalent |
| Analytics | User behavior, feature adoption, session replay | PostHog, Amplitude, or equivalent |
| Uptime | Health endpoint monitoring, SSL expiry, DNS resolution | UptimeRobot, Better Uptime |
| APM | Response times (p50/p95/p99), database query times, external API latency | New Relic, Datadog APM |

Start with free tiers. Most observability tools have generous free tiers that cover early-stage products. Sentry (5K errors/month), PostHog (1M events/month), Axiom (500GB/month), UptimeRobot (50 monitors). You can run a full observability stack for $0/month through launch and early growth.

What to Instrument

SLO Definition

Service Level Objectives define "how good is good enough." Without SLOs, everything feels urgent. With SLOs, you know which degradations matter and which are within tolerance.
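
To make "within tolerance" concrete, an availability objective can be converted into a downtime budget. The sketch below is only an illustration: the function name and the 30-day window are assumptions, not something prescribed by any monitoring tool.

```python
def downtime_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed in the window for an availability target."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - availability_pct) / 100


# A 99.9% target leaves roughly 43 minutes per 30 days;
# a 99.5% target leaves 216 minutes (3.6 hours).
for target in (99.9, 99.5):
    print(target, round(downtime_budget_minutes(target), 1))
```

An outage that fits inside the budget is a degradation you can tolerate; one that exhausts it is the kind that should interrupt your day.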

Example SLO Table

| Operation | Target (p95) | Critical (p99) | Measurement |
| --- | --- | --- | --- |
| Page load (initial) | <1.5s | <3s | Client-side timing via analytics |
| API response (read) | <200ms | <500ms | Server-side timing via APM |
| API response (write) | <500ms | <1s | Server-side timing via APM |
| Search query | <300ms | <800ms | Server-side timing |
| AI extraction | <30s | <60s | Background job timing |
| Uptime | 99.9% | 99.5% | Health endpoint monitoring |
| Error rate | <0.1% | <1% | Error tracker event rate |

SLO breaches generate alerts. Warning-level breaches (p95 exceeded) are logged. Critical breaches (p99 exceeded) trigger PagerDuty notifications and enter the autonomous diagnosis flow.
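
The warning/critical distinction can be sketched as a small classifier over a window of latency samples. This is a minimal illustration, assuming samples are already collected in milliseconds. The `Slo` class, the `percentile` helper, and the nearest-rank percentile method are hypothetical choices for this example, not the API of any particular APM product.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Latency objective for one operation (hypothetical structure)."""
    name: str
    p95_target_ms: float    # exceeding this is a warning-level breach
    p99_critical_ms: float  # exceeding this is a critical breach


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile; assumes a non-empty sample list."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def classify(slo: Slo, samples: list[float]) -> str:
    """Return 'critical', 'warning', or 'ok' for one measurement window."""
    if percentile(samples, 99) >= slo.p99_critical_ms:
        return "critical"
    if percentile(samples, 95) >= slo.p95_target_ms:
        return "warning"
    return "ok"


# Thresholds taken from the "API response (read)" row of the SLO table.
api_read = Slo("API response (read)", p95_target_ms=200, p99_critical_ms=500)
```

Checking the critical threshold first means a window that breaches both levels is reported once, at the higher severity.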

The Autonomous Ops Agent

For a solo founder, an always-on monitoring agent is not a luxury. It is a necessity. You cannot watch dashboards 24/7. The ops agent watches for you and acts when something goes wrong.

Architecture

The ops agent runs on a separate, always-on machine (dedicated server or cloud instance). It has: full code checkout access (for diagnosis), API access to all observability tools, database read access (for health queries), and the ability to create issues and PRs.

Integration Points

The agent pulls from every service in the stack.

Three Modes of Operation

Mode 1: Passive Monitoring

Continuous polling of all integration points. Threshold tracking. Trend visualization on a dashboard. No action taken. This is the baseline: "I can see what is happening."
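
A passive polling loop might look like the sketch below. It assumes each integration point can be wrapped in a zero-argument function that returns one numeric reading; the names `poll_once`, `monitor`, and the metric keys in the usage note are made up for illustration. The loop only records readings and never notifies anyone, which is what makes it passive.

```python
import time
from typing import Callable, Iterator, Optional

# Hypothetical integration points: each callable returns one numeric reading.
Checks = dict[str, Callable[[], float]]


def poll_once(checks: Checks, thresholds: dict[str, float]) -> list[dict]:
    """Read every integration point once and flag values over their threshold."""
    readings = []
    for name, read in checks.items():
        value = read()
        limit = thresholds.get(name)  # metrics without a threshold never breach
        readings.append({
            "metric": name,
            "value": value,
            "breached": limit is not None and value > limit,
            "at": time.time(),
        })
    return readings


def monitor(checks: Checks, thresholds: dict[str, float],
            interval_s: float = 60.0,
            cycles: Optional[int] = None) -> Iterator[list[dict]]:
    """Poll forever (or for `cycles` rounds), yielding each batch of readings."""
    done = 0
    while cycles is None or done < cycles:
        yield poll_once(checks, thresholds)
        done += 1
        if cycles is None or done < cycles:
            time.sleep(interval_s)
```

A dashboard would consume the batches from `monitor(...)`, for example with checks such as `{"p95_ms": read_p95_from_apm}`, where `read_p95_from_apm` stands in for whatever client call your APM provides.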

Mode 2: Active Alerting

When thresholds breach, the agent sends notifications through the appropriate channel based on severity:

| Severity | Channel | Examples |
| --- | --- | --- |
| Low | Dashboard log | Bundle size increase, minor performance regression |
| Medium | Email | New error type, p95 SLO warning breach, job latency increase |
| High | Email + issue | p99 SLO critical breach, error rate spike |
| Critical | PagerDuty + email + issue | Service down, data integrity issue, payment processing failure |
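
The severity table translates directly into a routing map. The sketch below assumes each channel is represented by a plain function that accepts a message. The channel keys and the `route` function are illustrative names, and real senders would wrap your email, issue tracker, and PagerDuty clients.

```python
from enum import Enum
from typing import Callable


class Severity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


# Mirrors the severity table: each level fans out to one or more channels.
CHANNELS: dict[Severity, tuple[str, ...]] = {
    Severity.LOW: ("dashboard_log",),
    Severity.MEDIUM: ("email",),
    Severity.HIGH: ("email", "issue"),
    Severity.CRITICAL: ("pagerduty", "email", "issue"),
}

Sender = Callable[[str], None]


def route(severity: Severity, message: str, senders: dict[str, Sender]) -> list[str]:
    """Send the message on every channel for this severity; return the channels used."""
    delivered = []
    for channel in CHANNELS[severity]:
        senders[channel](message)
        delivered.append(channel)
    return delivered
```

Keeping the mapping in one table-shaped constant makes it easy to review alongside the severity definitions and to tighten or loosen later.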

Mode 3: Autonomous Remediation

The most advanced mode. When the agent detects a critical issue, it does not just notify. It starts working on a fix. This is the same three-tier auto-fix system from the deployment pipeline (Part 9), but triggered by runtime issues rather than deployment failures.

The agent checks out the code, reads the error context, proposes a fix, and either applies it automatically (Tier 1), creates a PR (Tier 2), or creates a detailed issue (Tier 3). The key difference from deployment diagnosis: runtime issues have live user impact, so the agent prioritizes speed for Tier 1 fixes and includes rollback recommendations for Tier 2-3.

Autonomous mode requires guardrails. The agent should never auto-commit database schema changes, security-related fixes, or anything that modifies payment processing. These are always Tier 2 or Tier 3 (human review required). The tier classification is configurable and should be conservative at launch, expanding as confidence grows.
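
One way to enforce these guardrails is to clamp the tier based on which files a proposed fix touches. The sketch below is an assumption-heavy illustration: the path patterns and the `effective_tier` function are hypothetical and would need to match your own repository layout. It uses `fnmatchcase` from the Python standard library for pattern matching.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns for areas that must always get human review.
# Deliberately broad: a false positive only costs a manual review.
PROTECTED_PATTERNS = (
    "*migrations/*",  # database schema changes
    "*auth/*",        # security-related code
    "*payments/*",    # payment processing
)


def effective_tier(proposed_tier: int, changed_paths: list[str]) -> int:
    """Raise Tier 1 (auto-apply) to Tier 2 (PR) if any protected path is touched."""
    touches_protected = any(
        fnmatchcase(path, pattern)
        for path in changed_paths
        for pattern in PROTECTED_PATTERNS
    )
    if touches_protected:
        return max(proposed_tier, 2)
    return proposed_tier
```

Starting with broad patterns and narrowing them later matches the conservative-at-launch approach: the agent earns auto-apply rights for an area only after you remove it from the protected list.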