Keeping It Running
Observability stack, SLO monitoring, and an autonomous ops agent that diagnoses problems the moment they appear, not when you notice them.
The Observability Stack
Observability is not optional, even for a solo founder. Without it, you learn about production issues from customer complaints. With it, you learn before customers notice.
Five Pillars
| Pillar | What It Monitors | Tool Category |
|---|---|---|
| Error tracking | Unhandled exceptions, error rates, error trends | Sentry or equivalent |
| Logging | Structured logs from all services, searchable, queryable | Axiom, Datadog, or equivalent |
| Analytics | User behavior, feature adoption, session replay | PostHog, Amplitude, or equivalent |
| Uptime | Health endpoint monitoring, SSL expiry, DNS resolution | UptimeRobot, Better Uptime |
| APM | Response times (p50/p95/p99), database query times, external API latency | New Relic, Datadog APM |
What to Instrument
- Every API route: Response time, status code, tenant ID. Structured log per request.
- Every database query: Query time, rows returned. Flag queries over 100ms.
- Every AI/LLM call: Provider, model, tokens in/out, latency, cost. Per tenant.
- Every background job: Start time, end time, success/failure, retry count.
- Every file operation: Upload size, download latency, storage provider response time.
- Client-side: Page load time, Largest Contentful Paint (LCP), Interaction to Next Paint (INP), JavaScript errors.
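As a rough illustration of the first two items, the sketch below emits one structured log line per API request and flags any database query slower than 100ms. The function names, field names, and the `timed` helper are hypothetical choices for this example, not part of any particular logging tool.

```python
import json
import time
from contextlib import contextmanager

SLOW_QUERY_MS = 100  # "flag queries over 100ms", from the list above


def request_log_line(route, status, duration_ms, tenant_id):
    """Build one structured (JSON) log line for a single API request."""
    return json.dumps(
        {
            "event": "api_request",  # hypothetical event name
            "route": route,
            "status": status,
            "duration_ms": round(duration_ms, 1),
            "tenant_id": tenant_id,
        },
        sort_keys=True,
    )


def query_record(sql, duration_ms, rows):
    """Describe one database query and mark it slow if it exceeds the threshold."""
    return {
        "event": "db_query",  # hypothetical event name
        "sql": sql,
        "duration_ms": round(duration_ms, 1),
        "rows": rows,
        "slow": duration_ms > SLOW_QUERY_MS,
    }


@contextmanager
def timed(result):
    """Store the wall-clock duration of the wrapped block in result['duration_ms']."""
    start = time.perf_counter()
    try:
        yield
    finally:
        result["duration_ms"] = (time.perf_counter() - start) * 1000.0


if __name__ == "__main__":
    timing = {}
    with timed(timing):
        rows = [1, 2, 3]  # stand-in for a real database call
    print(json.dumps(query_record("SELECT 1", timing["duration_ms"], len(rows))))
    print(request_log_line("/api/items", 200, 42.0, "tenant_123"))
```

Because each line is a self-describing JSON object, a log tool can filter on `tenant_id` or `slow` without parsing free text.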
SLO Definition
Service Level Objectives define "how good is good enough." Without SLOs, everything feels urgent. With SLOs, you know which degradations matter and which are within tolerance.
Example SLO Table
| Operation | Target (p95) | Critical (p99) | Measurement |
|---|---|---|---|
| Page load (initial) | <1.5s | <3s | Client-side timing via analytics |
| API response (read) | <200ms | <500ms | Server-side timing via APM |
| API response (write) | <500ms | <1s | Server-side timing via APM |
| Search query | <300ms | <800ms | Server-side timing |
| AI extraction | <30s | <60s | Background job timing |
| Uptime | 99.9% | 99.5% | Health endpoint monitoring |
| Error rate | <0.1% | <1% | Error tracker event rate |
SLO breaches generate alerts. Warning-level breaches (p95 exceeded) are logged. Critical breaches (p99 exceeded) trigger PagerDuty notifications and enter the autonomous diagnosis flow.
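A minimal sketch of how a latency SLO from the table could be evaluated, assuming the "Target (p95)" value is compared against the measured p95 and the "Critical (p99)" value against the measured p99. The nearest-rank percentile method and the function names are choices made for this example; an APM tool may compute percentiles differently.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def classify(samples_ms, target_p95_ms, critical_p99_ms):
    """Return 'critical', 'warning', or 'ok' for one operation's latency samples."""
    if percentile(samples_ms, 99) >= critical_p99_ms:
        return "critical"  # p99 exceeded
    if percentile(samples_ms, 95) >= target_p95_ms:
        return "warning"  # p95 exceeded
    return "ok"


if __name__ == "__main__":
    # "API response (read)" row: p95 < 200ms, p99 < 500ms
    window = [100] * 90 + [250] * 9 + [400]
    print(classify(window, target_p95_ms=200, critical_p99_ms=500))
```

The thresholds are passed in rather than hard-coded so the same check can run once per row of the SLO table.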
The Autonomous Ops Agent
For a solo founder, an always-on monitoring agent is not a luxury. It is a necessity. You cannot watch dashboards 24/7. The ops agent watches for you and acts when something goes wrong.
Architecture
The ops agent runs on a separate, always-on machine (a dedicated server or cloud instance). It has full code checkout access (for diagnosis), API access to all observability tools, database read access (for health queries), and the ability to create issues and PRs.
Integration Points
The agent pulls from every service in the stack:
- Hosting platform: Deployment status, build logs, runtime logs, function invocations.
- Database: Connection count, query performance, replication lag, storage usage.
- Error tracking: New errors, error rate trends, release correlation.
- Logging: Structured log queries, anomaly detection.
- Analytics: Feature flag state, session replay links for error contexts.
- Background jobs: Queue depth, failure rates, retry counts, latency trends.
- Auth: Active sessions, failed auth attempts, MFA adoption.
- Email: Delivery rates, bounce rates, spam reports.
- Payments: Failed charges, MRR, churn indicators.
- AI/LLM providers: Token usage, cost per provider, error rates, latency trends.
Three Modes of Operation
Mode 1: Passive Monitoring
Continuous polling of all integration points. Threshold tracking. Trend visualization on a dashboard. No action taken. This is the baseline: "I can see what is happening."
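Passive monitoring could be sketched as a polling loop that reads each integration point, compares the value to a threshold, and records the result without acting on it. The `Probe` and `Snapshot` types, the probe name, and the threshold below are hypothetical; real probes would call each service's own API.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Probe:
    name: str                  # hypothetical metric name, e.g. "jobs.queue_depth"
    read: Callable[[], float]  # fetches the current value from an integration point
    threshold: float           # values above this count as a breach


@dataclass
class Snapshot:
    name: str
    value: float
    breached: bool


def poll_once(probes: List[Probe]) -> List[Snapshot]:
    """Read every probe once and record whether its threshold is exceeded."""
    return [Snapshot(p.name, v, v > p.threshold) for p in probes for v in [p.read()]]


def run(probes, history: Dict[str, List[Snapshot]], cycles, interval_s=0.0, sleep=time.sleep):
    """Poll for a fixed number of cycles, appending snapshots to history. No action is taken."""
    for _ in range(cycles):
        for snap in poll_once(probes):
            history.setdefault(snap.name, []).append(snap)
        sleep(interval_s)


if __name__ == "__main__":
    depth = iter([1.0, 5.0, 2.0])  # stand-in readings
    history: Dict[str, List[Snapshot]] = {}
    run([Probe("jobs.queue_depth", lambda: next(depth), 3.0)], history, cycles=3)
    for snap in history["jobs.queue_depth"]:
        print(snap)
```

The accumulated `history` is what a dashboard would plot as a trend. Mode 2 starts where this sketch stops, by reacting to `breached`.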
Mode 2: Active Alerting
When thresholds breach, the agent sends notifications through the appropriate channel based on severity:
| Severity | Channel | Examples |
|---|---|---|
| Low | Dashboard log | Bundle size increase, minor performance regression |
| Medium | | New error type, p95 SLO warning breach, job latency increase |
| High | Email + issue | p99 SLO critical breach, error rate spike |
| Critical | PagerDuty + email + issue | Service down, data integrity issue, payment processing failure |
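The severity table maps naturally onto a routing dictionary. The sketch below assumes each channel is a plain callable supplied by the caller. The channel keys are hypothetical, and the Medium route is a placeholder assumption, because the table does not state a channel for that severity.

```python
from enum import Enum


class Severity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


# Channel keys are hypothetical names for this example.
ROUTES = {
    Severity.LOW: ["dashboard_log"],
    Severity.MEDIUM: ["dashboard_log"],  # ASSUMPTION: channel not given in the table
    Severity.HIGH: ["email", "issue"],
    Severity.CRITICAL: ["pagerduty", "email", "issue"],
}


def dispatch(severity, message, senders):
    """Send message through every channel configured for severity.

    senders maps a channel key to a callable that takes the message.
    Returns the list of channels used, in order.
    """
    delivered = []
    for channel in ROUTES[severity]:
        senders[channel](message)
        delivered.append(channel)
    return delivered


if __name__ == "__main__":
    outbox = []
    senders = {
        name: (lambda msg, name=name: outbox.append((name, msg)))
        for name in ("dashboard_log", "email", "issue", "pagerduty")
    }
    print(dispatch(Severity.HIGH, "error rate spike", senders))
    print(outbox)
```

Keeping the routes in data rather than in branching code means a change to the escalation policy is a one-line edit.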
Mode 3: Autonomous Remediation
The most advanced mode. When the agent detects a critical issue, it does not just notify. It starts working on a fix. This is the same three-tier auto-fix system from the deployment pipeline (Part 9), but triggered by runtime issues rather than deployment failures.
The agent checks out the code, reads the error context, proposes a fix, and either applies it automatically (Tier 1), creates a PR (Tier 2), or creates a detailed issue (Tier 3). The key difference from deployment diagnosis: runtime issues have live user impact, so the agent prioritizes speed for Tier 1 fixes and includes rollback recommendations for Tier 2-3.
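The tier handling described above can be sketched as a small mapping from tier to action. How a tier is chosen belongs to the deployment pipeline's rules and is not reproduced here, so the tier is taken as an input. The `Tier` and `Plan` names and the action strings are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    AUTO_APPLY = 1    # fix is applied automatically
    PULL_REQUEST = 2  # fix is proposed as a PR
    ISSUE = 3         # a detailed issue is created


@dataclass(frozen=True)
class Plan:
    action: str                    # hypothetical action label
    include_rollback_advice: bool  # Tier 2-3 include rollback recommendations


def plan_for(tier: Tier) -> Plan:
    """Translate a remediation tier into the action the agent should take."""
    if tier == Tier.AUTO_APPLY:
        # Runtime issues affect live users, so Tier 1 favors speed over review.
        return Plan("apply_fix", include_rollback_advice=False)
    if tier == Tier.PULL_REQUEST:
        return Plan("open_pr", include_rollback_advice=True)
    return Plan("open_issue", include_rollback_advice=True)


if __name__ == "__main__":
    for tier in Tier:
        print(tier.name, plan_for(tier))
```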