Replaced a noisy legacy monitoring stack with Prometheus + Grafana + OpenTelemetry. Cut weekly alert volume by 73%, brought median MTTR down 4× (52 to 13 minutes), and gave engineers an error-budget conversation worth having.
The platform team was on-call for an e-commerce site doing $240M GMV/year. Their legacy stack — a paid APM + scattered CloudWatch rules + Slack integrations bolted together by 4 different teams over 5 years — was firing roughly 2,400 alerts per week. Most of them were ignored.
The downstream effects:
The team needed an o11y reset — not another tool added on top, but a re-architecture grounded in golden signals and SLO-driven alerting.
We started by sitting with the product team to define what "the site working" actually meant — 12 user-journey SLOs, each with a target and an error budget. Only then did we touch instrumentation, alerting, or dashboards.
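To make the "target plus error budget" idea concrete, here is a minimal Python sketch of how one SLO and its remaining budget might be modelled. The `Slo` class, the `budget_remaining` helper, the "checkout" journey name, and the 99.9% / 30-day figures are illustrative assumptions, not the targets agreed in this engagement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A user-journey SLO: a success-ratio target over a rolling window."""

    name: str          # hypothetical journey name, e.g. "checkout"
    target: float      # e.g. 0.999 means 99.9% of requests must succeed
    window_days: int   # rolling window length in days

    def __post_init__(self) -> None:
        # A target of exactly 1.0 leaves no error budget to spend.
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")

    @property
    def error_budget(self) -> float:
        """Fraction of requests allowed to fail within the window."""
        return 1.0 - self.target


def budget_remaining(slo: Slo, total_requests: int, failed_requests: int) -> float:
    """Return the unspent share of the error budget (1.0 = untouched).

    A negative result means the budget for the window is overspent.
    """
    if total_requests == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_failures = total_requests * slo.error_budget
    return 1.0 - failed_requests / allowed_failures


# Assumed example values: 99.9% over 30 days.
checkout = Slo(name="checkout", target=0.999, window_days=30)
remaining = budget_remaining(checkout, total_requests=1_000_000, failed_requests=250)
print(f"{checkout.name}: {remaining:.0%} of the error budget left")
```

A number like this is what an error-budget review would look at: plenty of budget left supports shipping faster, while an overspent budget argues for reliability work first.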
Weeks 1–2: Workshops with product + engineering to define 12 user-journey SLOs. Targets, windows, and error budgets agreed.

Weeks 2–5: Prometheus + Grafana + AlertManager + Loki deployed via Helm. OpenTelemetry SDK rolled out across 30+ services.

Weeks 5–7: Multi-window, multi-burn-rate alerts replaced threshold rules. Curated dashboards per service domain.

Weeks 7–8: On-call rotation redesign, blameless post-mortem template, error-budget review cadence, runbook library.

            App Services (30+)
                     │
                     │  OpenTelemetry SDK
                     ▼
            ┌──────────────────────┐
            │    OTel Collector    │
            └─┬──────┬──────┬──────┘
              │      │      │
              ▼      ▼      ▼
             Prom   Loki  Tempo
           metrics  logs  traces
              │      │      │
              └──────┼──────┘
                     ▼
     Grafana (dashboards + SLO views)
                     │
                     ▼
               AlertManager ──► PagerDuty / Slack
               (multi-burn-rate)

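The multi-window, multi-burn-rate alerting mentioned above can be sketched in a few lines of Python. This is an illustration of the decision logic only. In a stack like this one the equivalent conditions would normally live in Prometheus alerting rules. The window pairs and thresholds (14.4, 6, 1) are the commonly cited values from the Google SRE Workbook for a 30-day SLO and are assumed here, not taken from this project. `BurnRateRule`, `RULES`, and `evaluate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BurnRateRule:
    long_window: str    # window that proves the burn is significant
    short_window: str   # window that proves the burn is still happening
    threshold: float    # burn rate both windows must exceed
    severity: str       # "page" or "ticket"


# Assumed thresholds (SRE Workbook style), ordered from most to least urgent.
RULES = (
    BurnRateRule("1h", "5m", 14.4, "page"),
    BurnRateRule("6h", "30m", 6.0, "page"),
    BurnRateRule("3d", "6h", 1.0, "ticket"),
)


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo_target)


def evaluate(error_ratios: Mapping[str, float], slo_target: float) -> Optional[str]:
    """Return the severity of the first rule that fires, or None.

    `error_ratios` maps a window label (e.g. "1h") to the observed
    fraction of failed requests over that window.
    """
    for rule in RULES:
        long_burn = burn_rate(error_ratios[rule.long_window], slo_target)
        short_burn = burn_rate(error_ratios[rule.short_window], slo_target)
        # Both windows must exceed the threshold, so a past spike that has
        # already recovered (low short-window burn) does not alert.
        if long_burn > rule.threshold and short_burn > rule.threshold:
            return rule.severity
    return None


# Assumed sample data: a sharp, ongoing outage against a 99.9% SLO.
observed = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.004, "3d": 0.0005}
print(evaluate(observed, slo_target=0.999))
```

Requiring both a long and a short window is what separates this from a plain threshold rule: the long window filters out blips, and the short window stops the alert soon after the problem is fixed.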
| Metric | Before | After | Δ |
|---|---|---|---|
| Alerts / week | ~2,400 | ~640 | −73% |
| Actionable page rate | ~6% | ~78% | +72 pts |
| MTTR (p50) | 52 min | 13 min | −75% |
| Monthly o11y spend | $78K | $31K | −60% |
| Dashboards in use | ~200 (~30 active) | 48 (all active) | Focused |
| SLOs live in production | 0 | 12 | New |
SLO-driven o11y isn't a vendor — it's a discipline. We can help you build it.