O11y & SRE · Observability

From alert fatigue to actionable SLOs

Replaced a noisy legacy monitoring stack with Prometheus + Grafana + OpenTelemetry. Cut page volume by 73%, improved MTTR by 4×, and gave engineers an error-budget conversation worth having.

Industry: E-Commerce
Engagement: Fixed-Bid + Run
Duration: 8 weeks
Team Size: 3 engineers
Practices: O11y · SRE Enablement

73% Fewer Pages / Week
4× Faster MTTR
61% Lower O11y Spend
12 SLOs Defined & Live

// the challenge

2,400 alerts a week. Maybe 8 mattered.

The platform team was on-call for an e-commerce site doing $240M GMV/year. Their legacy stack — a paid APM + scattered CloudWatch rules + Slack integrations bolted together by 4 different teams over 5 years — was firing roughly 2,400 alerts per week. Most of them were ignored.

The downstream effects:

The team needed an o11y reset — not another tool added on top, but a re-architecture grounded in golden signals and SLO-driven alerting.

// our approach

SLOs first. Alerts second.

We started by sitting with the product team to define what "the site working" actually meant — 12 user-journey SLOs, each with a target and an error budget. Only then did we touch instrumentation, alerting, or dashboards.

Phase 01

SLO Definition

Workshops with product + engineering to define 12 user-journey SLOs. Targets, windows, error budgets agreed.

Weeks 1–2
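To make "targets, windows, error budgets" concrete, here is a minimal sketch of how one user-journey SLO translates into an error budget. The journey name, the 99.9% target, and the 30-day window are illustrative assumptions, not the values agreed in the workshops.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A user-journey SLO. All values used below are illustrative."""
    journey: str      # hypothetical journey name, e.g. "checkout"
    target: float     # required fraction of good events, e.g. 0.999
    window_days: int  # rolling compliance window in days

    @property
    def error_budget(self) -> float:
        """Fraction of events (or time) allowed to fail within the window."""
        return 1.0 - self.target

    def budget_minutes(self) -> float:
        """Error budget expressed as minutes of full outage per window."""
        return self.error_budget * self.window_days * 24 * 60


checkout = SLO(journey="checkout", target=0.999, window_days=30)
print(f"{checkout.journey}: {checkout.error_budget:.3%} budget, "
      f"{checkout.budget_minutes():.1f} min per {checkout.window_days} days")
# checkout: 0.100% budget, 43.2 min per 30 days
```

Expressing the budget in minutes tends to make the trade-off easier to discuss with product: a 99.9% target over 30 days leaves roughly 43 minutes of total failure to spend.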
Phase 02

Stack Rollout

Prometheus + Grafana + AlertManager + Loki deployed via Helm. OpenTelemetry SDK across 30+ services.

Weeks 2–5
Phase 03

SLO-Driven Alerts

Multi-window, multi-burn-rate alerts replaced threshold rules. Curated dashboards per service domain.

Weeks 5–7
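The idea behind multi-window, multi-burn-rate alerting can be sketched in a few lines. This is a simplified illustration, not the rules shipped in this engagement: the window pairs and thresholds are the commonly cited starting points for a 30-day SLO window, and the error ratios are made-up inputs that would normally come from Prometheus queries.

```python
from typing import Dict, Optional

# (long window, short window, burn-rate threshold, action).
# Assumed starting values for a 30-day SLO window; tune per service.
RULES = [
    ("1h", "5m", 14.4, "page"),
    ("6h", "30m", 6.0, "page"),
    ("3d", "6h", 1.0, "ticket"),
]


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent."""
    return error_ratio / (1.0 - slo_target)


def evaluate(error_ratios: Dict[str, float], slo_target: float) -> Optional[str]:
    """Return the first action whose long AND short windows both exceed the threshold."""
    for long_w, short_w, threshold, action in RULES:
        if (burn_rate(error_ratios[long_w], slo_target) > threshold
                and burn_rate(error_ratios[short_w], slo_target) > threshold):
            return action
    return None


# Hypothetical error ratios per lookback window for a 99.9% SLO.
sustained = {"5m": 0.03, "30m": 0.025, "1h": 0.02, "6h": 0.008, "3d": 0.002}
blip = {"5m": 0.05, "30m": 0.009, "1h": 0.004, "6h": 0.0007, "3d": 0.0001}

print(evaluate(sustained, slo_target=0.999))  # page
print(evaluate(blip, slo_target=0.999))       # None
```

The short window keeps a recovered incident from paging after the fact, while the long window keeps a brief spike (the `blip` case) from paging at all. A fixed threshold rule would fire on both.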
Phase 04

SRE Enablement

On-call rotation redesign, blameless post-mortem template, error-budget review cadence, runbook library.

Weeks 7–8
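An error-budget review usually starts from one number: how much of the budget is left in the current window. The sketch below shows one way to compute it from event counts. The event counts are invented, and the policy thresholds in `review_action` are hypothetical placeholders rather than the policy adopted by this team.

```python
def budget_remaining(good_events: int, total_events: int, target: float) -> float:
    """Fraction of the window's error budget still unspent (negative if overspent)."""
    if total_events == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_bad = (1.0 - target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


def review_action(remaining: float) -> str:
    """Map remaining budget to a review outcome. Thresholds are hypothetical."""
    if remaining <= 0.0:
        return "freeze risky releases"
    if remaining < 0.25:
        return "prioritize reliability work"
    return "ship as normal"


# Illustrative numbers: 600 failed requests out of 1,000,000 against a 99.9% target.
remaining = budget_remaining(good_events=999_400, total_events=1_000_000, target=0.999)
print(f"{remaining:.0%} of budget left -> {review_action(remaining)}")
# 40% of budget left -> ship as normal
```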
// architecture

Unified observability pipeline

   App Services (30+)
            │
            │  OpenTelemetry SDK
            ▼
   ┌──────────────────────┐
   │  OTel Collector      │
   └─┬──────┬──────┬──────┘
     │      │      │
     ▼      ▼      ▼
  Prom   Loki   Tempo
  metrics logs   traces
     │      │      │
     └──────┼──────┘
            ▼
       Grafana
       (dashboards + SLO views)
            │
            ▼
   AlertManager  ──►  PagerDuty / Slack
   (multi-burn-rate)

// technology stack

Tools we shipped with

Prometheus: Metrics
Grafana: Dashboards
AlertManager: Alerting
Loki: Logging
Tempo: Tracing
OpenTelemetry: Instrumentation
Pyrra / Sloth: SLO Generation
PagerDuty: On-Call
Helm: Packaging
Terraform: IaC

// outcomes

What changed for the on-call team

Metric                    Before              After             Δ
Alerts / week             ~2,400              ~640              −73%
Actionable page rate      ~6%                 ~78%              +72 pts
MTTR (p50)                52 min              13 min            −75%
Monthly o11y spend        $78K                $31K              −61%
Dashboards in use         ~200 (~30 active)   48 (all active)   Focused
SLOs live in production   0                   12                New

"The first week after cutover, my on-call shift got 4 pages. Four. Three were real. I almost called NodeOps360 to ask if something was broken with the alerting. Turns out the alerting was finally working."
Staff SRE · E-Commerce Platform
// drowning in alerts?

Let's reset your signal-to-noise

SLO-driven o11y isn't a vendor — it's a discipline. We can help you build it.
