NodeOps360 — From Alert Fatigue to Actionable SLOs

From alert fatigue to actionable SLOs

Replaced a noisy legacy monitoring stack with Prometheus + Grafana + OpenTelemetry. Cut page volume by 73%, improved MTTR by 4×, and gave engineers an error-budget conversation worth having.

Industry: E-Commerce
Engagement: Fixed-Bid + Run
Duration: 8 weeks
Team Size: 3 engineers
Practices: O11y · SRE Enablement

73% Fewer Pages / Week
4× Faster MTTR
61% Lower O11y Spend
12 SLOs Defined & Live

2,400 alerts a week. Maybe 8 mattered.

The platform team was on-call for an e-commerce site doing $240M GMV/year. Their legacy stack — a paid APM + scattered CloudWatch rules + Slack integrations bolted together by 4 different teams over 5 years — was firing roughly 2,400 alerts per week. Most of them were ignored.

The downstream effects:

Engineers had silenced 60% of channels — including the ones that actually mattered during real incidents.
MTTR was climbing because the signal-to-noise ratio made root-cause analysis a manual scavenger hunt across 4 tools.
The monthly observability bill was $78K and growing, but nobody trusted the dashboards.
There was no concept of SLOs. Product asked "is the site working?" and nobody had a confident answer.

The team needed an o11y reset — not another tool added on top, but a re-architecture grounded in golden signals and SLO-driven alerting.

SLOs first. Alerts second.

We started by sitting with the product team to define what "the site working" actually meant — 12 user-journey SLOs, each with a target and an error budget. Only then did we touch instrumentation, alerting, or dashboards.
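To make "a target and an error budget" concrete, here is a minimal sketch of the arithmetic behind a single SLO. The journey name, the 99.9% target, and the 30-day window are illustrative assumptions, not the values agreed in the workshops.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """One user-journey SLO. All example values below are hypothetical."""
    name: str          # e.g. "checkout" (illustrative journey name)
    target: float      # fraction of good events required, e.g. 0.999
    window_days: int   # rolling compliance window

    @property
    def error_budget(self) -> float:
        """Fraction of events that may fail without breaching the SLO."""
        return 1.0 - self.target

    def budget_minutes(self) -> float:
        """Error budget expressed as minutes of total outage per window."""
        return self.error_budget * self.window_days * 24 * 60

    def budget_remaining(self, total_events: int, failed_events: int) -> float:
        """Share of the budget still unspent: 1.0 is untouched, <= 0 is exhausted."""
        if total_events == 0:
            return 1.0
        allowed_failures = self.error_budget * total_events
        return 1.0 - failed_events / allowed_failures


# Assumed example: a 99.9% checkout SLO over a 30-day window.
checkout = Slo(name="checkout", target=0.999, window_days=30)
```

Under these assumptions, `checkout.budget_minutes()` works out to roughly 43 minutes per window, and 250 failures out of 1,000,000 requests would leave about three quarters of the budget unspent. That remaining-budget figure is the number an error-budget review can be built around.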

Phase 01: SLO Definition (Weeks 1–2)
Workshops with product + engineering to define 12 user-journey SLOs. Targets, windows, error budgets agreed.

Phase 02: Stack Rollout (Weeks 2–5)
Prometheus + Grafana + AlertManager + Loki deployed via Helm. OpenTelemetry SDK across 30+ services.

Phase 03: SLO-Driven Alerts (Weeks 5–7)
Multi-window, multi-burn-rate alerts replaced threshold rules. Curated dashboards per service domain.

Phase 04: SRE Enablement (Weeks 7–8)
On-call rotation redesign, blameless post-mortem template, error-budget review cadence, runbook library.
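The multi-window, multi-burn-rate idea from Phase 03 can be sketched in a few lines. The window pairs and thresholds below are the values commonly cited for a 30-day SLO in Google's SRE Workbook. They are an assumption here, not necessarily the rules deployed in this engagement, and in practice the logic would live in Prometheus alerting rules rather than application code.

```python
from typing import Mapping, NamedTuple, Optional


class BurnRule(NamedTuple):
    long_window: str   # window that proves the burn is significant
    short_window: str  # window that proves the burn is still happening
    threshold: float   # burn-rate multiple of the "exactly on budget" rate
    severity: str      # "page" or "ticket"


# Assumed thresholds (commonly cited defaults for a 30-day window).
RULES = (
    BurnRule("1h", "5m", 14.4, "page"),
    BurnRule("6h", "30m", 6.0, "page"),
    BurnRule("3d", "6h", 1.0, "ticket"),
)


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / (1.0 - slo_target)


def evaluate(error_ratios: Mapping[str, float], slo_target: float) -> Optional[str]:
    """Return the severity of the first rule whose long AND short windows
    both exceed the threshold, or None if nothing should fire."""
    for rule in RULES:
        long_burn = burn_rate(error_ratios[rule.long_window], slo_target)
        short_burn = burn_rate(error_ratios[rule.short_window], slo_target)
        if long_burn > rule.threshold and short_burn > rule.threshold:
            return rule.severity
    return None
```

Requiring both windows is what removes most of the noise. A brief spike that has already recovered fails the short-window check, while a slow leak that never trips a static threshold still surfaces through the 3-day rule as a ticket.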

Unified observability pipeline

   App Services (30+)
            │
            │  OpenTelemetry SDK
            ▼
   ┌──────────────────────┐
   │  OTel Collector      │
   └─┬──────┬──────┬──────┘
     │      │      │
     ▼      ▼      ▼
  Prom   Loki   Tempo
  metrics logs   traces
     │      │      │
     └──────┼──────┘
            ▼
       Grafana
       (dashboards + SLO views)
            │
            ▼
   AlertManager  ──►  PagerDuty / Slack
   (multi-burn-rate)

What changed for the on-call team

Metric	Before	After	Δ
Alerts / week	~2,400	~640	−73%
Actionable page rate	~6%	~78%	+72 pts
MTTR (p50)	52 min	13 min	−75%
Monthly o11y spend	$78K	$31K	−61%
Dashboards in use	~200 (~30 active)	48 (all active)	Focused
SLOs live in production	0	12	New
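The Δ column mixes three kinds of change: relative change, speed-up factor, and percentage points. The short sketch below shows how the alert, page-rate, and MTTR rows relate, using only the rounded figures from the table. The function names are illustrative.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change in percent; negative means a reduction."""
    return (after - before) / before * 100.0


def speedup(before: float, after: float) -> float:
    """How many times smaller the 'after' value is (e.g. 4.0 means 4x faster)."""
    return before / after


def reduction_from_speedup(factor: float) -> float:
    """Percent reduction implied by a k-times speed-up: (1 - 1/k) * 100."""
    return (1.0 - 1.0 / factor) * 100.0


def point_change(before_pct: float, after_pct: float) -> float:
    """Difference between two percentages, in percentage points."""
    return after_pct - before_pct


alerts_delta = pct_change(2400, 640)   # about -73%
mttr_factor = speedup(52, 13)          # 4.0, the "4x faster MTTR" headline
mttr_delta = pct_change(52, 13)        # -75.0, the same change as a percentage
actionable_delta = point_change(6, 78) # 72 percentage points
```

The 4× MTTR headline and the −75% table entry describe the same improvement. The actionable page rate is reported in points because both of its ends are already percentages.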

"The first week after cutover, my on-call shift got 4 pages. Four. Three were real. I almost called NodeOps360 to ask if something was broken with the alerting. Turns out the alerting was finally working."

Staff SRE · E-Commerce Platform
