Alert-driven monitoring

Teams usually think of infrastructure monitoring as a project to “hook up metrics” and “build dashboards”.

In fact, in almost every monitoring platform, dashboards are the first-class citizens. Teams often see them as the primary output of their work. It feels productive to see rows of glowing charts and telemetry. They make for some cool office art when you put them on a giant TV on the wall. But nobody spends their day watching graphs.

The real core of infrastructure monitoring isn’t dashboards. It’s the alerts.

While other platforms treat alerts as an afterthought, a checkbox you tick after the “real work” of visualization is done, we believe they are the entire point. Alerts are the backbone of your operations.

Start with the failure

When it’s time to set up alerts, most teams start with the metrics they already have. They look at a list of available data points and ask: “I have CPU usage for these servers. What should the threshold be? What’s a reasonable evaluation window?”

This is exactly how you end up with a noisy, untrustworthy system. To build a system you actually trust, you have to start from first principles.

Instead of looking at your metrics, look at your service. Ask yourself: what behavior actually indicates that this service is failing for a user? What behavior predicts that it is about to fail? In other words, which metric behavior indicates, or better yet predicts, a service failure?
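To make the contrast concrete, here is a minimal sketch in Python. The metric names, thresholds, and sample lists are hypothetical illustrations, not Simple Observability’s rule syntax: a resource-centric rule next to a failure-centric one.

```python
from statistics import mean

def cpu_rule(cpu_samples: list[float]) -> bool:
    # Resource-centric: fires whenever the machine is busy,
    # even if no user ever notices.
    return mean(cpu_samples) > 0.85

def user_failure_rule(error_rates: list[float], p99_latency_ms: list[float]) -> bool:
    # Failure-centric: fires only when user-visible behavior degrades,
    # i.e. requests are failing or responses blow past the latency target.
    failing = mean(error_rates) > 0.02    # more than 2% of requests erroring
    slow = mean(p99_latency_ms) > 1500    # p99 well beyond a 1.5 s target
    return failing or slow
```

The second rule is harder to write because it forces you to name what “failing for a user” means for this service, and that is exactly the point.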

Tip

Simple Observability includes a catalogue of alert templates to jumpstart your configuration. While these aren't tailored to your specific environment, they serve as an excellent foundation for the iterative hardening process described below.

The boy who cried wolf stage

When setting up alerts, teams prefer to be conservative. They don’t know the optimal thresholds yet, so they understandably play it safe and set thresholds tight enough that nothing can slip through. That caution usually produces a flood of false alarms.

At first, the notifications are manageable. But then the reality of a live system kicks in.

  • A cron job runs at 2:00 AM and spikes the CPU for three minutes. Ping…
  • A random bot crawler hits a few dead links and bumps the error rate. Ping…
  • A database backup causes a brief latency blip that clears up on its own in seconds. Ping…

You check the first few. You realize they aren’t “real” problems. You go back to work. But the pings don’t stop. They become a steady hum in the background of your day that you learn to ignore.

Eventually, your Slack channel or email folder fills up with alerts to the point where you can’t even tell which ones matter. “Is something actually wrong? Or is it just another Tuesday?”

This is alert fatigue. It’s a feeling that creeps up on teams when monitoring isn’t set up correctly.

The danger zone is when the entire team stops trusting the monitoring. This is the boy who cried wolf: when a real problem finally fires an alert, nobody reacts, and the whole system fails because nobody believes it anymore.

What to do about it

Fixing alert fatigue isn’t about finding a better math formula for your thresholds. It’s about putting clear systems in place, based on these two simple principles:

Zero tolerance for false alarms

If an alert can be ignored, then it should not be an alert.

Alerts should be actionable. If no action can or should be taken, then the alert is not needed.

Teams must enforce a strict zero-tolerance policy on false alarms. If an alert fires and no action was needed, you don’t just ignore it. You either delete it, or you refine it until it only fires when a human is actually required to intervene.
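A common refinement is to require that a condition hold for a sustained window before anyone is paged, so a three-minute cron spike like the one above never reaches a human while genuine saturation still does. A minimal Python sketch, assuming one sample per minute and made-up numbers:

```python
def sustained(samples: list[float], threshold: float, minutes: int) -> bool:
    """Fire only if the last `minutes` samples all exceed `threshold`.

    A short spike that clears on its own never produces an alert,
    while a genuinely sustained problem still does.
    """
    recent = samples[-minutes:]
    return len(recent) >= minutes and all(s > threshold for s in recent)

# A 3-minute cron spike stays silent; 20 minutes of saturation still pages.
cron_spike = [0.30] * 30 + [0.95] * 3 + [0.30] * 5
real_problem = [0.30] * 10 + [0.95] * 20
assert not sustained(cron_spike, threshold=0.90, minutes=10)
assert sustained(real_problem, threshold=0.90, minutes=10)
```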

Continual improvements

You cannot build a perfect monitoring system on day one. You don’t yet know every way your infrastructure will fail, and you can’t predict every edge case.

Instead of trying to architect the perfect system from the start, design a process that makes your system smarter over time. Just as you write unit tests to catch regressions, you should treat alert rules as living code that must be maintained.

In practice, it looks like this:

  • Weekly reviews: Meet every week to review each incident the monitoring system triggered.
  • Frequent pruning: If an alert fired and no action was needed, delete it or refine it immediately. If it didn’t help, it’s noise.
  • Root cause analysis: If a real incident happened but the monitoring system didn’t catch it until it was too late, perform a root cause analysis. What was the earliest metric that signaled this failure? Create a new alert for that specific behavior so you can catch it earlier next time.

This cycle hardens your monitoring iteratively, the same way a growing test suite hardens your code. Your goal is to make your alert rules more robust every single week while reducing the total number of incidents.
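One lightweight way to run the weekly review is to log every firing along with whether a human actually had to act, then let the numbers nominate pruning candidates. The record shape and the 50% cutoff below are illustrative assumptions, not a prescribed format:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class AlertEvent:
    rule: str           # which alert rule fired
    actionable: bool    # did a human actually have to intervene?

def pruning_candidates(events: list[AlertEvent], min_precision: float = 0.5) -> list[str]:
    """Return rules whose share of actionable firings fell below `min_precision`.

    These are the first candidates to delete or refine in the weekly review.
    """
    fired: Counter[str] = Counter(e.rule for e in events)
    useful: Counter[str] = Counter(e.rule for e in events if e.actionable)
    return [rule for rule, total in fired.items()
            if useful[rule] / total < min_precision]

week = [
    AlertEvent("cpu-high", actionable=False),
    AlertEvent("cpu-high", actionable=False),
    AlertEvent("checkout-error-rate", actionable=True),
]
print(pruning_candidates(week))   # ['cpu-high']
```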

By driving this iterative process as a team, you make alerts a core part of your engineering culture.