Error Alerting Best Practices (Beat Alert Fatigue)

Alert fatigue destroys on-call rotations. When every deployment triggers a dozen warnings, your team stops reading them — and the day a real regression lands, it gets missed in the noise. Error alerting best practices are about designing rules that page you for what actually matters: newly introduced bugs and unexpected spikes in existing errors. Everything else stays quiet.

Most teams get alerting backwards. They start with "alert on everything" and gradually tune down, training their team to ignore most alerts along the way. Instead, start conservative, add a rule only when you have evidence it matters, and make every rule earn its place in your on-call rotation.

The anatomy of a good alert rule

A production alert has three jobs: detect a real problem, prioritize it correctly, and route it to someone who can fix it. Most alert fatigue comes from rules that fail one of these three.

An alert rule needs:

Scope — which errors, projects, or environments does this apply to?
Threshold — what's the bar for alerting? (New issue, 10 errors/min, 2% spike?)
Lookback window — over what time period do you measure this?
Routing — who gets paged?

Bad rules get at least one of these wrong. A rule that fires on "any new error in staging" has poor scope. One that pages on a single event lacks a proper threshold. One that rolls up ten unrelated errors into one alert has poor signal-to-noise.

Here's what good looks like:

{
  "name": "New high-severity bug in production",
  "scope": {
    "projects": ["api", "web"],
    "environments": ["production"]
  },
  "condition": "New issue with level >= error",
  "routing": "team-oncall"
}

This rule is specific, has clear scope, and fires only when something genuinely new lands. It won't page you for the 1,000th occurrence of a known issue or for a warning in staging.
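To make the scope and condition checks concrete, here is a minimal sketch of how a rule like this could be evaluated against an incoming event. It is not LightTrace's rule engine: the event shape (`project`, `environment`, `level`, `isNewIssue`), the severity ordering, and the `minLevel` / `newIssuesOnly` fields are assumptions made for illustration.

```javascript
// Severity order, lowest to highest. Assumed for this sketch.
const LEVELS = ["debug", "info", "warning", "error", "fatal"];

// Structured version of the rule above. minLevel and newIssuesOnly are
// hypothetical field names standing in for "New issue with level >= error".
const rule = {
  name: "New high-severity bug in production",
  scope: { projects: ["api", "web"], environments: ["production"] },
  minLevel: "error",
  newIssuesOnly: true,
  routing: "team-oncall",
};

// Returns true only when the event is in scope, severe enough, and
// (if the rule asks for it) the first event of a new issue.
function ruleMatches(rule, event) {
  const inScope =
    rule.scope.projects.includes(event.project) &&
    rule.scope.environments.includes(event.environment);
  const severeEnough =
    LEVELS.indexOf(event.level) >= LEVELS.indexOf(rule.minLevel);
  const noveltyOk = !rule.newIssuesOnly || event.isNewIssue === true;
  return inScope && severeEnough && noveltyOk;
}
```

With this shape, a new `error` in `api` production matches, while a staging event, a `warning`, or another occurrence of a known issue does not.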

New-issue alerts vs frequency alerts

The two main alert types do different things, and you need both.

New-issue alerts fire the moment a novel error lands in production. Keep them sensitive: you want to know about regressions immediately. But "new issue" is only meaningful if your fingerprinting groups events well. If it doesn't, one bug shows up as fifty separate issues and you get fifty alerts.

Frequency alerts trigger when a known error suddenly spikes — normally it's 5/min, now it's 100/min. These catch production incidents that your team can actually respond to.

Alert type	When to use	Risk
New issue	Catch regressions immediately	False positives if fingerprinting is poor; alerts on trivial new paths
Frequency spike	Catch degradation in existing errors	Misses slow burns; requires tuning per error; can trigger on growth in background noise alone
Threshold breach	Page when error count exceeds a ceiling	Too blunt; fires even if the error is known and acceptable

A real strategy combines all three rule types from the table, tuned for what actually breaks your business. A payment retry service gets a frequency alert. A minor logging error gets nothing unless it spikes 50x.
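The difference between the first two alert types can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular tool groups errors: the fingerprint here is simply error type plus top stack frame, and `spikeFactor` and `minCount` are hypothetical tuning knobs.

```javascript
// Group by error type and top stack frame, not by message: messages often
// embed per-request values (IDs, timestamps) that split one bug into many issues.
function fingerprint(event) {
  return `${event.type}|${event.topFrame}`;
}

// countsPerMin: Map of fingerprint -> events seen in the last minute
// baselines:    Map of fingerprint -> typical events per minute for known issues
function detectAlerts(countsPerMin, baselines, { spikeFactor = 10, minCount = 20 } = {}) {
  const alerts = [];
  for (const [fp, count] of countsPerMin) {
    if (!baselines.has(fp)) {
      // Never seen before: this is a new issue, regardless of volume.
      alerts.push({ kind: "new-issue", fingerprint: fp, count });
    } else if (count >= minCount && count >= baselines.get(fp) * spikeFactor) {
      // Known issue, but far above its usual rate.
      alerts.push({ kind: "frequency-spike", fingerprint: fp, count });
    }
  }
  return alerts;
}
```

With a baseline of 5/min, a jump to 100/min clears both the 10x factor and the minimum count, so it is reported as a spike, while a known error drifting from 5/min to 6/min stays quiet.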

Scope your alerts to reduce noise

The biggest mistake is alerting on everything, then wondering why your team ignores pages. Instead, alert on what has business impact.

Start with production only. Staging breaks constantly — developers are testing. Alerting on staging errors trains people to ignore warnings, period.

Second, consider which errors actually matter. If your app throws 100 different errors daily, most are harmless edge cases. Alert only on:

Errors affecting paid features (checkout, billing, auth)
Errors affecting user-facing flows (home page, dashboard, critical path)
Errors that correlate with loss of service (database unavailable, service down)
Errors that leak sensitive data (caught by data scrubbing)

Everything else deserves visibility in your dashboard, but not a 2 a.m. page.

Avoid alerting on framework noise — ResizeObserver loop limit exceeded, unhandled promise from a third-party SDK, 404s on static assets. These flood your feed and are almost never actionable. Fingerprint them as expected and leave them unalerted.
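One way to encode these scoping decisions is a single predicate that runs before any paging rule. The sketch below is illustrative only: the event fields (`message`, `status`, `path`, `environment`, `tags`), the `origin` tag, and the impact labels are hypothetical names, not part of any SDK.

```javascript
// Known-noise checks based on the examples above. Field names are assumed.
const NOISE_CHECKS = [
  (e) => /ResizeObserver loop limit exceeded/.test(e.message ?? ""),
  (e) => e.status === 404 && /\.(css|js|png|jpe?g|svg|ico|woff2?)$/.test(e.path ?? ""),
  (e) => e.tags?.origin === "third-party-sdk",
];

// Hypothetical impact labels mirroring the four categories listed above.
const PAGEABLE_IMPACTS = new Set([
  "paid-feature",
  "user-facing",
  "service-loss",
  "data-leak",
]);

// True only for production events that are not known noise and that
// carry an impact tag worth waking someone up for.
function shouldPage(event) {
  if (event.environment !== "production") return false;
  if (NOISE_CHECKS.some((isNoise) => isNoise(event))) return false;
  return PAGEABLE_IMPACTS.has(event.tags?.impact);
}
```

Events that fail this check still belong on the dashboard. The predicate only decides whether they are allowed to page.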

Set thresholds that match your risk tolerance

A frequency threshold that's too low pages you constantly. One that's too high misses the incident. The right number depends on your business and your error types.

For a critical API endpoint, you might alert when error rate hits 1% over 5 minutes. For a background job that runs daily, one failure is worth investigating. For a mobile app, you might accept 0.1% errors because recovery is automatic.

A practical way to set thresholds is to start with your error budget — how many errors can you tolerate before users notice? Once you know that, work backwards to the alert threshold.

Error budget: 0.1% of requests (99.9% success rate)
Traffic: 1M requests/hour
Acceptable errors: 1,000/hour
Alert threshold: 400 errors/5 min (would consume 40% of the hourly budget)

That threshold lets you catch major problems while accepting noise and variation. Revise it quarterly as your traffic and reliability goals shift.
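The same arithmetic can be written as a small helper so the threshold is recomputed whenever traffic or reliability goals change. This is a sketch: `budgetShare` (the fraction of the hourly budget a single window may consume) and `windowMinutes` are parameters introduced here for illustration.

```javascript
// Work backwards from an error budget to a per-window alert threshold.
function alertThreshold({ requestsPerHour, errorBudgetRatio, budgetShare, windowMinutes }) {
  // Errors per hour you can tolerate before users notice.
  const budgetPerHour = Math.round(requestsPerHour * errorBudgetRatio);
  // Errors in one window that would use up budgetShare of the hourly budget.
  const thresholdPerWindow = Math.round(budgetPerHour * budgetShare);
  // How many times faster than "exactly on budget" errors arrive at the threshold.
  const burnRate = (thresholdPerWindow * (60 / windowMinutes)) / budgetPerHour;
  return { budgetPerHour, thresholdPerWindow, burnRate };
}

const threshold = alertThreshold({
  requestsPerHour: 1_000_000,
  errorBudgetRatio: 0.001, // 99.9% success rate
  budgetShare: 0.4,
  windowMinutes: 5,
});
// threshold.budgetPerHour      -> 1000
// threshold.thresholdPerWindow -> 400
// threshold.burnRate           -> 4.8
```

For 1M requests/hour and a 0.1% budget, spending 40% of the hourly budget inside one 5-minute window works out to 400 errors, which is errors arriving at 4.8 times the sustainable rate.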

Route alerts and prioritize by severity

A critical alert routed to everyone is an alert ignored by everyone. Route by ownership: the backend team owns database errors, the frontend team owns render crashes, on-call owns customer-facing incidents.

LightTrace sends alerts by email. Route each one so the person who owns the code sees it, rather than to a Slack channel where it disappears. If you have an on-call rotation, your tooling should tie alerts to the current on-call person.

Not all errors are equal. A 500 error is worse than a 400 error. A customer-impacting bug is worse than internal tooling. Capture this in your event data by setting the level field (error vs warning vs info) and adding tags.

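// `error` is the caught exception; `isPaidFeature` is a boolean your own code sets.
Sentry.captureException(error, {
  // Severity decides which alert rules this event can match.
  level: isPaidFeature ? "error" : "warning",
  tags: {
    feature: "checkout",
    impact: "customer-facing",
    service: "payment-processor"
  }
});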

Then route alerts based on these tags:

{
  "name": "Customer-facing error spike",
  "condition": "level=error AND tags.impact=customer-facing AND count > 100/min",
  "routing": "escalation-contact"
}

This lets you alert loudly on what matters and stay quiet on what doesn't.

Create a separate alert rule for "pages down / unavailable" that goes to your escalation contact immediately. Everything else routes to the team that owns the code. This prevents important incidents from sitting in an email inbox while developers are focused on a routine bug.
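The routing policy described here fits in a small lookup. The sketch below assumes events carry `impact` and `service` tags like the ones set earlier. The `service-down` impact value, the service names, and the `example.com` addresses are placeholders, not real configuration.

```javascript
// Placeholder recipients for this sketch.
const ESCALATION_CONTACT = "escalation@example.com";
const DEFAULT_OWNER = "oncall@example.com";

// Ownership map: which team gets alerts for which service tag.
const OWNERS = new Map([
  ["payment-processor", "payments-team@example.com"],
  ["web-frontend", "frontend-team@example.com"],
  ["database", "backend-team@example.com"],
]);

// Availability incidents bypass ownership and go straight to escalation.
// Everything else goes to the owning team, or to on-call if no owner is known.
function routeAlert(event) {
  const tags = event.tags ?? {};
  if (tags.impact === "service-down") return ESCALATION_CONTACT;
  return OWNERS.get(tags.service) ?? DEFAULT_OWNER;
}
```

Keeping the outage check first means an unavailable service never waits behind a team inbox, even when the failing service has a clear owner.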

Monitor and refine your alert system

Even the best alert rules degrade over time. A business-critical error gets fixed and the fix ships, but the alert keeps firing daily for old clients still running the previous version. A threshold that was perfect at 1M daily requests doesn't scale to 10M. Quarterly, review your recent alerts:

Which ones were false alarms?
Which ones did the team actually investigate?
Which ones paged people for issues they already knew about?

Disable or retune the noisy ones. If you have release health tracking, use that to catch regressions — they'll be correlated with a deploy, making the alert actionable instead of mysterious.
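If your alert history can be exported, the quarterly review can start from a simple tally. The sketch below assumes each history entry has a `rule` name and an `outcome` of `"investigated"`, `"false-alarm"`, or `"already-known"`. Both the record shape and the `maxNoiseRatio` cutoff are hypothetical.

```javascript
// Returns the rules whose share of non-actionable alerts exceeds
// maxNoiseRatio, noisiest first.
function reviewRules(history, { maxNoiseRatio = 0.5 } = {}) {
  const stats = new Map();
  for (const alert of history) {
    const s = stats.get(alert.rule) ?? { rule: alert.rule, fired: 0, noisy: 0 };
    s.fired += 1;
    // Anything the team did not actually investigate counts as noise.
    if (alert.outcome !== "investigated") s.noisy += 1;
    stats.set(alert.rule, s);
  }
  return [...stats.values()]
    .map((s) => ({ ...s, noiseRatio: s.noisy / s.fired }))
    .filter((s) => s.noiseRatio > maxNoiseRatio)
    .sort((a, b) => b.noiseRatio - a.noiseRatio);
}
```

The rules this returns are candidates for disabling or retuning. Rules that mostly lead to investigations are left alone.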

An alert is only useful if it connects to action. When an alert fires:

The on-call person sees it immediately
They acknowledge it and open an investigation
They pull the stack trace, breadcrumbs, and affected users
They create an incident ticket if needed
They fix it or escalate

If any of these steps requires manual work (forwarding the email, copying the error, searching for the issue), people skip it. The alert system that integrates with your incident management, links to the right dashboard, and auto-creates tickets wins.

LightTrace delivers alerts by email, so treat your email client as your incident triage tool. For teams on call, set these emails to interrupt-priority in your notification settings. Well-tuned alerts only fire when they matter.

Build alerts incrementally

Don't launch with ten alert rules. Start with one: new issues in production. Get comfortable with the format, tune the false-positive rate, and train your team on the workflow. Then add a second rule for your most critical path — your checkout flow, your login, whatever can't fail.

After a few weeks, add frequency alerts for your top 5 errors by volume. Watch which rules generate noise in practice, then tune them. It's easier to tune rules when you have real data.

As you scale, you'll add release-health monitoring — alert when a deploy introduces a regression. You'll add correlation rules — alert when error rate is up and latency is up, indicating a real incident. But start simple: detect regressions, stay quiet otherwise.
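A correlation rule of the kind mentioned above can be as small as two comparisons. This sketch assumes you already have current and baseline values for error rate and p95 latency. The field names and the default factors are illustrative, and it assumes non-zero baselines.

```javascript
// Flags a likely incident only when errors AND latency are both elevated.
// Either signal alone is treated as noise by this rule.
function isLikelyIncident(current, baseline, { errorFactor = 2, latencyFactor = 1.5 } = {}) {
  const errorsUp = current.errorRate >= baseline.errorRate * errorFactor;
  const latencyUp = current.p95LatencyMs >= baseline.p95LatencyMs * latencyFactor;
  return errorsUp && latencyUp;
}
```

Requiring both signals trades a little sensitivity for far fewer pages, which fits the goal of staying quiet unless something is genuinely wrong.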

Start tracking errors in minutes

Set up error alerts that catch real regressions without the noise — create your first alert rule in LightTrace in minutes and start your free trial today.

Error alerting is a skill. Most teams learn it the hard way — through a year of ignoring false alarms and then overreacting to the real one. Use these practices to skip that stage: scope to what matters, set thresholds based on risk, route to owners, and monitor for noise. The result is a team that actually reads alerts, responds faster to real problems, and ships with confidence.
