Being on-call for developers means owning production when it breaks — you're the first person to know when an error spikes or a service degrades, and you're on the hook to either fix it or escalate it. Done right, on-call rotations spread the burden fairly and catch issues before they become fires. Done wrong, they burn people out.
This guide walks through the mechanics of on-call for product teams: how to structure rotations, build escalation paths that work, write runbooks that actually save time, and use the right tools so alerting helps instead of exhausts.
Why on-call matters (and who should be on it)
Every production system throws errors and occasionally slows down. The question is whether those problems wake up a human at midnight or get discovered three weeks later during a review. On-call closes the gap between "something went wrong" and "someone is working on it."
Most teams run on-call rotations for the services they own, usually staffed by the engineers who build and deploy the code. On-call isn't a separate role; it's part of shipping. That said, on-call works best when:
- The on-call engineer can actually fix or diagnose the problem (or knows who to escalate to).
- The team has clear runbooks so you're not guessing in a panic.
- Alerts are tuned so they page you for real issues, not noise.
- Rotations are fair and people actually get sleep.
If your team uses error tracking with solid alert rules, you can catch regressions in the first few minutes instead of hours or days. That's the difference between a quick fix and an incident that tanks your crash-free rate.
Structuring on-call rotations
The simplest rotation is a one-week or two-week schedule where one person carries a pager. When a page comes in, that person responds. When the shift ends, they hand the pager to the next person on the list.
A few rules make this work:
Make rotations predictable. Post the schedule in your team Slack or shared calendar so people can plan. Nothing kills morale faster than a surprise rotation.
Keep shifts short. A full week is standard; a full month burns people out. If your team is small, split nights and days. The goal is keeping any one person from running on fumes.
Pair junior on-call with senior. New to the codebase? Pair with a senior engineer so you learn the runbooks and escalation chain before going solo.
Compensate fairly. If you get paged at 3 a.m., leave early the next day. If you're on-call over a holiday, take time off later. It's how you avoid resentment.
Consider "shadow on-call" for new team members: they carry the pager with a senior, so they see the runbooks in action and understand escalation before they own it alone.
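As a minimal sketch of the predictable, fixed-length rotation described in this section, the following generates a round-robin schedule that can be posted to a shared calendar. The `build_rotation` helper, the roster names, and the start date are illustrative assumptions, not part of any particular scheduling tool.

```python
from datetime import date, timedelta
from itertools import cycle


def build_rotation(engineers, start, shifts, shift_days=7):
    """Return a list of (shift_start, shift_end, engineer) tuples.

    Engineers are assigned round-robin in list order, one shift at a time.
    """
    schedule = []
    people = cycle(engineers)
    for i in range(shifts):
        shift_start = start + timedelta(days=i * shift_days)
        shift_end = shift_start + timedelta(days=shift_days - 1)
        schedule.append((shift_start, shift_end, next(people)))
    return schedule
```

For example, `build_rotation(["ana", "bo", "chen"], date(2024, 1, 1), 4)` assigns the first week to the first name on the roster and wraps back around to that person in week four.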
Escalation: when to page the next person
Not every alert is equal. A single TypeError in one user's session is non-urgent; error rates doubling across your API is a fire. Escalation paths ensure urgent issues get more eyes fast.
A simple structure: the on-call engineer triages and checks runbooks (5–10 minutes), then pages a senior if stuck. If still broken after 10–20 minutes, escalate to a manager or architect. Set explicit thresholds so you don't debug solo for an hour or page unnecessarily at 2 a.m.
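Writing those thresholds down as data means nobody has to remember them mid-incident. The sketch below assumes a three-tier path, measures time from when the alert fired, and picks 10 and 20 minutes from the ranges above; the role names and the `who_to_page` function are hypothetical.

```python
# Each tier: (minutes since the alert fired at which this tier is paged, role).
# The 10- and 20-minute thresholds are assumptions chosen from the ranges in the text.
ESCALATION_PATH = [
    (0, "on-call engineer"),
    (10, "senior engineer"),
    (20, "manager or architect"),
]


def who_to_page(minutes_unresolved):
    """Return every role that should have been paged by now, in order."""
    return [role for threshold, role in ESCALATION_PATH
            if minutes_unresolved >= threshold]
```

Keeping the path in one list makes the thresholds easy to review and adjust after an incident.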
Tie alerts to deploys with release health monitoring: an error spike that starts right after a deploy usually calls for a rollback first and debugging afterward.
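One way to make the deploy check mechanical is to compare the spike's start time against recent deploy timestamps. This is only a sketch: the 15-minute window and the `likely_deploy_regression` name are assumptions, and a real setup would read deploy times from your own release tooling.

```python
from datetime import datetime, timedelta


def likely_deploy_regression(spike_start, deploy_times, window_minutes=15):
    """Return the most recent deploy within `window_minutes` before the spike.

    Returns None when no deploy is close enough to be a plausible cause.
    """
    window = timedelta(minutes=window_minutes)
    candidates = [d for d in deploy_times
                  if timedelta(0) <= spike_start - d <= window]
    return max(candidates) if candidates else None
```

If the function returns a deploy, that release is the first rollback candidate; if it returns None, look at databases and dependencies instead.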
Runbooks: turn panic into process
A runbook is a recipe for responding to a specific alert. It's not a full guide; it's a quick checklist: If you see X, do Y. If that doesn't work, escalate to Z.
A typical runbook for "API response time is elevated" might look like:
## API response time > 1 second (p95)
**Severity:** Medium
**On-call action:** Page immediately
**Escalation time:** 10 minutes
1. Check the LightTrace dashboard for this release. Did error rates spike at the same time?
   - If yes: likely a regression. Check the recent deploy and consider rollback.
   - If no: likely a database or external API issue.
2. Check database query performance in traces. Look for slow queries in the span waterfall.
   - If found: identify the slow query, check the query plan, add an index if needed.
3. Check downstream service health (if your API calls other services).
   - Latency might be inherited from a slow dependency.
4. After 10 minutes with no diagnosis, page the backend team lead.
Runbooks live in a wiki or a README your team checks during incidents. The best teams link runbooks from their alert rules directly — LightTrace alerts can include a runbook link in the notification, so the on-call engineer sees the action plan instantly.
Runbooks go stale fast. After you use one, update it with what actually worked. After a quarter, audit the runbooks and delete ones that no longer apply (especially after big refactors).
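A quarterly audit is easier if something flags the candidates for you. The sketch below assumes you can produce a mapping of runbook name to last-updated date (for example from wiki metadata or git history); the `stale_runbooks` helper and the 90-day cutoff are illustrative choices.

```python
from datetime import date, timedelta


def stale_runbooks(last_updated, today, max_age_days=90):
    """Return runbook names not updated within `max_age_days`, oldest first."""
    cutoff = today - timedelta(days=max_age_days)
    stale = [(updated, name) for name, updated in last_updated.items()
             if updated < cutoff]
    return [name for updated, name in sorted(stale)]
```

The result is a review list, not a delete list: a runbook that has not changed in a quarter may still be correct.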
Tuning alerts to reduce noise
Alert fatigue kills on-call. If you page for every blip, people ignore alerts and miss real fires. Smart alerting means:
Alert on outcomes, not details. Don't page for "error rate above 0.1%." Page for "error rate doubled." Use error alerting best practices to set thresholds that predict real impact.
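A relative threshold like "error rate doubled" can be sketched as a comparison against a baseline window. The function name, the `min_requests` guard, and the default ratio are assumptions for illustration; your tracker's alert rules will express this in their own terms.

```python
def should_page(current_errors, current_requests,
                baseline_errors, baseline_requests,
                ratio=2.0, min_requests=100):
    """Page when the current error rate is at least `ratio` times the baseline.

    `min_requests` avoids paging on tiny samples where one error skews the rate.
    """
    if current_requests < min_requests or baseline_requests == 0:
        return False
    current_rate = current_errors / current_requests
    baseline_rate = baseline_errors / baseline_requests
    if baseline_rate == 0:
        # No errors in the baseline: any error at all is a relative spike.
        return current_errors > 0
    return current_rate >= ratio * baseline_rate
```

Comparing against a baseline instead of a fixed percentage lets the same rule work for both a quiet service and a noisy one.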
Batch related errors. One issue with 100 events beats 100 alert rules. Use error grouping to collapse duplicates, then alert once.
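To illustrate the idea of grouping, the sketch below collapses events by a fingerprint. Using the error type plus the top stack frame is a deliberately simple assumption, and the event field names are hypothetical; real trackers apply richer grouping rules.

```python
from collections import Counter


def group_events(events):
    """Count events per fingerprint so each distinct issue appears once.

    The fingerprint is (error type, top stack frame), a simplifying assumption.
    """
    return Counter((e["type"], e["top_frame"]) for e in events)
```

An alert rule then fires once per key in the result, however many events share that key.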
Ignore harmless noise. Some errors (ResizeObserver warnings, CORS preflight failures) don't affect users. Configure your tracker to exclude them or mark low-severity so they don't page.
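If your tracker has no built-in ignore list, the same effect can be approximated with a filter in front of the paging step. The patterns and the `is_noise` helper below are hypothetical examples; build your own list from errors you have confirmed are harmless.

```python
# Hypothetical ignore list: substrings matched against the error message.
IGNORED_PATTERNS = [
    "ResizeObserver loop",
    "CORS preflight",
]


def is_noise(message, patterns=IGNORED_PATTERNS):
    """Return True if the message matches a known-harmless pattern."""
    lowered = message.lower()
    return any(p.lower() in lowered for p in patterns)
```

Keep the list short and reviewed, since an overly broad pattern can hide a real regression.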
Page for new errors immediately. First time you've seen an error? It's likely a regression. For existing errors, only page if rate spikes. This catches regressions without noise.
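The new-versus-existing rule is small enough to sketch directly. This assumes each issue has a stable fingerprint and that a rate-spike check already exists elsewhere; `page_decision` and the in-memory set are illustrative stand-ins for whatever state your tracker keeps.

```python
def page_decision(fingerprint, seen_fingerprints, rate_spiked):
    """Page on the first sighting of an issue; otherwise only on a rate spike.

    `seen_fingerprints` is a mutable set that this function updates.
    """
    if fingerprint not in seen_fingerprints:
        seen_fingerprints.add(fingerprint)
        return "page: new error"
    if rate_spiked:
        return "page: rate spike"
    return "no page"
```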
Tools that make on-call bearable
A good error tracker is table stakes. When you're on-call, you need:
- Fast issue grouping, so you see the root cause instead of 10,000 duplicate records.
- Rich context (stack traces, breadcrumbs, affected users, release info), so you triage in seconds.
- Source links to jump straight to the broken code.
- Distributed tracing to follow a request across services.
Beyond error tracking, keep your runbooks in a shared wiki (Google Docs, Confluence, or GitHub) so the team can update them on the fly. Maintain an on-call log where people post "I restarted X" or "Rolled back the 3 p.m. deploy" — future rotations learn fast from what worked. Use your error tracker's alerts as the single source of truth; don't duplicate pages via Slack bots. Let the error tracker notify directly, and reserve Slack for incident discussion.
Measuring on-call health and building a culture that works
After a few rotations, ask yourself:
- How often is the on-call engineer paged? (Target: 1–2 times per week for a stable team.)
- How long does it take to diagnose issues? (Target: under 5 minutes for most alerts, under 15 for hard ones.)
- How many pages are actually user-facing issues vs. noise? (Target: >70% user-facing.)
- Are people burning out? (If yes, lighten the rotation or improve alert tuning.)
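The first and third questions can be answered from a simple page log. The sketch below assumes each logged page is a dict with a boolean `user_facing` field, which is a hypothetical format; adapt it to however your team records pages.

```python
def rotation_health(pages, weeks):
    """Summarize paging load for a period of `weeks` weeks.

    `pages` is a list of dicts with a boolean "user_facing" key.
    """
    total = len(pages)
    user_facing = sum(1 for p in pages if p["user_facing"])
    return {
        "pages_per_week": total / weeks,
        "user_facing_ratio": user_facing / total if total else None,
    }
```

Compare `pages_per_week` and `user_facing_ratio` against the targets above after each rotation to see whether alert tuning is working.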
Mean time to resolution (MTTR) is the metric that matters most. If your alerts are good and your runbooks are clear, MTTR should be under 15 minutes for most issues. If it's regularly over an hour, something's broken: the runbooks are stale, the alerts are vague, or the on-call engineer lacks context.
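MTTR is just the average of resolution time minus open time across incidents. A minimal sketch, assuming each incident is recorded as an `(opened, resolved)` pair of datetimes (the record format and function name are illustrative):

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average of (resolved - opened) across incidents, as a timedelta.

    Returns None when there are no incidents to average.
    """
    if not incidents:
        return None
    total = sum((resolved - opened for opened, resolved in incidents),
                timedelta(0))
    return total / len(incidents)
```

Because a single long incident can dominate the mean, it is worth looking at the individual durations as well as the average.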
Being on-call isn't glamorous. It means staying near a laptop, context-switching, and occasionally losing sleep. But it's also the fastest way to learn your system. You see what breaks, why it breaks, and how to fix it. After a few weeks on rotation, you know your codebase better than code review alone could teach you.
The best on-call cultures treat it as a skill, not a chore. Invest in runbooks, tune alerts ruthlessly, and support people carrying the pager. The payoff is a team that ships confidently because they know production isn't a mystery.
On-call works when everyone knows the plan: what pages you, how fast you need to respond, and when to call for backup. With a solid error tracker, clear runbooks, and fair rotations, on-call becomes a routine part of shipping instead of an all-hands emergency.