When production breaks at 2 a.m., you can't spin up a debugger and step through the code. You can't reproduce it locally because it's context-specific — a particular user's timezone, a race condition that only hits under load, or a third-party API responding differently than expected. The entire workflow for debugging production errors is different from the development cycle, and it requires a different toolkit.
This guide walks you through the practical workflow that teams use to trace a production error from the first alert to the deployed fix. You'll learn how to read what the system tells you, reconstruct the path that broke, and ship a fix with confidence.
The stack trace is your first clue
A stack trace is a map: each frame tells you which function was running when it crashed, and in what order. The topmost frame is where the crash actually happened; the bottom frame is where the whole call chain started. Your detective work begins at the top.
TypeError: Cannot read properties of undefined (reading 'email')
at getUserEmail (users.ts:42:15)
at formatUserInfo (api.ts:18:22)
at handler (api.ts:5:12)
at processRequest (app.ts:99:33)
Start at the top frame and ask: "What's the exact line?" In this case, line 42 of users.ts tried to read .email from something that was undefined. The question becomes: why is the caller passing undefined? Move down to the next frame and repeat. In this example, the most likely culprit is formatUserInfo, the direct caller: it's passing the wrong thing to getUserEmail. If formatUserInfo's own input turns out to be bad, keep walking down the stack.
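To make the frame-by-frame reading concrete, here is a minimal sketch of how the top two frames might look in code. Only the function names come from the trace above; the User type, the usersById map, and the function bodies are hypothetical, invented to show how an unchecked lookup in the caller surfaces as a crash in the callee.

```typescript
// Hypothetical shapes; only the function names come from the trace above.
interface User {
  id: string;
  email: string;
}

const usersById = new Map<string, User>([
  ["u1", { id: "u1", email: "ada@example.com" }],
]);

// The crash site: reading .email throws a TypeError when user is undefined.
function getUserEmail(user: User): string {
  return user.email;
}

// Buggy caller: the lookup can miss, and the non-null assertion (!)
// hides that possibility from the compiler.
function formatUserInfoBuggy(id: string): string {
  const user = usersById.get(id);
  return `Contact: ${getUserEmail(user!)}`;
}

// Fixed caller: handle the missing user where the lookup happens.
function formatUserInfo(id: string): string {
  const user = usersById.get(id);
  if (user === undefined) {
    return "Contact: unknown user";
  }
  return `Contact: ${getUserEmail(user)}`;
}
```

The crash is reported in getUserEmail, but the fix belongs one frame down, in the caller that let undefined through.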
If your stack trace shows app.js:1:52210 instead of a real line number and file, your code is minified and you need source maps. Upload them automatically in CI so every deploy ships readable traces.
The stack trace becomes harder to read when it points at a library or a framework. That's where GitHub source links help enormously — they turn a frame into a one-click jump to the exact line on GitHub, so you see not just the code but the test coverage and commit history behind it.
Reconstruct the path with breadcrumbs
A stack trace tells you where it crashed, but not how you got there. That's where breadcrumbs come in. Breadcrumbs are the events leading up to the crash — API calls made, UI buttons clicked, logs written — so you can replay the user's exact path through your system.
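Error-tracking SDKs typically record breadcrumbs for you, but the underlying idea is small enough to sketch: keep a bounded list of recent events and attach a copy of it when an error is captured. The BreadcrumbTrail class and its method names below are hypothetical and are not any SDK's API.

```typescript
interface Breadcrumb {
  timestamp: number; // milliseconds since epoch
  category: string; // e.g. "http" or "ui"
  message: string;
}

// Hypothetical bounded buffer of recent events.
class BreadcrumbTrail {
  private readonly crumbs: Breadcrumb[] = [];
  private readonly maxCrumbs: number;

  constructor(maxCrumbs: number = 100) {
    this.maxCrumbs = maxCrumbs;
  }

  add(category: string, message: string, timestamp: number = Date.now()): void {
    this.crumbs.push({ timestamp, category, message });
    // Drop the oldest entry once the buffer is over capacity.
    if (this.crumbs.length > this.maxCrumbs) {
      this.crumbs.shift();
    }
  }

  // Return a copy so later breadcrumbs don't change an already-captured event.
  snapshot(): Breadcrumb[] {
    return [...this.crumbs];
  }
}
```

Capping the buffer keeps memory use flat on long-lived sessions while still preserving the last few steps before a crash.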
Imagine a crash in checkout with this stack trace:
Error: User not authenticated
at chargeCard (payment.ts:125:8)
at processCheckout (checkout.ts:87:15)
The stack trace tells you the user hit chargeCard without authentication. But why? The breadcrumbs reveal the story:
18:42:03 [http] GET /api/session -> 200 OK
18:42:15 [ui] User clicked "Continue to payment"
18:42:18 [http] POST /api/payment -> 401 Unauthorized (token expired)
18:42:19 Error: User not authenticated
Now the picture is clear. The session was valid initially, but the token expired between the time the user started checkout and when they tried to pay. The fix isn't in the payment code — it's adding token-refresh logic before charging, or extending the session lifetime. Breadcrumbs turned a mystery into a fix.
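One way to express the "refresh before charging" fix is a small expiry check that runs right before the payment call. This is a sketch under assumptions: the AuthSession shape, the needsRefresh name, and the 30-second safety margin are all hypothetical and would need to match how your sessions actually work.

```typescript
// Hypothetical session shape; expiresAt is a Unix timestamp in milliseconds.
interface AuthSession {
  token: string;
  expiresAt: number;
}

// True when the token has expired or will expire within the safety margin.
function needsRefresh(
  session: AuthSession,
  now: number,
  marginMs: number = 30000,
): boolean {
  return session.expiresAt - now <= marginMs;
}
```

A checkout handler could call needsRefresh(session, Date.now()) before chargeCard and refresh the token when it returns true, so a slow shopper never reaches the payment step with a stale token.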
Connect the error to a release
Every production error should be tagged with the release it occurred in, so you can work out which release introduced it. If you see "this started appearing after 3 p.m.," you should be able to match it to a deploy time. If an error disappears after a deploy, it tells you the fix worked. If it stays even after the code that caused it was supposedly changed, something else is wrong.
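import * as Sentry from "@sentry/node"; // assumes the Node SDK; use the Sentry SDK for your platform

Sentry.init({
  dsn: "https://<key>@your-lighttrace-host/1",
  environment: "production",
  release: "api@2.14.3", // Tag every event with the release
});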
Once tagged, you can ask: "Did this error exist before this release?" If it was first seen in this release, this deploy's changes introduced it. If it appeared in an earlier release, was fixed, and has now come back, it's a regression: a reintroduction of an old bug. Either way, you know which deploy to investigate. This is why release health monitoring matters: when an error rate spikes right after a deploy, a recent change is the prime suspect.
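Matching an error's first occurrence to a deploy can be sketched as a lookup over deploy timestamps. The Deploy shape and the suspectDeploy function are hypothetical; a real tracker would read both from its release data.

```typescript
interface Deploy {
  release: string;
  deployedAt: number; // Unix timestamp in milliseconds
}

// Return the most recent deploy at or before the error's first occurrence,
// or undefined when the error predates every known deploy.
function suspectDeploy(deploys: Deploy[], firstSeenAt: number): Deploy | undefined {
  let suspect: Deploy | undefined;
  for (const deploy of deploys) {
    const isBeforeError = deploy.deployedAt <= firstSeenAt;
    if (isBeforeError && (suspect === undefined || deploy.deployedAt > suspect.deployedAt)) {
      suspect = deploy;
    }
  }
  return suspect;
}
```

The result is only a starting point: an error that first appears hours after the latest deploy may still come from a dependency or a traffic pattern rather than from that release.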
Follow the trace through services
A single user action often spans multiple services. An API call triggers a database query, which spawns a background job, which hits a cache layer. If any of those services crash, you need to see the whole chain — not just your service's piece.
This is what distributed tracing does. A trace ID flows from the frontend through every service that touches the request, and each service logs a span — a record of its work. Visualized as a waterfall, you see exactly where the slowdown or failure happens.
User Action (0ms – 1200ms)
├─ API Request (50ms – 600ms)
│  ├─ Fetch user from DB (30ms – 100ms)
│  ├─ Check permissions (20ms – 80ms)
│  └─ Fetch related data from service B (100ms – 450ms)
│     └─ Redis cache lookup (20ms – 35ms)
│        └─ Hit DB (80ms – 400ms)
└─ Render (600ms – 1200ms)
When the error happens inside "Fetch user from DB" at the 100ms mark, you know that's the culprit. You can dive into the database logs with the exact timestamp and trace ID, and see why that query hung. Without the trace, you're guessing.
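Reading a waterfall amounts to walking a tree of spans until you reach the one that failed. The sketch below assumes a hypothetical Span shape with an optional error field; tracing backends model spans with more detail, but the traversal idea is the same.

```typescript
// Hypothetical span shape for illustration.
interface Span {
  name: string;
  startMs: number;
  endMs: number;
  error?: string;
  children: Span[];
}

// Depth-first search: return the path of span names from the root down to
// the first errored span that has no errored descendants.
function findFailingPath(span: Span): string[] | undefined {
  for (const child of span.children) {
    const path = findFailingPath(child);
    if (path !== undefined) {
      return [span.name, ...path];
    }
  }
  return span.error !== undefined ? [span.name] : undefined;
}
```

Errors usually propagate upward, so the parent spans are often marked as failed too. Following the path to its last element separates the span that actually broke from the ones that merely reported it.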
Spot patterns across similar errors
One error can tell you one story. Ten identical errors tell you something's systematically wrong. Error trackers group errors by fingerprint so you see issues, not noise.
| Signal | Meaning |
|---|---|
| Same error, 50 occurrences in 5 minutes | New regression — likely introduced by a recent deploy |
| Same error, 1–2 occurrences per day for a week | Edge case bug — happens rarely but consistently |
| Same error, spiking from 2/day to 200/day when a third-party API degrades | Dependency issue, not your code |
Grouping reveals these patterns. Raw events just look like chaos. You can use these patterns to prioritize — a spike is more urgent than a steady low-frequency error, even if both have the same stack trace.
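Grouping by fingerprint can be sketched in a few lines. The fingerprint rule here (error type plus top frame) is an assumption chosen for illustration; real error trackers use richer rules, and the CapturedError shape is hypothetical.

```typescript
// Hypothetical event shape for illustration.
interface CapturedError {
  type: string; // e.g. "TypeError"
  message: string;
  topFrame: string; // e.g. "getUserEmail (users.ts)"
}

// Assumed fingerprint: error type plus top frame. The message is left out
// because it often contains per-request values such as IDs.
function fingerprint(event: CapturedError): string {
  return `${event.type}|${event.topFrame}`;
}

// Bucket raw events into issues keyed by fingerprint.
function groupByFingerprint(events: CapturedError[]): Map<string, CapturedError[]> {
  const groups = new Map<string, CapturedError[]>();
  for (const event of events) {
    const key = fingerprint(event);
    const group = groups.get(key);
    if (group === undefined) {
      groups.set(key, [event]);
    } else {
      group.push(event);
    }
  }
  return groups;
}
```

Counting each group's events per time window is what turns a pile of raw events into the signals in the table above.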
Use AI to skip the guessing
Reading a stack trace and guessing why it happened is the human way. The machine way is to feed the stack trace, breadcrumbs, and context to an LLM and ask it to explain the bug in plain language. That's what AI root-cause analysis does — it points at the exact line and explains what went wrong, which cuts the time from "I see a trace" to "I know what to fix" from minutes to seconds.
Error: Cannot read properties of undefined (reading 'email')
at getUserEmail (users.ts:42:15)
Breadcrumbs:
- [http] GET /api/user?id=undefined
- [ui] User clicked "Send invitation"
AI Explanation:
"The user clicked 'Send invitation' but the user ID wasn't loaded yet.
Line 42 tries to access user.email without checking if user is defined.
Add an optional chaining operator: user?.email"
The AI doesn't replace your judgment, but it saves you the "what's this line doing?" step.
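The input side of that exchange is plain text assembly. The sketch below only builds a prompt from a stack trace and breadcrumbs; it does not call any model, and the buildRootCausePrompt name and the prompt wording are hypothetical.

```typescript
// Assemble a plain-text prompt from an error's stack trace and breadcrumbs.
// No model is called here; sending the prompt is left to your LLM client.
function buildRootCausePrompt(stackTrace: string, breadcrumbs: string[]): string {
  const crumbLines =
    breadcrumbs.length > 0
      ? breadcrumbs.map((crumb) => `- ${crumb}`).join("\n")
      : "- (none recorded)";
  return [
    "Explain the most likely root cause of this production error and suggest a fix.",
    "",
    "Stack trace:",
    stackTrace,
    "",
    "Breadcrumbs:",
    crumbLines,
  ].join("\n");
}
```

The more context the prompt carries (release, environment, recent breadcrumbs), the less the model has to guess, which is the same reason those fields matter to a human reader.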
Compare production behavior to local reproduction
Some bugs only happen in production: under load, with real data, with network latency, across multiple time zones. Try to isolate the conditions. If the error only happens for users in a specific country, check for timezone or locale-specific code. If it spikes when traffic is high, it's likely a race condition or resource exhaustion.
Pull the exact inputs from the error event and try to recreate it locally. If you can reproduce it, you're done — you can step through a debugger and fix it. If you can't, the bug is probably environmental. Then check:
- Has the third-party service the error depends on changed recently?
- Did you deploy a change to the edge layer, caching, or request routing?
- Is the error happening in a specific browser or device type?
Temporal correlation is your friend. Errors don't appear randomly. They correlate with deploys, traffic patterns, or external events. Trace those correlations, and you'll find the cause.
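Isolating the conditions often starts with a tally: for a given tag such as country or browser, which value dominates the affected events? The sketch below assumes events carry a flat tags record; the TaggedEvent and TagShare shapes and the dominantTagValue function are hypothetical.

```typescript
// Hypothetical event shape: a flat record of tag names to values.
interface TaggedEvent {
  tags: Record<string, string>;
}

interface TagShare {
  value: string;
  share: number; // fraction of tagged events carrying this value, between 0 and 1
}

// Find the most common value of one tag among the events that carry it.
function dominantTagValue(events: TaggedEvent[], tag: string): TagShare | undefined {
  const counts = new Map<string, number>();
  let tagged = 0;
  for (const event of events) {
    const value: string | undefined = event.tags[tag];
    if (value === undefined) {
      continue; // this event doesn't carry the tag
    }
    tagged += 1;
    counts.set(value, (counts.get(value) ?? 0) + 1);
  }
  let best: TagShare | undefined;
  counts.forEach((count, value) => {
    const share = count / tagged;
    if (best === undefined || share > best.share) {
      best = { value, share };
    }
  });
  return best;
}
```

A high share is only meaningful relative to your normal traffic mix: if most of your users are in one country anyway, that country dominating the errors tells you little.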
Set up alerts to catch it again
Once you ship the fix, you don't want the same error to reappear silently weeks later. That's why alert rules exist. Set a rule: "If this issue appears again, page me." Or better, make it automatic: "If the error rate of any issue spikes 10x in an hour, create an incident."
This transforms debugging from reactive (someone reports it) to proactive (your monitoring detects it). For a deeper playbook on preventing regressions, see the error triage process.
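The "spikes 10x in an hour" rule can be sketched as a comparison against a baseline rate. The isSpike function and the minEvents floor are assumptions added for illustration; the floor exists so that an issue going from zero events to one does not page anyone.

```typescript
// Decide whether the last hour's volume counts as a spike over the baseline.
function isSpike(
  lastHourCount: number,
  baselineHourlyRate: number,
  factor: number = 10,
  minEvents: number = 10, // assumed noise floor, tune for your traffic
): boolean {
  // Ignore tiny volumes regardless of the ratio.
  if (lastHourCount < minEvents) {
    return false;
  }
  // With no baseline at all, any volume above the floor counts as a spike.
  if (baselineHourlyRate <= 0) {
    return true;
  }
  return lastHourCount >= baselineHourlyRate * factor;
}
```

How the baseline is computed (trailing day, trailing week, same hour last week) changes how sensitive the rule is, so treat the factor and the floor as values to tune rather than fixed constants.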
The debug workflow in practice
Here's how a typical debugging session flows:
- Alert arrives. New error: TypeError: Cannot read properties of undefined in the checkout flow.
- Check the stack trace. Points to chargeCard() in payment.ts:125.
- Read breadcrumbs. See that the token expired between page load and payment attempt.
- Check the release. This error started 2 hours ago. Last deploy was 3 hours ago — probably not us, unless something changed in a third-party auth library.
- Check the trace. See that the auth service returned 401, and the client didn't refresh the token before retrying.
- Ask the AI. "Why didn't the token refresh?" → Answer: "Your retry logic doesn't call the refresh endpoint when it gets a 401."
- Write the fix. Add token refresh before retrying. Ship it.
- Monitor. Set an alert rule so if this error spikes again, you know immediately.
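The fix from this session, refreshing the token before retrying on a 401, might look like the sketch below. The send and refreshToken callbacks are injected so the example stays independent of any HTTP client; both names, the HttpResult shape, and sendWithRefresh itself are hypothetical.

```typescript
// Hypothetical minimal response shape.
interface HttpResult {
  status: number;
  body: string;
}

// Send a request; on 401, refresh the token and retry exactly once.
async function sendWithRefresh(
  send: (token: string) => Promise<HttpResult>,
  refreshToken: () => Promise<string>,
  token: string,
): Promise<HttpResult> {
  const first = await send(token);
  if (first.status !== 401) {
    return first;
  }
  // Token rejected: fetch a fresh one. Retrying only once avoids a refresh
  // loop when the 401 has a cause other than expiry.
  const freshToken = await refreshToken();
  return send(freshToken);
}
```

If the second attempt also returns 401, the caller sees that response unchanged and can send the user back to sign in.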
The entire cycle — from alert to fix to deployment — can happen in under ten minutes with the right information in front of you.
Don't patch-and-pray. After you deploy a fix, monitor the error rate for the next few hours. If it doesn't drop to zero, your fix didn't work or you misdiagnosed the cause. Check the error trend graph before you close the loop.
Debugging production errors is a learned skill. It requires discipline (read every breadcrumb, don't guess), curiosity (why exactly did this happen?), and humility (you'll be wrong sometimes, and that's how you learn). The stack trace is always telling you the truth — it's just in a language that takes practice to read fluently. Start with the trace, follow the breadcrumbs, and let the context guide you to the fix.
Start tracking errors in minutes
See all this in action: point any Sentry SDK at LightTrace to get grouped errors, readable stack traces, and breadcrumbs in minutes.