When a user clicks a button in your app and nothing happens, where do you look? Ten years ago, the answer was simple: one server, one log file. Today? Your request probably bounced through a frontend service, three backend microservices, two databases, a cache, and a message queue. By the time your team notices something went wrong, the request is long gone — lost in the noise of millions of others.
Distributed tracing is the practice of following a single request as it moves through your entire system. It answers the question every developer asks when debugging a production failure: "Exactly where did this request get stuck?" Instead of stitching together logs from ten different services, a distributed trace shows you one coherent waterfall of every span of work — who called whom, how long each step took, and where it failed.
This guide explains how distributed tracing works, why it matters for modern systems, and how it turns a debugging nightmare into a guided tour through your infrastructure.
Understanding traces and spans
Distributed tracing is built on two core concepts: traces and spans.
A trace is the complete journey of a single request from entry to exit. It has a unique trace ID that stays constant as the request moves across services, databases, and queues. A trace groups together everything that happened because of that one user action.
A span is a single operation within that trace — a database query, an HTTP call to another service, a message-queue publish, a cache lookup, or even a chunk of synchronous work in a function. Every span has a span ID and knows its parent span. That parent-child relationship is what creates the tree structure: a web request spawns a database query span, which spawns a network call to fetch related data.
Here's a concrete example. A user submits an order form. One trace ID follows the request:
- Span A: Frontend makes HTTP POST to /api/orders
- Span B: Backend POST /api/orders handler (child of A)
- Span C: Validate user in database (child of B)
- Span D: Check inventory in a separate inventory service (child of B)
- Span E: Reserve stock via database (child of D)
- Span F: Publish order event to Kafka (child of B)
All of these spans carry the same trace ID. When you look at the trace in your dashboard, you see the whole tree at a glance: Span C took 12ms, Span D took 340ms (the bottleneck), Span E took 8ms, Span F took 4ms. The critical path jumps out.
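The parent-child structure described above can be sketched in a few lines of plain JavaScript. This is only an illustration: the field names (traceId, spanId, parentSpanId) and the short IDs are assumptions chosen for readability, not the schema of any particular SDK or backend.

```javascript
// Hypothetical flat span records for the order example.
// Field names and IDs are illustrative, not a real SDK schema.
const spans = [
  { traceId: "t1", spanId: "A", parentSpanId: null, name: "Frontend POST /api/orders" },
  { traceId: "t1", spanId: "B", parentSpanId: "A", name: "Backend POST /api/orders handler" },
  { traceId: "t1", spanId: "C", parentSpanId: "B", name: "Validate user in database" },
  { traceId: "t1", spanId: "D", parentSpanId: "B", name: "Check inventory" },
  { traceId: "t1", spanId: "E", parentSpanId: "D", name: "Reserve stock via database" },
  { traceId: "t1", spanId: "F", parentSpanId: "B", name: "Publish order event to Kafka" },
];

// Rebuild the tree by attaching each span to its parent.
function buildTree(flatSpans) {
  const byId = new Map(flatSpans.map((s) => [s.spanId, { ...s, children: [] }]));
  let root = null;
  for (const node of byId.values()) {
    const parent = byId.get(node.parentSpanId);
    if (parent) parent.children.push(node);
    else root = node; // no known parent: treat it as the root span
  }
  return root;
}

// Render the tree as indented lines, one per span.
function render(node, depth = 0) {
  const lines = ["  ".repeat(depth) + node.name];
  for (const child of node.children) lines.push(...render(child, depth + 1));
  return lines;
}

const tree = buildTree(spans);
console.log(render(tree).join("\n"));
```

Because every span only stores its own ID and its parent's ID, the spans can arrive in any order and from any service, and the tree can still be rebuilt afterwards.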
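Trace IDs and span IDs are just opaque identifiers, typically random hex strings (the Sentry SDKs use a 32-character trace ID and a 16-character span ID). They travel in HTTP headers and message metadata, can be attached to log lines, and live in application memory while a request is being handled. SDKs and libraries extract and propagate them automatically, so you don't have to pass them by hand.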
How tracing works in practice
Instrumenting a system for distributed tracing requires three pieces to work together:
1. The SDK installs hooks into your application's HTTP client, database driver, and runtime. When code makes a network call or queries a database, the SDK wraps that operation in a span, records timing and metadata, and propagates the trace ID and parent span ID in the outgoing request.
2. Context propagation passes the trace ID and span IDs across service boundaries. If service A calls service B, service A puts the trace ID in an HTTP header. Service B reads that header and creates its own spans as children of the incoming span. Most modern SDKs (like the Sentry SDK) handle this automatically.
3. A backend (your error tracker or APM tool) collects all those spans from all your services, reassembles them into a single trace, and stores it so you can query and visualize it later.
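To make the context propagation step concrete, here is a minimal sketch of what an SDK does on each side of a service boundary. It assumes the `sentry-trace` header layout used by the Sentry SDKs (trace ID, span ID, and an optional sampled flag joined by dashes). The helper names buildTraceHeader and parseTraceHeader and the ID values are hypothetical; in a real service the SDK does this for you.

```javascript
// Assumed layout: <32-hex trace id>-<16-hex span id>[-<sampled 0|1>]
const HEADER_PATTERN = /^([0-9a-f]{32})-([0-9a-f]{16})(?:-([01]))?$/;

// Hypothetical helper: serialize the current span's context for an outgoing call.
function buildTraceHeader({ traceId, spanId, sampled }) {
  const flag = sampled === undefined ? "" : `-${sampled ? 1 : 0}`;
  return `${traceId}-${spanId}${flag}`;
}

// Hypothetical helper: read the context from an incoming request.
// Returns null when the header is missing or malformed.
function parseTraceHeader(value) {
  const match = HEADER_PATTERN.exec(String(value || "").trim());
  if (!match) return null;
  return {
    traceId: match[1],
    parentSpanId: match[2],
    sampled: match[3] === undefined ? undefined : match[3] === "1",
  };
}

// Service A: attach the header to the outgoing request.
const outgoing = {
  traceId: "0af7651916cd43dd8448eb211c80319c",
  spanId: "b7ad6b7169203331",
  sampled: true,
};
const headers = { "sentry-trace": buildTraceHeader(outgoing) };

// Service B: continue the same trace with a new span whose parent is A's span.
const incoming = parseTraceHeader(headers["sentry-trace"]);
const childSpan = {
  traceId: incoming.traceId,
  parentSpanId: incoming.parentSpanId,
  spanId: "00f067aa0ba902b7", // would normally be freshly generated
};
console.log(childSpan);
```

The trace ID is copied unchanged across the boundary, while the caller's span ID becomes the callee's parent span ID. That single handoff is what lets the backend join spans from both services into one tree.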
Here's how you'd instrument a simple Node.js service:
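import * as Sentry from "@sentry/node";

Sentry.init({
  dsn: "https://<key>@your-lighttrace-host/1",
  tracesSampleRate: 1.0,
  environment: "production",
});

// `app`, `db`, and `userService` are assumed to exist elsewhere: an Express app,
// a pg-style client whose query() resolves to { rows }, and an HTTP client wrapper.
app.get("/api/orders/:id", async (req, res) => {
  // startSpan replaces startTransaction, which was removed in SDK v8.
  // The span ends when the async callback settles, even if it throws.
  // The name uses the route pattern rather than the raw ID to keep names groupable.
  await Sentry.startSpan(
    { op: "http.server", name: "GET /api/orders/:id" },
    async () => {
      const { rows } = await db.query("SELECT * FROM orders WHERE id = $1", [
        req.params.id,
      ]);
      const order = rows[0];
      if (!order) {
        res.status(404).json({ error: "order not found" });
        return;
      }
      const user = await userService.fetchUser(order.user_id);
      // The database query and HTTP call are automatically wrapped in child spans
      res.json({ order, user });
    }
  );
});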
The Sentry SDK intercepts the database query and the userService.fetchUser HTTP call, creates child spans automatically, and reports the whole transaction to LightTrace. The trace ID propagates in the HTTP headers without you writing a single line to handle it.
Traces vs logs vs metrics
Many teams already have logs and metrics. Where does tracing fit in?
| Signal | Answers | Example |
|---|---|---|
| Logs | What happened, step by step, on this service? | 2026-07-03 14:22:51 User 42 clicked order; inventory check started |
| Metrics | What's the overall health and throughput? | p99_latency: 850ms, error_rate: 2.1% |
| Traces | Exactly which request failed, and where in the call chain? | trace_id=abc123: POST /orders → db query (12ms) → inventory service (340ms, timeout) → failed |
Logs are often unstructured and high-volume: great for post-mortems on one service, but hard to query across many. Metrics are aggregated and anonymous: perfect for dashboards, but they hide individual failures. Traces fill the gap between the two: they follow a single request end-to-end, so you see the exact path and where it broke.
The three pillars of observability tie together logs, metrics, and traces. Distributed tracing is the pillar that makes sense of complex, multi-service systems.
Reading a span waterfall (and finding bottlenecks)
When you open a trace in LightTrace, you see a span waterfall — a visual representation of all the spans in the trace, laid out horizontally by time.
POST /api/orders (0ms - 380ms) ──────────────────────────────────────────
├─ Validate user in db (2ms - 14ms) ─────
├─ Call inventory service (20ms - 360ms) ────────────────────────────────
│ ├─ HTTP POST /inventory/reserve (22ms - 358ms) ───────────────────
│ └─ db query SELECT stock (25ms - 33ms) ───
└─ Publish to Kafka (362ms - 366ms) ──
The waterfall shows:
- Spans arranged top to bottom, with the root span at top (the HTTP request).
- Time on the x-axis. Wider spans took longer.
- Parent-child relationships via indentation.
- Critical path highlighted. The inventory service call (340ms) dominates the trace. If you shave 100ms off that call, the whole request gets faster.
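One way to find the dominant span programmatically is to rank spans by self time, meaning a span's duration minus the time covered by its children. The sketch below uses the timings from the waterfall above. The record shape (id, parent, start, end) is an assumption for illustration, and a real backend may compute the critical path differently.

```javascript
// Timings copied from the waterfall above (milliseconds from trace start).
// The id/parent field names are illustrative, not a real SDK schema.
const spans = [
  { id: "root", parent: null, name: "POST /api/orders", start: 0, end: 380 },
  { id: "user", parent: "root", name: "Validate user in db", start: 2, end: 14 },
  { id: "inv", parent: "root", name: "Call inventory service", start: 20, end: 360 },
  { id: "reserve", parent: "inv", name: "HTTP POST /inventory/reserve", start: 22, end: 358 },
  { id: "stock", parent: "inv", name: "db query SELECT stock", start: 25, end: 33 },
  { id: "kafka", parent: "root", name: "Publish to Kafka", start: 362, end: 366 },
];

// Self time = own duration minus the union of child intervals,
// so overlapping children are not counted twice.
function selfTime(span, allSpans) {
  const children = allSpans
    .filter((s) => s.parent === span.id)
    .sort((a, b) => a.start - b.start);
  let covered = 0;
  let cursor = span.start;
  for (const child of children) {
    const from = Math.max(child.start, cursor);
    const to = Math.min(child.end, span.end);
    if (to > from) {
      covered += to - from;
      cursor = to;
    }
  }
  return span.end - span.start - covered;
}

const ranked = spans
  .map((span) => ({ name: span.name, selfMs: selfTime(span, spans) }))
  .sort((a, b) => b.selfMs - a.selfMs);

console.log(ranked[0]); // { name: 'HTTP POST /inventory/reserve', selfMs: 336 }
```

Ranking by self time rather than total duration avoids blaming a parent span for time that was really spent inside one of its children.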
Reading a waterfall is the key to using traces for debugging. Instead of guessing which service is slow, you see it. You spot N+1 queries (ten child database spans when there should be one), timeouts (a span that runs until it hits a timeout limit), and cascading failures (span D fails, so its parent aborts, so its grandparent retries).
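The N+1 pattern mentioned above is also easy to spot mechanically: look for many sibling database spans running the same query under one parent. The following sketch assumes hypothetical span records with op, parent, and description fields and an arbitrary threshold of five. It shows the idea only and is not how any specific tool detects it.

```javascript
// Count identical query spans under the same parent and flag large groups.
function findRepeatedQueries(spans, threshold = 5) {
  const groups = new Map();
  for (const span of spans) {
    if (span.op !== "db.query") continue;
    const key = JSON.stringify([span.parent, span.description]);
    const group = groups.get(key) || {
      parent: span.parent,
      description: span.description,
      count: 0,
    };
    group.count += 1;
    groups.set(key, group);
  }
  return [...groups.values()].filter((group) => group.count >= threshold);
}

// Hypothetical trace: one query for the order list, then one query per order.
const spans = [
  { op: "http.server", parent: null, description: "GET /api/orders" },
  { op: "db.query", parent: "handler", description: "SELECT * FROM orders LIMIT 10" },
  ...Array.from({ length: 10 }, () => ({
    op: "db.query",
    parent: "handler",
    description: "SELECT * FROM users WHERE id = $1",
  })),
];

const suspects = findRepeatedQueries(spans);
console.log(suspects);
```

Grouping on the parameterized query text is what makes this work: ten lookups with different IDs still share one description, so they collapse into a single group of ten.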
Span waterfalls explained goes deeper into reading and interpreting them.
When investigating slow APIs, always check for parallel vs. sequential spans. If two independent database queries run one after the other, see if they can run in parallel. For two queries of similar duration, that can cut the time spent on them roughly in half.
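As a rough illustration of the difference, the sketch below times two independent operations run sequentially and then with Promise.all. The functions fetchOrder and fetchRecommendations are hypothetical stand-ins that simply sleep for 50 ms, so the numbers show the shape of the saving rather than real query timings.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical stand-ins for two independent queries.
async function fetchOrder() {
  await sleep(50);
  return { id: 1 };
}
async function fetchRecommendations() {
  await sleep(50);
  return ["a", "b"];
}

// Run a function and report how long it took.
async function timed(label, fn) {
  const start = performance.now();
  const result = await fn();
  const elapsed = performance.now() - start;
  console.log(`${label}: ${elapsed.toFixed(0)} ms`);
  return { result, elapsed };
}

// Two spans laid end to end in the waterfall.
async function sequential() {
  const order = await fetchOrder();
  const recs = await fetchRecommendations();
  return { order, recs };
}

// Two spans stacked on top of each other in the waterfall.
async function parallel() {
  const [order, recs] = await Promise.all([fetchOrder(), fetchRecommendations()]);
  return { order, recs };
}

const run = (async () => {
  const seq = await timed("sequential", sequential);
  const par = await timed("parallel", parallel);
  return { seq, par };
})();
```

This only applies when the second operation does not need the first one's result. If it does, the spans are sequential by necessity and the fix lies elsewhere, for example in a join or a cache.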
Why distributed tracing matters for distributed systems
In a monolith, all your code runs in one process. A slow request shows up in one slow endpoint, and you read the logs on one server. You don't need tracing; you just need good logging.
But the moment you split into microservices — or even if you have a frontend calling a backend API — a single user action creates multiple independent requests. Logs from each service arrive separately, with different formats, different timestamps (even if they're in UTC, clock skew is real). Stitching them together by hand is like solving a jigsaw puzzle in the dark.
Distributed tracing solves this by saying: every operation in this system belongs to exactly one trace. From the moment a user clicks a button to the moment the response lands in their browser, all the work — whether it happens in a queue worker, a cache, a database, or a third-party API — carries the same trace ID. When something goes wrong, you don't hunt through five log files. You open the trace, see the waterfall, and immediately spot the slow service or the timeout.
The result? Teams with distributed tracing can often resolve production incidents much faster than teams without it. Not because they're smarter, but because they spend less time guessing where the problem is.
Distributed tracing and error tracking
Error tracking and distributed tracing are closely related. When an error occurs during a trace, the error report includes the trace ID. LightTrace shows you the full span waterfall that led up to the crash, so you see not just that a function threw an exception, but which service called it, how long the request had been running, and which operations succeeded before the failure.
If a payment endpoint errors 400ms into a trace, you can see: "The user service responded fine (12ms), but the payment API timed out (400ms). That's why we failed." Without the trace context, you'd see only: "Payment API call failed" — and you'd be none the wiser about the timing or the sequence.
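The link between an error and its trace is just the shared trace ID. As a sketch of that lookup, the snippet below takes a hypothetical error event and lists the spans from the same trace in order. The record shapes, IDs, and timings are assumptions made up to mirror the payment example above.

```javascript
// Hypothetical spans from two different traces.
const spans = [
  { traceId: "abc123", name: "user service", start: 0, end: 12, status: "ok" },
  { traceId: "abc123", name: "payment API", start: 12, end: 412, status: "timeout" },
  { traceId: "def456", name: "user service", start: 0, end: 9, status: "ok" },
];

// Hypothetical error event carrying the trace ID of the failed request.
const errorEvent = { message: "Payment API call failed", traceId: "abc123" };

// List what happened in the failing request, in start order.
function describeFailure(event, allSpans) {
  return allSpans
    .filter((span) => span.traceId === event.traceId)
    .sort((a, b) => a.start - b.start)
    .map((span) => `${span.name}: ${span.end - span.start} ms (${span.status})`);
}

const timeline = describeFailure(errorEvent, spans);
console.log(timeline.join("\n"));
```

The error message alone says that the payment call failed. The timeline adds that the user service had already answered and that the payment call ran for 400 ms before giving up.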
Together, error tracking and distributed tracing form a complete picture of what went wrong and why.
Cross-project tracing for complex architectures
In larger organizations, a single user request might span multiple independent services or even multiple teams' projects. If the checkout flow lives in one project, inventory in another, and payment in a third, how does tracing work across projects?
The answer is context propagation — the same mechanism that works within a service works across services. When checkout service calls inventory service, it puts the trace ID in an HTTP header. Inventory service reads that header, creates child spans, and sends them to LightTrace with the same trace ID. All the spans land in one trace, even though they were reported by different projects.
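Conceptually, stitching across projects is a group-by on the trace ID. The sketch below shows that idea with hypothetical span records reported by three projects. It is not a description of how LightTrace implements the feature, and the field names are assumptions.

```javascript
// Hypothetical spans as reported by three separate projects.
const reported = [
  { project: "checkout", traceId: "t-42", spanId: "a1", parentSpanId: null, name: "POST /checkout" },
  { project: "inventory", traceId: "t-42", spanId: "b1", parentSpanId: "a1", name: "POST /inventory/reserve" },
  { project: "payment", traceId: "t-42", spanId: "c1", parentSpanId: "a1", name: "POST /payments" },
  { project: "checkout", traceId: "t-43", spanId: "a2", parentSpanId: null, name: "POST /checkout" },
];

// Group spans by trace ID, ignoring which project sent them.
function groupByTrace(spans) {
  const traces = new Map();
  for (const span of spans) {
    if (!traces.has(span.traceId)) traces.set(span.traceId, []);
    traces.get(span.traceId).push(span);
  }
  return traces;
}

const traces = groupByTrace(reported);
const projectsInTrace = [...new Set(traces.get("t-42").map((span) => span.project))];
console.log(projectsInTrace); // [ 'checkout', 'inventory', 'payment' ]
```

The project name never takes part in the grouping, which is why team or repository boundaries do not split the trace as long as the header is passed along.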
LightTrace's cross-project tracing feature stitches these together into a single waterfall, so you see the complete journey across your entire system, regardless of team boundaries or infrastructure.
Getting started: What you need
To instrument your app for tracing, you need:
- An SDK — LightTrace works with the standard Sentry SDKs for Node.js, Python, Java, Go, and most other languages.
- A DSN — a URL that tells the SDK where to send events (including traces). LightTrace gives you one when you create a project.
- A sample rate — how many traces to record. Set it to 1.0 initially to record everything, then dial it down once you're confident the setup is working.
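To show what a sample rate means in practice, here is a minimal sketch of head-based sampling: the decision is made once when a trace starts, and downstream services inherit it so a trace is kept or dropped as a whole. The function shouldSample and its parameters are hypothetical, and real SDKs may add more rules on top.

```javascript
// Decide whether to record a trace.
// - parentSampled: the upstream decision, if this request continues a trace
// - sampleRate: fraction of new traces to keep, between 0 and 1
// - random: injectable for testing, defaults to Math.random
function shouldSample({ sampleRate, parentSampled, random = Math.random }) {
  if (parentSampled !== undefined) return parentSampled; // inherit upstream decision
  return random() < sampleRate;
}

// A rate of 1.0 keeps every new trace; 0 keeps none.
console.log(shouldSample({ sampleRate: 1.0 })); // true
console.log(shouldSample({ sampleRate: 0 })); // false

// A downstream service follows the caller's decision regardless of its own rate.
console.log(shouldSample({ sampleRate: 1.0, parentSampled: false })); // false

// Rough check of a 25% rate over many simulated requests.
let kept = 0;
for (let i = 0; i < 10000; i += 1) {
  if (shouldSample({ sampleRate: 0.25 })) kept += 1;
}
console.log(`kept ${kept} of 10000 traces`);
```

Inheriting the upstream decision is what prevents half-recorded traces, where one service keeps its spans and the next one drops them.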
The whole setup takes minutes. Initialize the SDK, set the sample rate, and deploy. Traces start arriving immediately.
For deeper guidance, see our tutorials on adding error tracking to any app — the same setup captures errors and traces. Distributed tracing and error tracking are two sides of the same coin.
Conclusion
Distributed tracing turns the black box of a microservices architecture into something you can see and debug. It answers the question "where did this request slow down?" not after hours of log hunting, but in seconds, visually. Every span of work carries a trace ID, and that single identifier lets you reconstruct the entire journey, from the user's click to the database query at the end of the chain.
In production systems with multiple services, distributed tracing isn't optional. It's the difference between solving an outage in 30 minutes and spending three hours guessing which service to blame. Start tracing, and the next time a user reports something feels slow, you'll have answers before they finish typing the support ticket.
Start tracking errors in minutes
Start capturing distributed traces across your services — point any Sentry SDK at LightTrace and ship with full visibility into your critical paths.
Understanding trace IDs vs span IDs goes deeper into the mechanics of context propagation. For teams scaling their incident response, reducing API latency shows how to use traces to systematically eliminate bottlenecks.