When you're optimizing an API or a backend service, two metrics always come up: throughput and latency. They sound related—both measure performance—but they answer completely different questions. Throughput tells you how many requests your system can handle; latency tells you how fast each one responds. Many teams optimize for one at the expense of the other, and their users feel it. Understanding the difference and measuring both is what separates production systems that scale smoothly from ones that melt under load.
This guide breaks down throughput and latency, shows you why they trade off, and explains how to use both metrics to build systems that actually perform the way your users expect.
Understanding throughput — requests per second
Throughput is the volume of work your system can complete in a fixed time. We measure it in requests per second (RPS) or operations per minute (OPM), depending on the context.
A typical question: Can this API handle 10,000 requests per second? That's a throughput question.
Throughput is often a hardware constraint. A single database server has a maximum number of queries it can execute per second before it starts queueing them. A single web server running on a t3.small has a finite number of concurrent connections it can hold. If you want higher throughput, you typically need more machines, more processing power, or more efficient code.
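// High throughput: a trivial handler does almost no work per request
import express from "express";

const app = express();

app.get("/ping", (req, res) => {
  res.json({ status: "ok" });
});
// Can sustain tens of thousands of RPS on modern hardware, depending on runtime and core count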
Throughput is seductive because it scales predictably. If one server handles 1,000 RPS, ten servers handle roughly 10,000 RPS, as long as a shared dependency such as the database doesn't become the bottleneck first. That's why it's easy to reason about: throw more hardware at the problem.
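The scaling arithmetic above can be sketched with Little's Law (requests in flight = throughput × latency). The function names, the 25% headroom default, and the sample numbers are illustrative assumptions, not measurements from a real system.

```javascript
// Little's Law: inFlight = throughput * latency, so the throughput ceiling
// is the maximum number of in-flight requests divided by average latency.
function maxThroughputRps(maxConcurrent, avgLatencyMs) {
  return (maxConcurrent * 1000) / avgLatencyMs;
}

// Servers needed for a target load while keeping capacity in reserve.
// headroom = 0.25 means each server is planned to run at 75% of its ceiling.
function serversNeeded(targetRps, perServerRps, headroom = 0.25) {
  return Math.ceil(targetRps / (perServerRps * (1 - headroom)));
}

// Hypothetical server: 50 concurrent requests at 50 ms each.
const perServerRps = maxThroughputRps(50, 50);
const fleetSize = serversNeeded(10000, perServerRps);
console.log({ perServerRps, fleetSize });
```

With these assumed numbers one server tops out at 1,000 RPS, and a 10,000 RPS target needs 14 servers rather than 10 once you reserve headroom.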
Understanding latency — time per request
Latency is the time it takes a single request to complete, measured in milliseconds. It's the question a user asks implicitly every time they wait: Why is this taking so long?
A typical question: Why does this request take 500 milliseconds when it should take 100? That's a latency question.
Latency is what users feel. An API endpoint serving 10,000 RPS but responding in 5 seconds to each user is useless. A service handling 100 RPS but responding in 10 milliseconds creates a snappy, responsive product. Latency determines the perceived speed of your application.
// Low latency: response time is measured in milliseconds
// `app` is the same Express app as in the /ping example; `db` is assumed to be a promise-based client
app.get("/user/:id", async (req, res) => {
  const user = await db.query("SELECT * FROM users WHERE id = ?", [req.params.id]);
  res.json(user);
});
// Ideally responds in <50ms from request received to response sent
Latency is tricky because it's not uniform. The same endpoint might respond in 20 milliseconds most of the time but occasionally spike to 2 seconds if a downstream service is slow, the database is under load, or the network is congested. This is why percentiles matter: P95 and P99 latency reveal the worst experiences, which an average hides.
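Here's a minimal sketch of how percentiles expose a tail that the mean hides. It uses the nearest-rank method, which is one of several percentile definitions, and the latency samples are made up for illustration.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical latencies in ms: 94 fast requests, 3 slow, 3 very slow.
const latencies = [
  ...Array(94).fill(20),
  ...Array(3).fill(400),
  ...Array(3).fill(2000),
];
const mean = latencies.reduce((sum, ms) => sum + ms, 0) / latencies.length;

console.log({
  mean,
  p50: percentile(latencies, 50),
  p95: percentile(latencies, 95),
  p99: percentile(latencies, 99),
});
```

In this sample the mean is about 91 ms and the median is 20 ms, while P95 is 400 ms and P99 is 2,000 ms. Only the percentiles show that some users wait two seconds.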
Throughput vs latency—why both matter
Here's the critical insight: optimizing for throughput does not guarantee low latency, and optimizing for latency does not guarantee high throughput.
Consider two scenarios:
Scenario 1: High throughput, high latency. A batch-processing service accepts 50,000 requests at once and works through them from a queue. Throughput is impressive: all 50,000 finish in 10 seconds, or 5,000 per second. But each request waits in the queue before its processing starts, so the request at the back of the queue sees close to 10 seconds of latency. Users see slow responses.
Scenario 2: Low throughput, low latency. An endpoint is optimized to respond in 5 milliseconds but can only handle 100 requests per second. If traffic exceeds 100 RPS, the extra requests queue up and never get that 5-millisecond response time. Under load, latency spikes.
The real world is a balance. You need:
- Enough throughput to handle your peak load without queueing requests indefinitely.
- Low latency so users don't wait, even at high throughput.
The sweet spot is predictable latency under load. You want low P95 and P99 latency even when running at 80% of your throughput capacity. That's the measure of a well-tuned system.
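One way to see why latency climbs as you approach capacity is the textbook M/M/1 queueing model (a single server with random arrivals and random service times). Real services deviate from this model, and it predicts mean response time rather than P95 or P99, so treat the sketch below as an illustration of the shape of the curve, not a prediction for your system. The 10 ms service time is an assumption.

```javascript
// Mean response time in an M/M/1 queue: serviceTime / (1 - utilization).
function mm1ResponseTimeMs(serviceTimeMs, utilization) {
  if (utilization < 0 || utilization >= 1) {
    throw new RangeError("utilization must be in [0, 1)");
  }
  return serviceTimeMs / (1 - utilization);
}

for (const utilization of [0.5, 0.8, 0.95]) {
  const ms = mm1ResponseTimeMs(10, utilization);
  console.log(`utilization ${utilization}: ~${ms.toFixed(0)} ms mean response time`);
}
```

With a 10 ms service time, the model gives roughly 20 ms at 50% utilization, 50 ms at 80%, and 200 ms at 95%. The last few percent of capacity are the expensive ones.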
The throughput-latency tradeoff
In practice, there's often a tradeoff. Common optimizations illustrate it:
Batching requests increases throughput but adds latency. A caching layer that flushes 100 updates in a single batch transaction instead of 100 individual ones gets more work done per round trip, but the first update in each batch waits for the other 99 to arrive before anything is committed. Batching only works if you can tolerate that delay.
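A common way to bound the delay that batching adds is to flush on whichever comes first: a full batch or a maximum wait. The sketch below assumes a hypothetical `flush` callback (for example, one bulk write); the class name and option names are illustrative, not from any particular library.

```javascript
// Collects items and flushes them together when the batch is full or when
// the oldest item has waited maxWaitMs, whichever comes first.
class MicroBatcher {
  constructor({ maxSize, maxWaitMs, flush }) {
    this.maxSize = maxSize;
    this.maxWaitMs = maxWaitMs;
    this.flushFn = flush; // hypothetical callback, e.g. one bulk write
    this.items = [];
    this.timer = null;
  }

  add(item) {
    this.items.push(item);
    if (this.items.length >= this.maxSize) {
      this.flush();
    } else if (this.timer === null) {
      // Start the clock when the first item of a new batch arrives.
      this.timer = setTimeout(() => this.flush(), this.maxWaitMs);
    }
  }

  flush() {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.items.length === 0) return;
    const batch = this.items;
    this.items = [];
    this.flushFn(batch);
  }
}

const flushed = [];
const batcher = new MicroBatcher({
  maxSize: 3,
  maxWaitMs: 50,
  flush: (batch) => flushed.push(batch),
});
["a", "b", "c", "d"].forEach((update) => batcher.add(update));
batcher.flush(); // drain the partial batch instead of waiting for the timer
console.log(flushed);
```

Here `maxWaitMs` is the latency you agree to pay: no update waits longer than that for its batch, even when traffic is too light to fill one.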
Connection pooling reduces latency by reusing database connections (no handshake overhead on each request) and increases throughput because the time saved on connection setup goes to running queries. This is a win-win, but optimizations that improve both metrics at once are rare.
Queueing lets you handle burst traffic (high throughput), but queued requests wait (high latency). A message queue can buffer 100,000 requests, but each one spends time waiting. If that's acceptable (background jobs, analytics), great. If users are waiting for a response, queueing makes things worse.
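A quick back-of-the-envelope check tells you whether a queue is acceptable for a given workload. This sketch assumes FIFO ordering and a steady drain rate; the function names and traffic numbers are hypothetical.

```javascript
// Backlog that builds up when arrivals exceed capacity for a sustained burst.
function backlogAfterBurst(arrivalRps, capacityRps, burstSeconds) {
  return Math.max(0, (arrivalRps - capacityRps) * burstSeconds);
}

// Time the newest request spends waiting behind the backlog,
// assuming FIFO order and a steady drain rate.
function queueWaitSeconds(backlog, drainRps) {
  if (drainRps <= 0) return Infinity;
  return backlog / drainRps;
}

// Hypothetical burst: 1,500 RPS arriving for 60 s at a service that handles 1,000 RPS.
const backlog = backlogAfterBurst(1500, 1000, 60);
const waitSeconds = queueWaitSeconds(backlog, 1000);
console.log({ backlog, waitSeconds });
```

Under these assumptions the burst leaves 30,000 requests queued and the last one waits 30 seconds. That's fine for a background job and unacceptable for a user staring at a spinner.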
The key is knowing which tradeoff is right for your use case. For a payment API, you need low latency (users are waiting); acceptable throughput is whatever handles your peak. For a bulk data import, you can sacrifice latency for throughput because the work is asynchronous.
Measuring both in production
You can't optimize what you don't measure. To track throughput vs latency, instrument your system with distributed tracing, which shows both metrics in context. Teams often focus on web performance monitoring for frontend metrics, but backend throughput and latency are equally critical.
Here's a minimal setup with the Sentry SDK to capture transaction data:
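import * as Sentry from "@sentry/node";
import express from "express";

const app = express();

Sentry.init({
  dsn: "https://<key>@your-lighttrace-host/1",
  tracesSampleRate: 1.0, // traces every request; lower this if volume or cost becomes a concern
  environment: "production",
});
// Handlers API as in @sentry/node v7; newer SDK majors replace it, so check your version's docs
app.use(Sentry.Handlers.requestHandler());
// tracingHandler creates a transaction per request: throughput (count) and latency (duration)
app.use(Sentry.Handlers.tracingHandler());
// ...register your routes here...
// The error handler must come after all routes
app.use(Sentry.Handlers.errorHandler());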
Once you're capturing traces, you'll see:
- Throughput: Total transaction count per time window (transactions per second).
- Latency: Duration of each transaction (p50, p75, p95, p99).
A production dashboard should show both, side by side. When latency spikes while throughput stays flat, you have a per-request slowdown (investigate specific queries or external calls). When throughput drops while requests pile up, you have a bottleneck. When both drop together, you probably have an outage.
A transaction in tracing is a unit of work you want to measure, typically an HTTP request, a background job, or a database operation. Each transaction has a start time, an end time, and a duration (latency). The number of transactions completed in a time window, divided by the length of that window, is throughput.
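To make that relationship concrete, here's a small sketch that derives both metrics from the same raw transaction records. The `{ startMs, endMs }` record shape and the function name are assumptions for illustration, not the format any tracing backend actually stores.

```javascript
// Each transaction is { startMs, endMs }; this shape is hypothetical.
function summarizeWindow(transactions, windowStartMs, windowMs) {
  const windowEndMs = windowStartMs + windowMs;
  // Count a transaction in the window where it finished.
  const durations = transactions
    .filter((t) => t.endMs >= windowStartMs && t.endMs < windowEndMs)
    .map((t) => t.endMs - t.startMs)
    .sort((a, b) => a - b);
  // Nearest-rank percentile over the durations in this window.
  const nearestRank = (p) =>
    durations.length === 0
      ? null
      : durations[Math.max(Math.ceil((p * durations.length) / 100), 1) - 1];
  return {
    throughputRps: durations.length / (windowMs / 1000),
    p50Ms: nearestRank(50),
    p95Ms: nearestRank(95),
  };
}

const sample = [
  { startMs: 0, endMs: 20 },
  { startMs: 100, endMs: 130 },
  { startMs: 400, endMs: 900 },
  { startMs: 950, endMs: 1010 }, // finishes in the next window
];
const summary = summarizeWindow(sample, 0, 1000);
console.log(summary);
```

Three transactions finish inside the one-second window, so throughput is 3 per second, while the durations of those same three give the latency percentiles.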
How to optimize for what your users care about
Different systems have different bottlenecks. Here's a framework:
If throughput is the problem (you're hitting capacity at normal latency):
- Add caching (reduce database load, increase throughput).
- Add more replicas or horizontal scaling.
- Batch related operations (fewer round trips, higher throughput per server).
- Use connection pooling or connection multiplexing.
If latency is the problem (requests are slow even at low load):
- Trace individual requests to find the slow span—is it a database query, an external API call, or N+1 queries? (Span waterfalls make this obvious.)
- Optimize the slow operation: add an index, cache the result, or replace the external call.
- Reduce the number of round trips between services. Our playbook on reducing API latency gives a systematic approach.
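If you don't have tracing in place yet, you can approximate the "find the slow span" step by hand. This is a minimal sketch of a span recorder, not how any tracing SDK is implemented; the names are hypothetical, and the clock is injectable so the example runs deterministically with a fake clock.

```javascript
// Records named spans for one request. `now` is injectable so the sketch can
// run against a fake clock; by default it uses performance.now().
function createSpanRecorder(now = () => performance.now()) {
  const spans = [];
  return {
    spans,
    start(name) {
      const startedAt = now();
      // Call the returned function when the operation finishes.
      return () => spans.push({ name, durationMs: now() - startedAt });
    },
    slowest() {
      return spans.reduce(
        (worst, span) =>
          worst === null || span.durationMs > worst.durationMs ? span : worst,
        null
      );
    },
  };
}

// Fake clock standing in for real work: a 180 ms query, then a 40 ms HTTP call.
let fakeTime = 0;
const recorder = createSpanRecorder(() => fakeTime);
const endDb = recorder.start("db.query");
fakeTime += 180;
endDb();
const endHttp = recorder.start("http.client");
fakeTime += 40;
endHttp();
const slowestSpan = recorder.slowest();
console.log(slowestSpan);
```

In a real handler you would call `recorder.start("db.query")` before the awaited operation and invoke the returned function right after it resolves. The slowest span is where optimization effort pays off first.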
If both are problems (high load and slow responses):
- You're hitting a hard limit. Identify the bottleneck with tracing and either optimize it or scale past it.
- Application performance monitoring (APM) tools like LightTrace show you P95/P99 latency alongside transaction rate, so you can see exactly when and where things degrade.
Common mistakes
The two biggest mistakes:
- Optimizing throughput and ignoring latency. You scale up to 100,000 RPS, but users wait 3 seconds. That's a failed optimization.
- Optimizing latency with no headroom. You tune a single request down to 10 milliseconds, but you can only do 100 RPS. At 101 RPS, requests queue up and latency explodes because there's no buffer. You need enough throughput to handle spikes without queueing.
The fix: measure both, set targets for both, and test under realistic load. Is your P95 latency 100 milliseconds when you're at 50% throughput capacity? Good. What about at 80% capacity? At 95%? These numbers reveal whether your system degrades gracefully or falls off a cliff.
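Once you have load-test results at several capacity levels, checking them against a target is simple to automate. The result shape, the function name, and every number below are made up for illustration.

```javascript
// Load-test results per capacity level; all numbers are invented.
const loadTestResults = [
  { loadPct: 50, p95Ms: 100 },
  { loadPct: 80, p95Ms: 140 },
  { loadPct: 95, p95Ms: 900 },
];

// Returns the lowest load level at which P95 exceeds the target,
// or null if every level stays within it.
function firstBreach(results, targetP95Ms) {
  const sorted = [...results].sort((a, b) => a.loadPct - b.loadPct);
  const breach = sorted.find((r) => r.p95Ms > targetP95Ms);
  return breach ? breach.loadPct : null;
}

const breachAt = firstBreach(loadTestResults, 200);
console.log(breachAt === null ? "target met at all levels" : `P95 target breached at ${breachAt}% load`);
```

In this invented data set the system holds a 200 ms P95 target through 80% load and breaks it at 95%, which tells you where your real operating ceiling is.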
Start tracking errors in minutes
Track both metrics with LightTrace's transaction monitoring—measure throughput and latency in real time and see where every millisecond goes, free up to 5,000 events a month.
Throughput and latency are not the same, and they rarely optimize together. The systems that feel fast do both well: they handle the traffic and respond quickly. Start measuring both, identify which one is your real constraint, and optimize there first. The result is a system that scales smoothly and users who don't wait.