Designing Load Shedding and Backpressure for APIs
Learn how to protect APIs during overload with admission control, bounded queues, backpressure signals, and clear degradation rules.
Introduction
Every production API has a capacity limit. The limit may be CPU, database connections, queue workers, third-party quota, memory, lock contention, or the number of requests a downstream dependency can handle before latency rises sharply. When traffic exceeds that limit, the worst response is often to accept everything and hope the system catches up.
Overload is not only a traffic spike problem. A slow database can make normal traffic look like a spike because requests stay in flight longer. A queue consumer that falls behind can turn minutes of work into hours of backlog. A retry storm can multiply traffic at the exact moment a dependency is least able to serve it.
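The slow-database case follows from Little's Law: the average number of requests in flight equals the arrival rate multiplied by the average time each request spends in the system. A minimal sketch, using made-up traffic and latency numbers that are assumptions for illustration only:

```javascript
// Little's Law: average in-flight requests = arrival rate x average latency.
// The 200 requests per second and the latency values are illustrative assumptions.
function expectedInFlight(requestsPerSecond, averageLatencyMs) {
  return (requestsPerSecond * averageLatencyMs) / 1000;
}

const healthy = expectedInFlight(200, 50); // same traffic, 50 ms per request
const slowDatabase = expectedInFlight(200, 1000); // same traffic, 1 s per request

console.log({ healthy, slowDatabase });
```

Traffic is unchanged in both cases, yet the slow case holds twenty times as many concurrent requests. A request-rate limit alone would not notice the difference.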
Load shedding and backpressure are the controls that keep overload contained. Load shedding rejects or degrades lower-priority work before it consumes scarce capacity. Backpressure tells callers, workers, or producers to slow down instead of pushing unlimited work into a saturated system.
This article walks through practical API patterns for overload protection: choosing the right signal, rejecting early, bounding queues, giving useful client guidance, and testing the failure modes before they become incidents.
Start with the Capacity You Need to Protect
Load shedding only works when it is tied to a real bottleneck. A generic "server is busy" flag is usually too vague to tune. Start by naming the resource that becomes unsafe under pressure.
Common protected resources include:
- Request handler concurrency.
- Database connection pool availability.
- Worker queue depth and oldest job age.
- CPU or event loop delay.
- Memory pressure.
- Third-party API quota.
- Payment, email, or fulfillment dependencies with strict rate limits.
Each resource needs a different control. If the database pool is saturated, accepting more write requests makes latency worse. If the queue is growing but the HTTP layer is healthy, you may reject new job submissions while still serving reads. If a third-party quota is nearly exhausted, you may degrade optional enrichment while preserving core transactions.
Prefer leading indicators
The best overload signal appears before users see a total outage. Error rate is useful, but it is often late. Better early indicators include rising p95 latency, increasing in-flight requests, connection pool wait time, queue age, and retry volume.
A simple admission decision can combine a few local signals:
const overloadState = {
  inFlightRequests: 0,
  maxInFlightRequests: 400,
  eventLoopDelayMs: 0,
  maxEventLoopDelayMs: 120,
  databasePoolWaitMs: 0,
  maxDatabasePoolWaitMs: 80,
};

function isOverloaded() {
  return (
    overloadState.inFlightRequests >= overloadState.maxInFlightRequests ||
    overloadState.eventLoopDelayMs >= overloadState.maxEventLoopDelayMs ||
    overloadState.databasePoolWaitMs >= overloadState.maxDatabasePoolWaitMs
  );
}
This check is intentionally conservative: any single signal at or above its threshold marks the service as overloaded. The threshold values shown are placeholders, so tune them with production metrics and load tests. The important part is that admission control is based on conditions that predict saturation, not only on failures after saturation has already happened.
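The snippet does not show where a value such as eventLoopDelayMs comes from. In Node.js, one option is the built-in perf_hooks event loop delay histogram. The sketch below is an assumption about how it might be wired up: the helper name, the one-second sampling interval, and the choice of the 99th percentile are all hypothetical.

```javascript
const { monitorEventLoopDelay } = require("node:perf_hooks");

// Hypothetical helper: periodically copies the p99 event loop delay into a
// state object shaped like overloadState. Names and defaults are assumptions.
function startEventLoopDelaySampler(state, { intervalMs = 1000, resolutionMs = 20 } = {}) {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
  histogram.enable();

  const sample = () => {
    // The histogram reports nanoseconds. Recorded values include the sampling
    // timer's own interval, so subtract the resolution and clamp at zero.
    const p99Ms = histogram.percentile(99) / 1e6;
    state.eventLoopDelayMs = Number.isFinite(p99Ms)
      ? Math.max(0, p99Ms - resolutionMs)
      : 0;
    histogram.reset(); // start a fresh window for the next sample
  };

  const timer = setInterval(sample, intervalMs);
  timer.unref(); // sampling alone should not keep the process alive

  return {
    sample,
    stop() {
      clearInterval(timer);
      histogram.disable();
    },
  };
}
```

Resetting the histogram after each sample keeps the signal recent, so the guard reacts to the last interval and not to the lifetime of the process.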
Reject Early with Admission Control
Once the system knows it is under pressure, it should reject work before expensive processing begins. Rejecting after parsing a large body, opening a database transaction, or calling another service wastes the capacity you are trying to protect.
In an HTTP API, an overload guard should run near the start of the request pipeline:
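function overloadGuard(req, res, next) {
  // Count the request as in flight and release the slot exactly once,
  // whether the response finishes or the connection closes early.
  overloadState.inFlightRequests += 1;
  let released = false;
  const release = () => {
    if (released) return;
    released = true;
    overloadState.inFlightRequests -= 1;
  };
  res.on("finish", release);
  res.on("close", release);

  // Admit the request when there is capacity or when the route is critical.
  if (!isOverloaded() || isCriticalRoute(req)) {
    return next();
  }

  // Shed everything else before any expensive work begins.
  res.setHeader("Retry-After", "5");
  res.setHeader("X-Degraded", "overload");
  return res.status(503).json({
    error: "service_overloaded",
    message: "The service is temporarily overloaded. Retry shortly.",
  });
}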
The critical-route exception should be narrow. Health checks, authentication refresh, payment confirmation, and incident-response endpoints may deserve priority. Normal product traffic should not all be labeled critical, or the control stops protecting anything.
Make shedding policy explicit
Useful shedding policies usually classify requests before the overload happens:
- Must serve: health checks, safety-critical callbacks, and already-paid order confirmation.
- Serve if possible: normal reads and writes.
- Degrade first: recommendations, analytics enrichment, non-essential personalization, preview generation.
- Reject first: exports, bulk imports, expensive searches, low-priority background triggers.
This classification is a product and operations decision, not only a code decision. A fast 503 on an export may be acceptable during overload. A silent failure on a payment callback is not.
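One way to keep that classification explicit is to store it as data and not scatter it across handlers. The sketch below is hypothetical: the route names, the tier constants, and the three-step pressure scale are assumptions chosen to mirror the four tiers listed above.

```javascript
// Hypothetical priority tiers matching the four classes in the list above.
const PRIORITY = {
  MUST_SERVE: 0,
  SERVE_IF_POSSIBLE: 1,
  DEGRADE_FIRST: 2,
  REJECT_FIRST: 3,
};

// Illustrative routes only; a real table would come from product and operations review.
const routePriorities = new Map([
  ["GET /healthz", PRIORITY.MUST_SERVE],
  ["POST /payments/callback", PRIORITY.MUST_SERVE],
  ["GET /recommendations", PRIORITY.DEGRADE_FIRST],
  ["POST /exports", PRIORITY.REJECT_FIRST],
]);

// Unlisted routes default to normal traffic.
function classify(method, path) {
  return routePriorities.get(`${method} ${path}`) ?? PRIORITY.SERVE_IF_POSSIBLE;
}

// pressureLevel is an assumed scale: 0 = normal, 1 = elevated, 2 = severe.
function decide(method, path, pressureLevel) {
  const priority = classify(method, path);
  if (priority === PRIORITY.MUST_SERVE || pressureLevel === 0) return "serve";
  if (priority === PRIORITY.REJECT_FIRST) return "reject";
  if (priority === PRIORITY.DEGRADE_FIRST) return "degrade";
  // Normal reads and writes are shed only under severe pressure.
  return pressureLevel >= 2 ? "reject" : "serve";
}
```

Because the table is plain data, it can be reviewed by the people who own the product decision and covered by a unit test per tier.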
Use Backpressure Instead of Infinite Queues
Queues are useful because they decouple producers from workers. They are dangerous when they hide overload. An unbounded queue can accept hours of work that the system has no realistic chance of completing on time.
Backpressure starts by defining queue limits:
- Maximum queue length.
- Maximum oldest-message age.
- Maximum jobs per tenant or producer.
- Maximum worker concurrency.
- Deadline after which a job is no longer useful.
When a queue crosses the limit, producers should slow down, retry later, or receive a clear rejection.
async function enqueueExportJob(req, res) {
  const queueStats = await exportsQueue.stats({
    tenantId: req.user.tenantId,
  });

  // Reject before enqueueing when the queue is too deep or too old.
  // The check and the enqueue are separate calls, so concurrent requests
  // can overshoot slightly; treat these limits as approximate.
  if (queueStats.depth >= 500 || queueStats.oldestAgeSeconds >= 900) {
    res.setHeader("Retry-After", "120");
    return res.status(503).json({
      error: "export_queue_saturated",
      message: "Export capacity is temporarily full. Retry later.",
    });
  }

  const job = await exportsQueue.enqueue({
    tenantId: req.user.tenantId,
    requestedBy: req.user.id,
    reportType: req.body.reportType,
    // The export is only useful for 30 minutes after it was requested.
    deadlineAt: new Date(Date.now() + 30 * 60 * 1000).toISOString(),
  });

  return res.status(202).json({
    jobId: job.id,
    statusUrl: `/exports/${job.id}`,
  });
}
The queue is still useful here. It absorbs short bursts and lets workers process at a steady rate. The difference is that the API refuses to turn a temporary burst into an unbounded backlog.
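The same idea can be shown without any infrastructure. The class below is a minimal in-memory sketch, not a replacement for a real broker: the class name, the limits, and the rejection reasons are assumptions. It enforces two of the limits listed earlier, total depth and jobs per tenant, by refusing work once a limit is reached.

```javascript
// Minimal in-memory sketch; a production queue would live in a broker or database.
class BoundedQueue {
  constructor({ maxDepth, maxPerTenant }) {
    this.maxDepth = maxDepth;
    this.maxPerTenant = maxPerTenant;
    this.jobs = [];
    this.perTenant = new Map(); // tenantId -> number of queued jobs
  }

  // Returns { accepted: true } or { accepted: false, reason } so the caller
  // can translate a refusal into a clear rejection for the producer.
  tryEnqueue(job) {
    if (this.jobs.length >= this.maxDepth) {
      return { accepted: false, reason: "queue_full" };
    }
    const tenantCount = this.perTenant.get(job.tenantId) ?? 0;
    if (tenantCount >= this.maxPerTenant) {
      return { accepted: false, reason: "tenant_limit" };
    }
    this.jobs.push(job);
    this.perTenant.set(job.tenantId, tenantCount + 1);
    return { accepted: true };
  }

  // Removes the oldest job and frees its tenant slot.
  dequeue() {
    const job = this.jobs.shift();
    if (job) {
      this.perTenant.set(job.tenantId, this.perTenant.get(job.tenantId) - 1);
    }
    return job;
  }
}
```

The per-tenant cap keeps one noisy producer from filling the whole queue and starving everyone else.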
Bound worker concurrency
Workers also need backpressure. Starting too many jobs at once can overload the same database, storage service, or third-party API the queue was meant to protect.
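async function workerLoop(queue, { maxConcurrency }) {
  const active = new Set();
  while (true) {
    // Claim jobs until the concurrency limit is reached or the queue is empty.
    while (active.size < maxConcurrency) {
      const job = await queue.claimNext();
      if (!job) {
        break;
      }
      const task = processJobWithDeadline(job)
        .catch((error) => {
          // Log and swallow the failure so one bad job cannot reject the
          // Promise.race below and stop the whole loop.
          console.error("Job failed:", error);
        })
        .finally(() => active.delete(task));
      active.add(task);
    }
    // Wait for a slot to free up, or poll the queue again after 250 ms.
    await Promise.race([
      ...active,
      new Promise((resolve) => setTimeout(resolve, 250)),
    ]);
  }
}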
Concurrency should be tuned against the bottleneck, not the number of CPU cores alone. A worker that mostly waits on a third-party API may need a different limit from a worker that compresses large files or performs CPU-heavy analysis.
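The worker loop calls processJobWithDeadline without showing it. One possible shape is sketched below under a different name, runJobWithDeadline, to mark it as hypothetical. It assumes the job carries the deadlineAt ISO timestamp set at enqueue time and that the handler is a callback which accepts an AbortSignal and rejects when aborted.

```javascript
// Milliseconds left before the job stops being useful (negative when expired).
function remainingMs(job, nowMs = Date.now()) {
  return new Date(job.deadlineAt).getTime() - nowMs;
}

// Hypothetical deadline-aware runner. `handler(job, signal)` is an assumed
// callback that should stop its work and reject when the signal aborts.
async function runJobWithDeadline(job, handler) {
  const budgetMs = remainingMs(job);
  if (budgetMs <= 0) {
    // Skip work nobody is waiting for; this frees capacity for fresher jobs.
    return { status: "expired" };
  }

  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), budgetMs);
  try {
    const result = await handler(job, controller.signal);
    return { status: "done", result };
  } finally {
    clearTimeout(timer); // avoid a dangling timer once the job settles
  }
}
```

Dropping expired jobs at claim time matters most during recovery, when a worker would otherwise spend its first minutes on work that has already lost its value.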
Give Clients Useful Signals
Clients handle overload better when the API response is machine-readable. Use stable status codes, headers, and error codes.
For synchronous APIs:
- Use 503 Service Unavailable for temporary overload.
- Include Retry-After when a retry delay is useful.
- Use 429 Too Many Requests for quota or rate-limit enforcement.
- Return a stable JSON error code such as service_overloaded.
- Avoid telling clients to retry when the operation is no longer useful.
For asynchronous APIs:
- Return 202 Accepted only when the job was actually accepted.
- Return 503 when the queue cannot accept more work.
- Provide a status endpoint for accepted jobs.
- Include a deadline or expiration when work has time-sensitive value.
Client-side retry code should respect these signals and preserve the user's deadline:
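async function fetchWithOverloadHandling(url, options = {}) {
  // Keep the custom option out of the object passed to fetch.
  const { totalTimeoutMs = 5000, ...fetchOptions } = options;
  const startedAt = Date.now();
  const maxRetries = 3;

  for (let attempt = 0; ; attempt += 1) {
    const response = await fetch(url, fetchOptions);
    // Return non-overload responses, and the last response once retries run out.
    if (![429, 503].includes(response.status) || attempt >= maxRetries) {
      return response;
    }

    // Retry-After may also be an HTTP-date. This sketch handles only the
    // seconds form and falls back to 250 ms for anything else.
    const retryAfterSeconds = Number(response.headers.get("Retry-After"));
    const retryAfterMs =
      Number.isFinite(retryAfterSeconds) && retryAfterSeconds > 0
        ? retryAfterSeconds * 1000
        : 250;
    const jitterMs = Math.floor(Math.random() * 250);
    const delayMs = retryAfterMs + jitterMs;

    // Stop when waiting would exceed the caller's total time budget.
    if (Date.now() - startedAt + delayMs >= totalTimeoutMs) {
      return response;
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}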
This client retries only overload-shaped responses, uses server guidance, adds jitter, and stops when the user's time budget is gone. It does not retry forever just because the server said the failure was temporary.
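One detail is easy to miss when reading Retry-After: HTTP allows the header to carry either a number of seconds or an HTTP-date. A client that only calls Number() on the value gets NaN for the date form. The helper below is a small sketch of parsing both forms; the function name and the choice to return null for unusable values are assumptions.

```javascript
// Parses a Retry-After header value into a delay in milliseconds.
// Returns null when the value is missing or unparseable, so the caller
// can choose its own default delay.
function parseRetryAfterMs(value, nowMs = Date.now()) {
  if (value == null) return null;
  const trimmed = value.trim();
  if (trimmed === "") return null;

  // Form 1: delay-seconds, a non-negative integer.
  if (/^\d+$/.test(trimmed)) {
    return Number(trimmed) * 1000;
  }

  // Form 2: HTTP-date, for example "Wed, 21 Oct 2015 07:28:00 GMT".
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return null;

  // A date in the past means the caller may retry immediately.
  return Math.max(0, dateMs - nowMs);
}
```

A caller might use `parseRetryAfterMs(response.headers.get("Retry-After")) ?? 250` to keep a short default when the server gives no usable guidance.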
Degrade Before Failing When It Helps
Not every overload response needs to be a hard rejection. Some systems can protect capacity by serving a cheaper version of the request.
Examples:
- Return cached search results instead of live ranking.
- Skip recommendation calls while returning the core product page.
- Disable non-essential webhooks during bulk imports.
- Return a smaller page size.
- Omit expensive aggregation fields from an analytics response.
Degradation is useful only when the client contract allows it. If the response looks normal but silently omits required work, you have created a correctness bug. Make degraded behavior visible with response fields, headers, logs, and metrics.
async function getProductPage(req, res) {
  const product = await productStore.get(req.params.productId);

  if (isOverloaded()) {
    const cachedRecommendations = await recommendationCache.get(product.id);
    return res.json({
      product,
      recommendations: cachedRecommendations ?? [],
      degraded: true,
      degradedReason: "recommendations_overload",
    });
  }

  const recommendations = await recommendationsService.forProduct(product.id);
  return res.json({
    product,
    recommendations,
    degraded: false,
  });
}
This pattern preserves the main page while making the missing freshness explicit. It also gives monitoring a signal that the service is surviving by degrading, not operating normally.
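The same visibility rule applies to the other degradations listed above, such as returning a smaller page size. A minimal sketch follows, assuming a hypothetical helper name and illustrative limits of 100 items normally and 20 under overload:

```javascript
// Hypothetical helper: clamp the page size under overload and report whether
// the caller received less than it asked for. The limits are assumptions.
function resolvePageSize(requested, overloaded, { normalMax = 100, degradedMax = 20 } = {}) {
  const limit = overloaded ? degradedMax : normalMax;
  const pageSize = Math.min(Math.max(1, requested), limit);
  // Only flag degradation when overload actually changed the result.
  const degraded = overloaded && requested > degradedMax;
  return { pageSize, degraded };
}
```

Returning the flag alongside the size lets the handler copy it into the response body or a header, so a client can tell a short page from the end of the data.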
Test Overload as a Feature
Load shedding and backpressure are reliability features. They need tests, dashboards, and runbooks.
Test at least these scenarios:
- The API rejects non-critical requests when in-flight count crosses the threshold.
- Critical routes still work while lower-priority routes shed load.
- Queue submissions return 503 when depth or oldest age crosses the limit.
- Accepted jobs include status information and deadlines.
- Clients honor Retry-After and stop retrying when their deadline expires.
- Degraded responses are marked and measured.
- Recovery clears overload state and normal traffic resumes.
The most valuable load test is not a maximum throughput benchmark. It is a saturation test that answers: when the bottleneck slows down, does the system fail fast, preserve critical work, and recover cleanly?
Operational dashboards should include:
- Requests shed by route, tenant, and reason.
- Queue depth and oldest job age.
- In-flight requests and handler latency.
- Database pool wait time.
- Event loop delay or CPU saturation.
- Retry volume from clients.
- Degraded responses by feature.
If a dashboard only shows total error rate, operators will not know whether the system is rejecting intentionally or failing accidentally. Load shedding should be visible as controlled behavior.
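To make shedding visible, each rejection needs to be counted with its route, tenant, and reason at the point where the decision is made. The sketch below uses a plain in-process Map so it stays self-contained. The function names are hypothetical, and a real service would more likely emit these as labeled counters through whatever metrics library it already uses.

```javascript
// Hypothetical in-process counters keyed by route, tenant, and reason.
const shedCounts = new Map();

function recordShed({ route, tenantId, reason }) {
  // JSON keeps the three labels separable even if a value contains punctuation.
  const key = JSON.stringify([route, tenantId, reason]);
  shedCounts.set(key, (shedCounts.get(key) ?? 0) + 1);
}

// Aggregates the counters by one label: 0 = route, 1 = tenant, 2 = reason.
function shedTotalsBy(labelIndex) {
  const totals = {};
  for (const [key, count] of shedCounts) {
    const label = JSON.parse(key)[labelIndex];
    totals[label] = (totals[label] ?? 0) + count;
  }
  return totals;
}
```

For example, a guard could call `recordShed({ route: "POST /exports", tenantId: "t1", reason: "export_queue_saturated" })` just before sending the 503, and a dashboard could chart `shedTotalsBy(2)` next to the total error rate.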
Conclusion and Next Steps
Load shedding and backpressure turn overload from a surprise into an explicit system behavior. Instead of accepting unlimited work, a resilient API protects the bottleneck, rejects early when needed, bounds queues, gives clients useful retry guidance, and degrades only when the contract supports it.
Start with one overloaded path you already understand. Pick the protected resource, define a leading signal, add an admission guard, cap the queue or concurrency, and measure every rejection. Then run a saturation test. The goal is not to avoid every 503; it is to make overload predictable enough that critical work survives and the system recovers quickly.