Taming Tail Latency with Hedged Requests
Learn how to cut p99 latency in APIs and AI agents with hedged requests: tuning the hedge delay, hedging only idempotent work, and capping the extra load.
Introduction
Average latency is a comforting lie. A service can report a 40 ms mean while a meaningful slice of requests takes 500 ms or more, and it is those slow requests, the tail, that users actually feel. The problem compounds the moment one operation depends on many others. A request that fans out to ten backends waits for the slowest of the ten, so a rare per-call delay becomes a common per-request delay. Tool-using AI agents feel this acutely: a single run issues many sequential model and tool calls, and one slow call per step adds up to a run that drags.
Retries do not fix this. A retry helps when a call fails, but a slow-yet-successful call never triggers one — you just wait. Hedged requests target exactly that gap. The idea, popularized by Google's "The Tail at Scale," is simple: send the request, and if it has not answered within a short delay, send a second copy to another replica. Take whichever responds first and cancel the rest. You trade a little extra load for a dramatically shorter tail.
This article shows how to implement request hedging correctly: how it works, how to pick the hedge delay, why you must only hedge idempotent work, and how to cap the extra load so hedging never amplifies an incident. The examples use TypeScript and apply to API gateways, backend services, and AI agent runtimes alike.
Why Tail Latency Dominates Real Systems
The tail matters because it is contagious under fan-out. Suppose each backend responds slowly (say, above 200 ms) only 1% of the time. That sounds negligible until a request has to wait on several of them. A request touching 10 such backends is slow whenever any of them is slow, which, if the backends are slow independently of one another, happens with probability 1 - 0.99^10 ≈ 9.6%. Touch 100 backends and the chance climbs to 1 - 0.99^100 ≈ 63%. A one-in-a-hundred event at the component level becomes the common case at the request level.
AI agents recreate this pattern in time rather than in parallel. An agent that runs twenty steps, each making one model call and one tool call, is a chain of forty dependencies. If each call has a small chance of landing in its own slow tail, the odds that at least one step stalls across the whole run are high, and the user watches the agent hang on that single unlucky call.
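The fan-out arithmetic above can be checked in a few lines. This is a minimal sketch, assuming every call is slow independently and with the same probability; probAnySlow is a hypothetical helper name, and the 1% figure applied to the forty-call agent chain is an assumption carried over from the backend example, not a measured value.

```typescript
// Probability that at least one of `calls` calls is slow, given that each
// call is slow with probability `pSlow`.
// Assumption: calls are independent and share the same pSlow.
function probAnySlow(pSlow: number, calls: number): number {
  return 1 - Math.pow(1 - pSlow, calls);
}

const fanOut10 = probAnySlow(0.01, 10);   // roughly 0.096
const fanOut100 = probAnySlow(0.01, 100); // roughly 0.634
const agent40 = probAnySlow(0.01, 40);    // roughly 0.331 (20 steps x 2 calls)

console.log(fanOut10.toFixed(3), fanOut100.toFixed(3), agent40.toFixed(3));
```

Under these assumptions, roughly one agent run in three hits at least one slow call, even though each individual call is slow only 1% of the time.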
This is why tuning the average is the wrong target. Shaving 5 ms off the median does nothing for the request that waited 800 ms on one straggling replica. Stragglers come from causes you cannot fully eliminate: garbage collection pauses, a cold cache, a noisy neighbor on shared hardware, a queue that briefly backed up, a slow disk seek. Hedging accepts that stragglers exist and routes around them instead of trying to prevent every one.
How Request Hedging Works
Hedging runs on a timer. Send the primary request and start a clock. If the primary answers before the hedge delay elapses — which it usually will — you never send a second request and there is no extra cost. If the delay passes with no answer, send a duplicate to a different replica and race them. The first success wins; the loser is cancelled.
The mechanics in TypeScript lean on AbortController to cancel the loser and Promise.any to resolve on the first success while ignoring a straggler that merely fails:
type Attempt<T> = (signal: AbortSignal) => Promise<T>;

async function hedgedCall<T>(attempt: Attempt<T>, hedgeDelayMs: number): Promise<T> {
  const controllers = new Set<AbortController>();
  let hedgeTimer: ReturnType<typeof setTimeout> | undefined;

  // Start one attempt with its own controller so it can be cancelled later.
  const launch = (): Promise<T> => {
    const controller = new AbortController();
    controllers.add(controller);
    return attempt(controller.signal);
  };

  // The hedge launches only if the timer fires before the call settles.
  const hedge = new Promise<T>((resolve, reject) => {
    hedgeTimer = setTimeout(() => launch().then(resolve, reject), hedgeDelayMs);
  });

  try {
    // Resolves on the first success; rejects only if every attempt fails.
    return await Promise.any([launch(), hedge]);
  } finally {
    if (hedgeTimer) clearTimeout(hedgeTimer);
    for (const controller of controllers) controller.abort();
  }
}
Two details make this correct. Promise.any resolves as soon as either attempt fulfills and rejects, with an AggregateError, only if both fail, so a fast primary failure does not short-circuit a hedge that would have succeeded. In that case the hedge still waits for the timer before launching, so it behaves like a delayed retry. The finally block both clears the timer (so the hedge never launches if the primary already answered) and aborts every controller (so the losing in-flight request is cancelled rather than left running and billing you for compute).
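Each attempt must honor the AbortSignal: pass it to fetch, your database driver, or the model SDK so a cancellation actually stops the work. Hedging without propagating cancellation still improves latency, but it doubles load on every hedge because the loser keeps running to completion. Note that the finally block aborts every controller, the winner's included, so an attempt should finish consuming its result before it resolves; with fetch, for example, read the response body inside the attempt rather than returning the bare Response, or the late abort can cut the body read short.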
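To make the cancellation requirement concrete, here is a minimal sketch of an attempt that forwards the signal to an HTTP call and reads the whole body before resolving. The names jsonAttempt, FetchLike, and ResponseLike are hypothetical; the fetch implementation is passed in as a parameter so the sketch does not depend on any particular runtime's typings.

```typescript
// Minimal structural types (assumptions) so the sketch needs no DOM typings.
type ResponseLike = { ok: boolean; status: number; json(): Promise<unknown> };
type FetchLike = (url: string, init: { signal: AbortSignal }) => Promise<ResponseLike>;
type Attempt<T> = (signal: AbortSignal) => Promise<T>;

// Builds an attempt that forwards the signal to the HTTP call and parses the
// body before resolving, so nothing is left to read after the attempt settles.
function jsonAttempt(url: string, fetchImpl: FetchLike): Attempt<unknown> {
  return async (signal) => {
    const res = await fetchImpl(url, { signal });
    // Treat non-2xx as a failure so the other attempt can still win the race.
    if (!res.ok) throw new Error(`HTTP ${res.status} from ${url}`);
    return await res.json();
  };
}
```

With the hedgedCall function above, this could be used as hedgedCall(jsonAttempt(url, fetch), hedgeDelayMs), assuming a runtime whose global fetch accepts an AbortSignal.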
Tune the Hedge Delay to Your Latency Curve
The hedge delay is the whole ballgame. Set it too low and you hedge almost every request, doubling load for little gain. Set it too high and slow requests finish before the hedge ever fires, so the tail stays fat. The sweet spot is a high percentile of your own latency distribution — commonly p95. At p95, roughly 95% of requests answer before the hedge fires and cost nothing extra, while the slowest 5% get a second chance.
That means you cannot hardcode a number; you have to measure it and let it move as your service does. A lightweight rolling window is enough to derive the delay at runtime:
class LatencyWindow {
  private readonly samples: number[] = [];

  constructor(private readonly capacity = 2048) {}

  record(ms: number): void {
    this.samples.push(ms);
    // Drop the oldest sample once the window is full.
    if (this.samples.length > this.capacity) this.samples.shift();
  }

  percentile(p: number): number {
    if (this.samples.length === 0) return 0;
    const sorted = [...this.samples].sort((a, b) => a - b);
    const index = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
    return sorted[index];
  }
}

const latencies = new LatencyWindow();
// After each completed primary call: latencies.record(elapsedMs)
const hedgeDelayMs = Math.max(latencies.percentile(95), 25);
Sorting on every read is fine for illustration but wasteful at scale; a production system should use a streaming estimator such as a t-digest or an HDR histogram. Whatever you use, keep the delay adaptive per route or per dependency. A fast cache lookup and a slow analytics query have completely different curves, and a single global delay will over-hedge one while under-hedging the other. The floor (25 ms above) prevents a temporarily quiet service from setting an absurdly small delay that hedges everything.
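One way to keep the delay adaptive per dependency is a small registry with one rolling window per name. The sketch below is illustrative only: HedgeDelays and RollingLatency are hypothetical names, the dependency names are made up, and it still sorts on read, so it is not the streaming estimator a production system would want.

```typescript
// Compact rolling window, equivalent in spirit to the window shown earlier.
class RollingLatency {
  private readonly samples: number[] = [];
  private readonly capacity: number;
  constructor(capacity: number) {
    this.capacity = capacity;
  }
  record(ms: number): void {
    this.samples.push(ms);
    if (this.samples.length > this.capacity) this.samples.shift();
  }
  percentile(p: number): number {
    if (this.samples.length === 0) return 0;
    const sorted = [...this.samples].sort((a, b) => a - b);
    const index = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
    return sorted[index];
  }
}

// Hypothetical registry: one window per dependency, created lazily.
class HedgeDelays {
  private readonly windows = new Map<string, RollingLatency>();
  private readonly floorMs: number;
  private readonly capacity: number;
  constructor(floorMs = 25, capacity = 2048) {
    this.floorMs = floorMs;
    this.capacity = capacity;
  }

  record(dependency: string, elapsedMs: number): void {
    let w = this.windows.get(dependency);
    if (!w) {
      w = new RollingLatency(this.capacity);
      this.windows.set(dependency, w);
    }
    w.record(elapsedMs);
  }

  // p95 of that dependency's recent latencies, never below the floor.
  delayFor(dependency: string): number {
    const w = this.windows.get(dependency);
    return Math.max(w ? w.percentile(95) : 0, this.floorMs);
  }
}

const delays = new HedgeDelays();
for (let i = 0; i < 50; i++) {
  delays.record("cache", 2);       // fast lookups
  delays.record("analytics", 400); // slow queries
}
console.log(delays.delayFor("cache"), delays.delayFor("analytics"));
```

The cache's delay rests on the floor while the analytics delay tracks its own curve, which is the separation a single global delay cannot give you.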
Only Hedge Idempotent Work
Hedging duplicates a request, so it is only safe when running the request twice is indistinguishable from running it once. Reads qualify: a database SELECT, a cache fetch, a search query, an object GET, and — importantly for agents — a model inference call, which has no external side effects. Writes do not. Hedging "charge the customer" or "send the email" or "create the ticket" can execute the side effect twice, and cancelling the loser does not help, because the remote system may already have committed the change before your abort arrived.
For an AI agent, this maps cleanly onto the tool boundary. Read-only tools and the model call itself are safe to hedge; write tools are not. A small guard keeps the two paths honest:
type ToolCall = {
  name: string;
  idempotent: boolean;
  run: (signal: AbortSignal) => Promise<unknown>;
};

async function callTool(tool: ToolCall, hedgeDelayMs: number): Promise<unknown> {
  if (!tool.idempotent) {
    // A write must run exactly once; hedging could duplicate the effect.
    const controller = new AbortController();
    return tool.run(controller.signal);
  }
  return hedgedCall(tool.run, hedgeDelayMs);
}
There is a middle ground for writes that carry an idempotency key. If both the primary and the hedge send the same key, a well-behaved downstream deduplicates them and the duplicate is harmless — this is the same guarantee that protects retries. If you already run agent write tools through idempotency keys or a command ledger, you can extend hedging to them deliberately. Absent that protection, treat every non-idempotent call as un-hedgeable and leave its tail to other tactics such as timeouts and circuit breakers.
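The idempotency-key approach can be sketched as follows. The point is that the key is chosen once, outside the attempt, so the primary and the hedge present the same key. Everything here is illustrative: withIdempotencyKey, KeyedWrite, FakeTicketService, and the key format are hypothetical, and the in-memory service merely stands in for a downstream that is assumed to deduplicate by key.

```typescript
type Attempt<T> = (signal: AbortSignal) => Promise<T>;
type KeyedWrite<T> = (idempotencyKey: string, signal: AbortSignal) => Promise<T>;

// Binds ONE key to a write so every attempt (primary or hedge) sends it.
// Assumption: the downstream deduplicates requests that share a key.
function withIdempotencyKey<T>(key: string, write: KeyedWrite<T>): Attempt<T> {
  return (signal) => write(key, signal);
}

// Hypothetical in-memory downstream, used only to illustrate deduplication.
class FakeTicketService {
  private readonly byKey = new Map<string, number>();
  private nextId = 1;

  create(key: string): number {
    const existing = this.byKey.get(key);
    if (existing !== undefined) return existing; // duplicate: return first result
    const id = this.nextId++;
    this.byKey.set(key, id);
    return id;
  }

  get ticketCount(): number {
    return this.byKey.size;
  }
}

const tickets = new FakeTicketService();
// The key is derived once per logical write (format is an assumption).
const createTicket = withIdempotencyKey<number>(
  "run-42:step-7",
  async (key) => tickets.create(key),
);
```

Passing createTicket to a hedging wrapper would then be safe only to the extent that the real downstream honors the key the way the fake one does.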
Cap the Cost with a Hedge Budget
Hedging is self-defeating if it fires too often. A service that hedges 40% of requests is sending up to 40% more requests downstream, and if it starts hedging because a dependency is already overloaded, those duplicates pour fuel on the fire. The guardrail is a hedge budget: allow only a small fraction of requests (often 1% to 5%) to spawn a hedge, and refuse the rest.
A token-style budget enforces the cap cheaply. Each request contributes a fraction of a token and each hedge costs a whole token, so the long-run hedge rate can never exceed the ratio you set. The balance is capped at one token, so a quiet period cannot bank a burst of hedges:
class HedgeBudget {
  private tokens = 0;

  constructor(private readonly maxHedgeRatio = 0.05) {}

  onRequest(): void {
    this.tokens = Math.min(1, this.tokens + this.maxHedgeRatio);
  }

  tryClaim(): boolean {
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
Call onRequest() for every request, and when the hedge delay expires, send the duplicate only if tryClaim() returns true, so a burst of slow requests cannot each spawn a duplicate. Combine the budget with load awareness: skip hedging entirely when a dependency's circuit breaker is open, when its queue is deep, or when you are already shedding load. Hedging assumes spare capacity on the replicas; when there is none, the correct move is to stop hedging, not to double down. The p95-based delay and the budget work together here: a healthy service rarely trips the delay, so it rarely spends budget, and a degraded service should back off on both fronts.
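One way to wire the budget and the load checks into the hedging path is to consult a gate at the moment the hedge timer fires. The following is a sketch under assumptions, not the article's original hedgedCall: hedgedCallWithGate and mayHedge are hypothetical names, and the gate is just a function returning a boolean.

```typescript
type Attempt<T> = (signal: AbortSignal) => Promise<T>;

// Like a plain hedged call, but the duplicate is sent only if mayHedge()
// returns true when the hedge delay expires.
async function hedgedCallWithGate<T>(
  attempt: Attempt<T>,
  hedgeDelayMs: number,
  mayHedge: () => boolean,
): Promise<T> {
  const controllers: AbortController[] = [];
  let timer: ReturnType<typeof setTimeout> | undefined;

  const launch = (): Promise<T> => {
    const controller = new AbortController();
    controllers.push(controller);
    return attempt(controller.signal);
  };

  const hedge = new Promise<T>((resolve, reject) => {
    timer = setTimeout(() => {
      if (mayHedge()) {
        launch().then(resolve, reject);
      } else {
        // Reject rather than stay pending, so a failed primary still
        // settles the race instead of hanging forever.
        reject(new Error("hedge skipped: gate closed"));
      }
    }, hedgeDelayMs);
  });

  try {
    // First success wins; rejects only if the primary and the hedge both fail.
    return await Promise.any([launch(), hedge]);
  } finally {
    if (timer) clearTimeout(timer);
    for (const controller of controllers) controller.abort();
  }
}
```

A gate such as () => !breakerIsOpen() && budget.tryClaim() (where breakerIsOpen is a hypothetical health check) tests the dependency's health first, so a token is not spent on a hedge that would be skipped anyway.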
Measure Whether Hedging Helps
Hedging is easy to misconfigure and invisible when it goes wrong, so instrument it from the start. Track the hedge rate (fraction of requests that spawned a duplicate), the win rate (fraction of hedges where the second request actually beat the first), and the p99 latency with hedging on versus off. These three numbers tell you almost everything.
Read them together. A hedge rate near your budget ceiling means the delay is too low or a dependency is genuinely slow; investigate before raising the budget. A high win rate means the hedge frequently beats the primary, which is good, unless the hedge rate is also well above what your percentile implies (about 5% for a p95 delay), which suggests the delay has drifted below the real p95 and you are hedging healthy requests. Most importantly, if p99 barely moves with hedging enabled, the tail is not coming from stragglers you can route around. It may be a systemic bottleneck like a saturated database or a lock, and hedging is just adding load without buying latency. Alert on a hedge rate that pins to the budget and on extra load that climbs without a matching drop in p99, because both signal that hedging has stopped helping and started hurting.
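The hedge rate and win rate can be derived from three counters. The sketch below is a minimal illustration; HedgeStats and the HedgeOutcome labels are hypothetical names, and it assumes the hedging wrapper reports one outcome per request. The p99 comparison is left to whatever latency metrics you already collect.

```typescript
// Outcome of one request, as reported by the hedging wrapper (assumed labels).
type HedgeOutcome = "no_hedge" | "primary_won" | "hedge_won";

class HedgeStats {
  private requests = 0;
  private hedges = 0;
  private hedgeWins = 0;

  record(outcome: HedgeOutcome): void {
    this.requests += 1;
    if (outcome !== "no_hedge") this.hedges += 1;
    if (outcome === "hedge_won") this.hedgeWins += 1;
  }

  // Fraction of requests that spawned a duplicate.
  get hedgeRate(): number {
    return this.requests === 0 ? 0 : this.hedges / this.requests;
  }

  // Fraction of hedges where the second request beat the first.
  get winRate(): number {
    return this.hedges === 0 ? 0 : this.hedgeWins / this.hedges;
  }
}

const stats = new HedgeStats();
for (let i = 0; i < 96; i++) stats.record("no_hedge");
for (let i = 0; i < 3; i++) stats.record("hedge_won");
stats.record("primary_won");
console.log(stats.hedgeRate, stats.winRate);
```

In this made-up sample, 4 of 100 requests hedged and the hedge won 3 of those 4 races, which is the kind of profile a well-tuned p95 delay would be expected to produce.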
Conclusion and Next Steps
Hedged requests are a precise tool for a specific problem: successful-but-slow calls whose tail you cannot design away. Send a backup after a p95-length delay, take the first success, cancel the loser — and the slowest few percent of requests stop dominating your latency and your agent run times. The correctness rules are what keep it safe: propagate cancellation, hedge only idempotent work, and cap the extra load with a budget that backs off under stress.
Start with one read-heavy dependency or one slow agent tool. Measure its latency distribution, set the hedge delay at p95, and cap hedging at a few percent. Watch p99 and the win rate for a week, tune the delay, and only then extend hedging to the next dependency. Paired with timeouts, retries, and circuit breakers, hedging rounds out a resilience toolkit that treats slowness — not just failure — as a first-class fault.