Designing Retry Strategies with Backoff and Jitter

Learn how to retry transient failures without amplifying outages by combining timeouts, backoff, jitter, budgets, and observability.

June 4, 2026

Introduction

Retries are one of the simplest reliability tools in a backend system. A network packet drops, a database primary fails over, a service returns a temporary 503, and one more attempt can turn a user-visible failure into a normal response. That is the useful side of retries.

The dangerous side is that retries multiply traffic at the exact moment a dependency is already struggling. A client that retries three times can turn one request into four. A fleet of workers using the same retry delay can create synchronized bursts. A retry loop without an overall deadline can keep work alive long after the user has given up.

Good retry design is not "try again until it works." It is a bounded policy that retries only recoverable failures, waits in a way that protects dependencies, stops before the caller's deadline, and emits enough telemetry to explain whether the policy is helping.

Retry Only Failures That Can Recover

The first design decision is not the delay. It is whether a retry should happen at all.

Retry transient failures:

Network connection resets.
DNS lookups or TLS handshakes that fail before a request reaches the server.
408 Request Timeout.
429 Too Many Requests, especially when the response includes retry guidance.
500, 502, 503, and 504 when the operation is safe to repeat.

Avoid retrying deterministic failures:

400 Bad Request.
401 Unauthorized and 403 Forbidden unless a token refresh changes the request.
404 Not Found for resources that should already exist.
Validation errors and schema mismatches.
Writes that are not idempotent and do not use an idempotency key.

A small classifier makes the policy explicit:

function isRetryableHttpStatus(status) {
  return status === 408 ||
    status === 429 ||
    status === 500 ||
    status === 502 ||
    status === 503 ||
    status === 504;
}

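function isRetryableNetworkError(error) {
  // Some clients (for example Node's built-in fetch) wrap the low-level
  // failure and expose its code on error.cause, so check both places.
  const code = error?.code ?? error?.cause?.code;
  return [
    "ECONNRESET",
    "ECONNREFUSED",
    "EHOSTUNREACH",
    "ENETUNREACH",
    "ETIMEDOUT",
  ].includes(code);
}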

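function shouldRetry({ error, response, method, hasIdempotencyKey }) {
  // GET, HEAD, and OPTIONS are safe. DELETE is not safe, but it is idempotent,
  // so repeating it should leave the server in the same state.
  const idempotentMethod = ["GET", "HEAD", "OPTIONS", "DELETE"].includes(method);
  // Conservative choice: every other write needs an idempotency key, including
  // PUT, even though HTTP defines PUT as idempotent.
  const repeatableWrite = ["PUT", "PATCH", "POST"].includes(method) && hasIdempotencyKey;

  if (!idempotentMethod && !repeatableWrite) {
    return false;
  }

  if (error && isRetryableNetworkError(error)) {
    return true;
  }

  return response ? isRetryableHttpStatus(response.status) : false;
}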

The exact list depends on your product. For example, retrying 404 may be reasonable in eventually consistent systems where a newly created resource appears after a short delay. The important part is that the exception is deliberate, documented, and tested.

Use Backoff and Jitter to Avoid Retry Storms

Immediate retries are tempting because they reduce latency when a failure clears quickly. They also create retry storms. If every caller retries immediately, the dependency receives a burst of new traffic before it has recovered.

Exponential backoff spaces attempts apart by doubling the wait after each failure. With a 100 ms base delay:

Retry 1 waits 100 ms.
Retry 2 waits 200 ms.
Retry 3 waits 400 ms.
Retry 4 waits 800 ms.

That pattern still has a synchronization problem. If thousands of clients fail at the same moment, they all wait the same intervals and retry together. Jitter adds randomness so retries spread out across the delay window.

function delayWithFullJitter({ attempt, baseMs, capMs }) {
  const exponentialDelay = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * exponentialDelay);
}

function delayWithEqualJitter({ attempt, baseMs, capMs }) {
  const exponentialDelay = Math.min(capMs, baseMs * 2 ** attempt);
  const halfDelay = exponentialDelay / 2;
  return Math.floor(halfDelay + Math.random() * halfDelay);
}

Full jitter is a good default when protecting a dependency matters more than shaving a few milliseconds from the fastest recovery path. Equal jitter keeps a minimum delay while still avoiding synchronized retries. Fixed delays are usually acceptable only for small internal loops where the number of callers is tightly controlled.
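To make the trade-off concrete, the sketch below computes the window each strategy draws its delay from. `jitterWindow` is a hypothetical helper written for this illustration, and the 100 ms base and 2000 ms cap are example values, not recommendations.

```javascript
// Hypothetical helper: the range each strategy picks its delay from.
// `attempt` is zero-based, matching the delay functions above.
function jitterWindow({ attempt, baseMs, capMs, strategy }) {
  const exponentialDelay = Math.min(capMs, baseMs * 2 ** attempt);
  if (strategy === "full") {
    return { minMs: 0, maxMs: exponentialDelay };
  }
  if (strategy === "equal") {
    return { minMs: exponentialDelay / 2, maxMs: exponentialDelay };
  }
  // "none": plain exponential backoff, so every client waits the same time.
  return { minMs: exponentialDelay, maxMs: exponentialDelay };
}

for (let attempt = 0; attempt < 6; attempt += 1) {
  const full = jitterWindow({ attempt, baseMs: 100, capMs: 2000, strategy: "full" });
  const equal = jitterWindow({ attempt, baseMs: 100, capMs: 2000, strategy: "equal" });
  console.log(
    `retry ${attempt + 1}: full ${full.minMs}-${full.maxMs} ms, ` +
      `equal ${equal.minMs}-${equal.maxMs} ms`,
  );
}
```

With these example values the cap first applies on the sixth retry, where the uncapped delay would be 3200 ms. Full jitter always allows a near-zero wait, while equal jitter guarantees at least half the exponential delay.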

Put Retries Inside a Deadline

A retry policy needs a maximum number of attempts, but attempts alone are not enough. Three retries with a long timeout can outlive the original user request and waste capacity.

Start with the caller's total deadline, then divide the time budget:

The user-facing request has 2 seconds.
The first attempt gets 500 ms.
The second attempt waits with jitter, then gets 500 ms.
The final attempt uses whatever time remains.
No retry starts if it cannot finish before the deadline.

This keeps retry behavior aligned with the business operation. If the user has already received an error, extra work is usually noise unless a background process still needs to complete it.
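The budget rules above can be sketched as a single decision: given the time left, is another attempt worth starting? `planNextAttempt` and its `minUsefulMs` threshold are assumptions made for this example. The right threshold depends on how quickly the dependency can realistically answer.

```javascript
// Hypothetical helper: decide whether another attempt is worth starting.
// Returns the delay and timeout for the next attempt, or null when the
// attempt could not plausibly finish before the deadline.
function planNextAttempt({ nowMs, deadlineMs, delayMs, attemptTimeoutMs, minUsefulMs }) {
  const remainingAfterDelay = deadlineMs - nowMs - delayMs;
  if (remainingAfterDelay < minUsefulMs) {
    return null;
  }
  // The final attempt may get less than a full timeout: whatever time remains.
  return { delayMs, timeoutMs: Math.min(attemptTimeoutMs, remainingAfterDelay) };
}

// 2-second deadline, and the first attempt has already used its 500 ms.
console.log(planNextAttempt({
  nowMs: 500, deadlineMs: 2000, delayMs: 80, attemptTimeoutMs: 500, minUsefulMs: 100,
}));

// Only 60 ms would remain after the delay, so no retry is started.
console.log(planNextAttempt({
  nowMs: 1900, deadlineMs: 2000, delayMs: 40, attemptTimeoutMs: 500, minUsefulMs: 100,
}));
```

Passing the clock in as `nowMs` keeps the decision easy to unit test without mocking timers.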

function nowMs() {
  return Date.now();
}

function remainingMs(deadlineMs) {
  return Math.max(0, deadlineMs - nowMs());
}

async function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

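async function withAttemptTimeout(fn, timeoutMs) {
  const controller = new AbortController();
  // Abort with a coded error so a timed-out attempt can be classified as
  // retryable. Clients that honor the abort reason, such as fetch, reject
  // with this error instead of a generic AbortError.
  const timeoutError = new Error(`attempt timed out after ${timeoutMs} ms`);
  timeoutError.code = "ETIMEDOUT";
  const timer = setTimeout(() => controller.abort(timeoutError), timeoutMs);

  try {
    // fn must pass the signal to the underlying client for the timeout to apply.
    return await fn({ signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}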

The helper above gives each attempt its own cancellation signal. The calling retry loop still needs to decide when another attempt is worth starting.

Build a Bounded Retry Wrapper

The wrapper below combines the pieces: retry classification, jittered backoff, per-attempt timeout, and a total deadline. It is intentionally small enough to adapt to fetch, a database client, or an internal SDK.

async function retryWithBackoff(operation, options = {}) {
  const {
    maxAttempts = 4,
    baseDelayMs = 100,
    maxDelayMs = 2000,
    attemptTimeoutMs = 500,
    totalTimeoutMs = 2500,
    // With no classifier supplied, nothing is retried.
    shouldRetry = () => false,
    onRetry = () => {},
  } = options;

  const deadlineMs = Date.now() + totalTimeoutMs;
  let lastError;

  for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
    // Never start an attempt after the total deadline has passed.
    const timeLeft = remainingMs(deadlineMs);
    if (timeLeft <= 0) {
      break;
    }

    try {
      // The per-attempt timeout shrinks to fit whatever time remains.
      return await withAttemptTimeout(
        operation,
        Math.min(attemptTimeoutMs, timeLeft),
      );
    } catch (error) {
      lastError = error;

      // Give up on the last attempt or on a non-retryable failure.
      if (attempt === maxAttempts - 1 || !shouldRetry(error)) {
        throw error;
      }

      const delayMs = delayWithFullJitter({
        attempt,
        baseMs: baseDelayMs,
        capMs: maxDelayMs,
      });

      // Skip the sleep if it would consume the rest of the deadline.
      if (delayMs >= remainingMs(deadlineMs)) {
        break;
      }

      onRetry({ attempt: attempt + 1, delayMs, error });
      await sleep(delayMs);
    }
  }

  throw lastError ?? new Error("retry deadline exceeded");
}

Used with HTTP, the operation can convert non-success responses into typed errors:

async function callInventoryService(productId) {
  return retryWithBackoff(
    async ({ signal }) => {
      const response = await fetch(`https://inventory.internal/products/${productId}`, {
        method: "GET",
        signal,
      });

      if (!response.ok) {
        const error = new Error(`inventory returned ${response.status}`);
        error.status = response.status;
        throw error;
      }

      return response.json();
    },
    {
      totalTimeoutMs: 1800,
      attemptTimeoutMs: 450,
      maxAttempts: 4,
      shouldRetry: (error) => isRetryableHttpStatus(error.status) ||
        isRetryableNetworkError(error),
      onRetry: ({ attempt, delayMs, error }) => {
        console.info("retrying inventory request", {
          attempt,
          delayMs,
          status: error.status,
          code: error.code,
        });
      },
    },
  );
}

For writes, use an idempotency key so the server can recognize repeated attempts for the same logical action:

async function createPayment(payment) {
  const idempotencyKey = crypto.randomUUID();

  return retryWithBackoff(
    async ({ signal }) => {
      const response = await fetch("https://payments.internal/payments", {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          "Idempotency-Key": idempotencyKey,
        },
        body: JSON.stringify(payment),
        signal,
      });

      if (!response.ok) {
        const error = new Error(`payment failed with ${response.status}`);
        error.status = response.status;
        throw error;
      }

      return response.json();
    },
    {
      maxAttempts: 3,
      totalTimeoutMs: 3000,
      shouldRetry: (error) => isRetryableHttpStatus(error.status),
    },
  );
}

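The key must stay the same across attempts for the same payment, which is why it is generated once, outside the retried operation. Generating a new key inside the operation would give every attempt a different key and turn deduplication off. The same applies one level up: if the caller may invoke createPayment again for the same logical payment, the key should come from the caller instead of being generated here.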
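On the receiving side, deduplication comes down to remembering the first result for each key. The sketch below is a hypothetical in-memory version for illustration only. `IdempotencyStore` is not part of any library, and a real service would need a durable store, key expiry, and a decision about whether failed results are remembered.

```javascript
// Hypothetical in-memory store, for illustration only.
class IdempotencyStore {
  constructor() {
    this.results = new Map();
  }

  // Runs `handler` at most once per key. Repeated calls with the same key
  // return the stored result. If `handler` returns a promise, concurrent
  // duplicates share that same promise.
  run(key, handler) {
    if (!this.results.has(key)) {
      this.results.set(key, handler());
    }
    return this.results.get(key);
  }
}

const store = new IdempotencyStore();
let charges = 0;
const charge = () => {
  charges += 1;
  return { paymentId: `pay_${charges}` };
};

const first = store.run("key-123", charge);
const retried = store.run("key-123", charge); // same logical payment, retried
const other = store.run("key-456", charge); // a different payment

console.log(first === retried, charges, other.paymentId);
```

The retried call returns the stored result without charging again, while the second key triggers a new charge.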

Add Retry Budgets and Observability

Retries should have a budget at the service level, not only inside one function. A retry budget limits how much extra traffic a service is allowed to generate. For example, a team might allow retry traffic to be at most 10 percent of successful original request traffic over a rolling window.

Budgets prevent a subtle failure mode: every individual caller follows a reasonable local policy, but the whole fleet still overwhelms a dependency. When the budget is exhausted, callers should fail fast, degrade gracefully, or enqueue work for later instead of adding more immediate retries.
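A minimal sketch of such a budget follows. `RetryBudget`, its option names, and the fixed window are all assumptions made to keep the example short. A rolling window or token bucket would smooth the window edges. The sketch counts every original request, so call `recordRequest` only on success to match the stricter rule in the example above.

```javascript
// Hypothetical retry budget: retries may not exceed `ratio` of the original
// requests recorded in the current fixed window.
class RetryBudget {
  constructor({ ratio = 0.1, windowMs = 10_000, minRetries = 1, now = Date.now } = {}) {
    this.ratio = ratio;
    this.windowMs = windowMs;
    this.minRetries = minRetries; // floor so low-traffic callers can still retry
    this.now = now; // injectable clock, useful in tests
    this.windowStart = now();
    this.requests = 0;
    this.retries = 0;
  }

  // Start a fresh window once the current one has elapsed.
  rollWindow() {
    if (this.now() - this.windowStart >= this.windowMs) {
      this.windowStart = this.now();
      this.requests = 0;
      this.retries = 0;
    }
  }

  recordRequest() {
    this.rollWindow();
    this.requests += 1;
  }

  // Returns true and spends budget if a retry is allowed, false otherwise.
  tryAcquireRetry() {
    this.rollWindow();
    const allowed = Math.max(this.minRetries, Math.floor(this.requests * this.ratio));
    if (this.retries >= allowed) {
      return false;
    }
    this.retries += 1;
    return true;
  }
}
```

A caller would check `tryAcquireRetry()` before sleeping for a retry and take the fail-fast or enqueue path when it returns false. The budget is only meaningful when shared by every caller of the same dependency within a process, and ideally across the fleet.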

Useful metrics include:

Original requests by dependency, route, and caller.
Retry attempts by dependency, route, caller, and reason.
Retry success rate.
Added latency from retries.
Attempts abandoned because the deadline expired.
Attempts blocked because the retry budget was exhausted.

Logs should include the attempt number, delay, error class, deadline remaining, and idempotency key only when it is safe to expose. Traces should represent retries as child spans or annotated events so operators can see when latency came from waiting rather than execution.
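As one possible shape for those log fields, the sketch below builds a structured retry event. `buildRetryEvent` and every field name in it are illustrative assumptions. Adapt them to whatever logging schema your services already use.

```javascript
// Hypothetical event builder for retry logs; field names are illustrative.
function buildRetryEvent({
  dependency,
  attempt,
  delayMs,
  error,
  deadlineMs,
  nowMs,
  idempotencyKey,
  exposeKey = false,
}) {
  const event = {
    event: "dependency_retry",
    dependency,
    attempt,
    delayMs,
    // Prefer the HTTP status, then a network error code, then the error name.
    errorClass: error.status ? `http_${error.status}` : (error.code ?? error.name),
    deadlineRemainingMs: Math.max(0, deadlineMs - nowMs),
  };
  // Include the key only when the caller has decided it is safe to expose.
  if (idempotencyKey && exposeKey) {
    event.idempotencyKey = idempotencyKey;
  }
  return event;
}

const failure = Object.assign(new Error("inventory returned 503"), { status: 503 });
console.log(buildRetryEvent({
  dependency: "inventory",
  attempt: 1,
  delayMs: 120,
  error: failure,
  deadlineMs: 2000,
  nowMs: 650,
  idempotencyKey: "key-123",
}));
```

An event like this could be emitted from the `onRetry` callback, which keeps the retry wrapper free of any particular logging library.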

Respect Server Guidance

When an API returns Retry-After, treat it as a lower bound on your delay. The response is telling callers the earliest time a retry may be safe. Combine it with your own deadline and jitter instead of blindly sleeping beyond the caller's useful time budget.

function retryDelayFromResponse(response, fallbackDelayMs) {
  const retryAfter = response.headers.get("Retry-After");
  if (!retryAfter) {
    return fallbackDelayMs;
  }

  const seconds = Number(retryAfter);
  if (Number.isFinite(seconds)) {
    return Math.max(0, seconds * 1000);
  }

  const retryAt = Date.parse(retryAfter);
  if (Number.isNaN(retryAt)) {
    return fallbackDelayMs;
  }

  return Math.max(0, retryAt - Date.now());
}

Still cap the result. A server hint that exceeds your deadline should usually become a fast failure, a queued retry, or a user-facing "try again later" response.
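One way to combine the hint, local jitter, and the deadline is sketched below. `planRetryDelay` is a hypothetical helper, and returning null to mean "give up now" is a convention chosen for this example.

```javascript
// Hypothetical helper: combine a server hint with local jitter and the deadline.
// Returns the delay to sleep, or null when the caller should stop retrying.
function planRetryDelay({ hintMs, backoffMs, remainingMs, random = Math.random }) {
  // Treat the hint as a floor: never retry earlier than the server asked.
  const floorMs = Math.max(hintMs ?? 0, 0);
  // Add jitter on top so clients given the same hint do not return together.
  const delayMs = floorMs + Math.floor(random() * backoffMs);
  // A delay that does not fit inside the deadline becomes a fast failure.
  return delayMs < remainingMs ? delayMs : null;
}

const fixedRandom = () => 0.5; // deterministic stand-in for Math.random

console.log(planRetryDelay({ hintMs: 1000, backoffMs: 200, remainingMs: 1500, random: fixedRandom }));
console.log(planRetryDelay({ hintMs: 5000, backoffMs: 200, remainingMs: 1500, random: fixedRandom }));
```

The first call fits inside the remaining time and returns a delay. The second hint exceeds the deadline, so the helper returns null and the caller can fail fast or queue the work.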

Common Failure Modes

Retry policies often fail because they optimize a single request without considering the system.

Common mistakes include:

Retrying every POST without idempotency keys.
Retrying after the user request has timed out.
Applying retries at multiple layers, such as SDK, service, proxy, and queue, without a shared limit.
Using the same fixed delay across an entire fleet.
Treating 429 as an invitation to retry aggressively.
Hiding retries from metrics, which makes dependency saturation look mysterious.
Retrying deterministic validation errors instead of fixing the caller.
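The layered-retry mistake in the list above is easier to see with numbers. The sketch below multiplies the attempts at each layer to get the worst case. `worstCaseAmplification` is a hypothetical helper, and the layer counts are made up for illustration.

```javascript
// Worst-case number of requests reaching the bottom dependency when every
// layer exhausts its attempts. Each entry is that layer's total attempts
// (1 original + its retries).
function worstCaseAmplification(attemptsPerLayer) {
  return attemptsPerLayer.reduce((total, attempts) => total * attempts, 1);
}

// SDK makes 3 attempts, the service 3, and the proxy 2.
console.log(worstCaseAmplification([3, 3, 2])); // 18
```

Three modest-looking policies stack into up to 18 requests for one user action, which is why retries are usually best owned by one layer or governed by a shared limit.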

The fix is usually not more code. It is a clearer contract: which failures are transient, how long the caller still cares, which operations are repeatable, and how much extra traffic the system can afford.

Conclusion and Next Steps

Retries are useful when they are specific, bounded, and observable. Start by classifying retryable failures, then add exponential backoff with jitter. Put every attempt inside the caller's deadline, protect writes with idempotency keys, and track retry traffic as a first-class production signal.

A good next step is to audit one dependency call that already retries today. Check whether it has jitter, a total deadline, a retryable-error classifier, useful telemetry, and an answer for non-idempotent writes. That small audit usually reveals whether retries are making the system more resilient or just louder during outages.