Designing Circuit Breakers for Distributed Services
Learn how to stop cascading failures with circuit breakers that open on real dependency pain, probe recovery safely, and expose clear fallbacks.
Introduction
Distributed systems fail in uneven ways. A search service can start timing out while the database is healthy. A payment provider can return intermittent 503 responses while the rest of the checkout flow is fine. A downstream API can become so slow that callers keep waiting, worker pools fill up, and unrelated requests begin to fail.
A circuit breaker protects callers from repeatedly spending capacity on a dependency that is already failing. When the dependency looks healthy, the circuit is closed and requests pass through. When enough calls fail or time out, the circuit opens and new calls fail fast or use a fallback. After a cool-down period, the circuit allows a small number of probes in a half-open state to see whether the dependency has recovered.
Circuit breakers are not a replacement for timeouts, retries, rate limits, bulkheads, or backpressure. They need those patterns around them. A timeout defines when a call is too slow. A retry policy handles occasional transient failures. A bulkhead limits how much capacity the dependency can consume. The circuit breaker decides when continued attempts are making the situation worse.
This article walks through practical circuit breaker design for backend services: which failures should count, how to model states, how to choose fallback behavior, how to tune thresholds, and which metrics make the breaker understandable during incidents.
Open on Dependency Pain, Not Every Error
A common first mistake is counting every exception as a reason to open the circuit. A circuit breaker should shield a dependency from traffic when that dependency is unavailable, overloaded, or too slow. It should not hide caller bugs, validation errors, or authorization failures.
Useful failure signals include:
- Connection resets, refused connections, DNS failures, and TLS failures.
- 500, 502, 503, and 504 responses from the dependency.
- 429 responses when the dependency is rate limiting your service.
- Call timeouts and deadline cancellations.
- High latency when slow calls consume the same capacity as failed calls.
Signals that usually should not count include:
- 400 Bad Request, because the caller sent invalid input.
- 401 Unauthorized or 403 Forbidden, unless the dependency outage manifests through those responses.
- 404 Not Found for normal lookup misses.
- Domain conflicts such as insufficient balance, duplicate username, or stale version token.
A small classifier keeps this decision visible:
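function isCircuitBreakerFailure(result) {
  // Transport-level errors: the dependency could not be reached in time.
  if (result.error) {
    return [
      "ECONNRESET",
      "ECONNREFUSED",
      "EHOSTUNREACH",
      "ENETUNREACH",
      "ETIMEDOUT",
    ].includes(result.error.code);
  }
  // No error and no response: nothing to judge, so do not count it.
  if (!result.response) {
    return false;
  }
  // Status codes that signal an unhealthy or overloaded dependency.
  return [
    429,
    500,
    502,
    503,
    504,
  ].includes(result.response.status);
}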
The exact list depends on the dependency. A cache lookup miss may be a normal 404. A user-service 404 during checkout may mean the request is invalid. The important part is that the breaker reflects dependency health, not general business outcomes.
Require enough samples
A breaker that opens after one failed call is usually too sensitive: a single dropped packet would cut callers off from a healthy dependency. Require a minimum number of completed calls before calculating the failure rate.
For example:
- Do not evaluate until at least 20 calls have completed in the current window.
- Open if 50 percent or more of those calls failed.
- Count slow calls as failures when they exceed the caller deadline.
- Keep the window short enough to react, but long enough to avoid noise.
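The rules in this list can be sketched as a pure function over a time-based window. The helper name `evaluateWindow`, the option names, and the sample shape `{ at, failed, durationMs }` are assumptions made for illustration, not part of any library.

```javascript
// Hypothetical helper: decide whether a rolling window justifies opening.
// Each sample is assumed to look like { at: epochMs, failed: boolean, durationMs: number }.
function evaluateWindow(samples, now, options = {}) {
  const {
    windowMs = 10_000, // how far back to look
    minimumSamples = 20, // do not judge on fewer completed calls
    failureThreshold = 0.5, // open at or above this failure rate
    slowCallMs = Infinity, // calls slower than this count as failures
  } = options;

  // Keep only the samples that completed inside the window.
  const recent = samples.filter((s) => now - s.at <= windowMs);
  if (recent.length < minimumSamples) {
    return { shouldOpen: false, failureRate: null, sampleCount: recent.length };
  }

  const failures = recent.filter(
    (s) => s.failed || s.durationMs > slowCallMs,
  ).length;
  const failureRate = failures / recent.length;

  return {
    shouldOpen: failureRate >= failureThreshold,
    failureRate,
    sampleCount: recent.length,
  };
}
```

Keeping the decision pure makes it easy to unit test with hand-built samples and a fixed `now`, independent of how the breaker stores its state.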
Low-traffic dependencies need special care. If a dependency receives only one call per minute, a 20-sample minimum takes 20 minutes to fill, so a percentage-based breaker can open long after the outage began and, with so few calls to learn from, stay open well after recovery. In that case, use explicit consecutive failure limits, manual controls, or dependency health signals alongside the rolling window.
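A consecutive failure limit, one of the options mentioned here, could look roughly like the following. The class name `ConsecutiveFailureTrip` and its default limit of 3 are assumptions chosen for the example.

```javascript
// Hypothetical counter for low-traffic dependencies: trips after N failures in a row.
class ConsecutiveFailureTrip {
  constructor(limit = 3) {
    this.limit = limit;
    this.consecutiveFailures = 0;
  }

  // Record one completed call. Returns true when the breaker should open.
  record(failed) {
    // Any success resets the streak, so isolated failures never trip the breaker.
    this.consecutiveFailures = failed ? this.consecutiveFailures + 1 : 0;
    return this.consecutiveFailures >= this.limit;
  }

  // Call after the breaker closes again.
  reset() {
    this.consecutiveFailures = 0;
  }
}
```

A streak counter reacts after a fixed number of calls regardless of traffic rate, which is why it suits dependencies that cannot fill a percentage window quickly.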
Model Closed, Open, and Half-Open States
A circuit breaker should make its state transitions explicit. Hidden state makes incidents harder to debug and makes tests harder to write.
The basic state machine is:
closed    -> open       too many failures
open      -> half-open  cool-down elapsed
half-open -> closed     probe succeeds
half-open -> open       probe fails
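One way to keep these transitions explicit is to encode them as data. The sketch below is an illustration only; the event names (`failure_threshold_reached`, `cooldown_elapsed`, `probe_succeeded`, `probe_failed`) are invented for this example.

```javascript
// Allowed transitions, keyed by current state and then by event name.
// The event names are illustrative assumptions.
const TRANSITIONS = new Map([
  ["closed", new Map([["failure_threshold_reached", "open"]])],
  ["open", new Map([["cooldown_elapsed", "half-open"]])],
  [
    "half-open",
    new Map([
      ["probe_succeeded", "closed"],
      ["probe_failed", "open"],
    ]),
  ],
]);

// Returns the next state, or throws so that illegal transitions are loud in tests.
function nextState(state, event) {
  const next = TRANSITIONS.get(state)?.get(event);
  if (!next) {
    throw new Error(`illegal transition: ${state} on ${event}`);
  }
  return next;
}
```

With the table in one place, a test can enumerate every state and event pair and assert that only the four listed transitions are accepted.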
Here is a compact JavaScript implementation that shows the mechanics:
// Thrown when the breaker rejects a call without running it.
class CircuitOpenError extends Error {
  constructor(name) {
    super(`circuit open: ${name}`);
    this.name = "CircuitOpenError";
  }
}

class CircuitBreaker {
  constructor({
    name,
    failureThreshold = 0.5,
    minimumSamples = 20,
    resetAfterMs = 10_000,
  }) {
    this.name = name;
    this.failureThreshold = failureThreshold;
    this.minimumSamples = minimumSamples;
    this.resetAfterMs = resetAfterMs;
    this.state = "closed";
    this.openedAt = 0;
    this.window = []; // most recent samples, capped at minimumSamples
  }

  async execute(task) {
    if (this.state === "open") {
      const elapsed = Date.now() - this.openedAt;
      if (elapsed < this.resetAfterMs) {
        throw new CircuitOpenError(this.name);
      }
      // Cool-down elapsed: let this call through as a probe.
      this.state = "half-open";
    }
    try {
      const value = await task();
      this.record({ failed: false });
      return value;
    } catch (error) {
      // Every rejection counts as a failure here; a production breaker
      // would classify the error before recording it.
      this.record({ failed: true });
      throw error;
    }
  }

  record(sample) {
    // Ignore late results from calls that started before the breaker opened,
    // so they cannot restart the cool-down.
    if (this.state === "open") {
      return;
    }
    // In half-open, a single probe result decides the next state.
    if (this.state === "half-open") {
      if (sample.failed) {
        this.open();
      } else {
        this.close();
      }
      return;
    }
    this.window.push(sample);
    this.window = this.window.slice(-this.minimumSamples);
    if (this.window.length < this.minimumSamples) {
      return;
    }
    const failures = this.window.filter((item) => item.failed).length;
    const failureRate = failures / this.window.length;
    if (failureRate >= this.failureThreshold) {
      this.open();
    }
  }

  open() {
    this.state = "open";
    this.openedAt = Date.now();
  }

  close() {
    this.state = "closed";
    this.openedAt = 0;
    this.window = [];
  }
}
This example is intentionally small. A production breaker usually needs a time-based rolling window, separate counters for success, failure, timeout, rejection, and fallback, plus concurrency limits around half-open probes. The shape is still the same: reject while open, probe cautiously, and close only after recovery evidence.
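As one illustration of the half-open concurrency limit mentioned above, a small gate can cap how many probes are in flight. `HalfOpenGate` is a hypothetical name, and how it is wired into `execute` is left as an assumption.

```javascript
// Hypothetical gate that caps concurrent probes while a breaker is half-open.
class HalfOpenGate {
  constructor(maxProbes = 1) {
    this.maxProbes = maxProbes;
    this.inFlight = 0;
  }

  // Returns true if the caller may send a probe; false means reject fast.
  tryAcquire() {
    if (this.inFlight >= this.maxProbes) {
      return false;
    }
    this.inFlight += 1;
    return true;
  }

  // Call in a finally block once the probe settles, whatever the outcome.
  release() {
    if (this.inFlight > 0) {
      this.inFlight -= 1;
    }
  }
}
```

A breaker could call `tryAcquire()` whenever its state is half-open and throw `CircuitOpenError` when the gate refuses, so only the configured number of callers ever reach the dependency during recovery checks.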
Pair Breakers with Timeouts and Fallbacks
A breaker cannot know a dependency is slow unless each call has a deadline. Without timeouts, calls can hang until every worker, thread, or connection is occupied. The breaker may eventually open, but the service has already spent the capacity it was trying to protect.
Put a deadline around the protected operation:
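// Runs operation with an abort signal that fires after timeoutMs.
// The operation must honor the signal for the deadline to take effect.
async function withTimeout(operation, timeoutMs) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await operation(controller.signal);
  } finally {
    // Always clear the timer so a fast call does not leave it pending.
    clearTimeout(timer);
  }
}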
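The helper above only ends the call if the operation honors the abort signal. For clients that ignore it, a hedged alternative is to race the operation against a timer. The names `withHardTimeout` and `TimeoutError` are assumptions for this sketch, not an existing API.

```javascript
// Hypothetical error type so callers can tell deadline expiry from other failures.
class TimeoutError extends Error {
  constructor(timeoutMs) {
    super(`operation exceeded ${timeoutMs} ms`);
    this.name = "TimeoutError";
  }
}

// Rejects after timeoutMs even if the operation never settles.
async function withHardTimeout(operation, timeoutMs) {
  const controller = new AbortController();
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // still signal cooperative operations
      reject(new TimeoutError(timeoutMs));
    }, timeoutMs);
  });
  try {
    // Whichever settles first decides the outcome.
    return await Promise.race([operation(controller.signal), deadline]);
  } finally {
    clearTimeout(timer);
  }
}
```

The caller is released at the deadline, but an operation that ignores the signal keeps running in the background and keeps holding its connection, so this variant limits caller latency rather than resource usage. A caller using it would also need to treat `TimeoutError` as a fallback trigger alongside `AbortError`.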
Then use the breaker at the dependency boundary, not scattered across every route handler:
const recommendationsBreaker = new CircuitBreaker({
  name: "recommendations-api",
  failureThreshold: 0.5,
  minimumSamples: 30,
  resetAfterMs: 15_000,
});

async function getRecommendations(userId) {
  try {
    return await recommendationsBreaker.execute(() =>
      withTimeout((signal) =>
        fetchRecommendations({ userId, signal }),
        700,
      ),
    );
  } catch (error) {
    if (error.name === "CircuitOpenError" || error.name === "AbortError") {
      return getCachedRecommendations(userId);
    }
    throw error;
  }
}
The fallback here returns cached recommendations because the feature can degrade safely. That is not always true. A payment authorization should not silently return a fake success. A permission check should not default to allowing access. A circuit breaker needs a fallback policy that matches the business risk of the operation.
Choose the fallback deliberately
Common fallback options include:
- Return stale cached data with a visible freshness limit.
- Omit an optional section from the response.
- Queue work for later if the operation is safe to delay.
- Return a clear 503 with Retry-After.
- Route to a secondary provider if the data model and failure modes are understood.
The dangerous fallback is one that hides correctness problems. If the dependency answers "can this user perform this action?", failing open may become a security issue. If the dependency charges a card, replaying later without idempotency may double charge a customer. For critical writes, fail fast and preserve the user's intent rather than pretending the dependency succeeded.
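One way to make these choices reviewable is a policy table keyed by operation. Everything in the sketch below is hypothetical: the operation names, the policy kinds, and the `fallbackFor` helper are invented to illustrate the idea.

```javascript
// Hypothetical policy table: operation names and policy kinds are illustrative.
const FALLBACK_POLICIES = new Map([
  ["recommendations.list", { kind: "stale-cache", maxAgeMs: 5 * 60_000 }],
  ["profile.badges", { kind: "omit-section" }],
  ["export.generate", { kind: "queue-for-later" }],
  ["payment.authorize", { kind: "fail-fast", retryAfterSeconds: 30 }],
  ["permissions.check", { kind: "fail-closed" }],
]);

// Unknown operations fail fast, so a missing entry never degrades silently.
function fallbackFor(operation) {
  return (
    FALLBACK_POLICIES.get(operation) ?? { kind: "fail-fast", retryAfterSeconds: 30 }
  );
}
```

Defaulting to fail-fast means a newly added operation has to opt in to a softer fallback, which keeps risky degradations from appearing by accident.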
Tune Thresholds for Traffic Shape
Circuit breaker settings are production contracts. They decide when your service stops calling another service. Do not copy thresholds without checking traffic, latency, and business impact.
Start with these inputs:
- Normal request rate to the dependency.
- Expected p95 and p99 latency during healthy periods.
- Maximum caller deadline.
- Dependency rate limits and concurrency limits.
- Cost of a false open, where healthy traffic is blocked.
- Cost of a false closed, where failing traffic continues.
High-volume dependencies can use short rolling windows because they collect enough samples quickly. Low-volume dependencies may need longer windows, consecutive failure limits, or manual overrides. Expensive operations may deserve aggressive opening because every failed call costs meaningful capacity. Critical operations may need more conservative opening plus reserved capacity and better fallbacks.
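The relationship between request rate and window length is simple arithmetic: the time needed to collect the minimum sample count is that count divided by the request rate. The helper below is a rough sketch; `detectionPlan`, its parameter names, and the two strategy labels are assumptions for illustration.

```javascript
// Hypothetical helper: can a rolling window collect enough samples in time?
function detectionPlan({ minimumSamples, requestsPerSecond, maxDetectionSeconds }) {
  // Time needed before the window holds enough calls to be evaluated.
  const fillSeconds =
    requestsPerSecond > 0 ? minimumSamples / requestsPerSecond : Infinity;
  return {
    fillSeconds,
    // If the window cannot fill within the acceptable detection time,
    // a consecutive failure limit is likely the better primary signal.
    strategy:
      fillSeconds <= maxDetectionSeconds ? "rolling-window" : "consecutive-failures",
  };
}
```

For example, 20 samples at 200 requests per second fill in 0.1 seconds, while the same 20 samples at one request every two seconds take 40 seconds, which may already be longer than the outage you are trying to contain.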
Avoid flapping
Flapping happens when a breaker opens, probes once, closes, receives a burst of traffic, and immediately opens again. It creates unstable behavior for callers and noisy alerts for operators.
Reduce flapping with:
- A cool-down period long enough for the dependency to recover.
- A half-open limit of one or a few probe calls.
- Several successful probes before closing for high-risk dependencies.
- A ramp-up period after close, especially for dependencies that need warm caches.
- Separate breakers per dependency and operation class.
Half-open traffic should be tiny compared with normal traffic. The point is to test recovery, not to resume full load before evidence exists.
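Two of the measures listed above, requiring several successful probes and ramping traffic after close, can be sketched as follows. `ProbeTracker`, `rampUpFraction`, the default of three successes, and the linear ramp starting at 10 percent are all assumptions for this example.

```javascript
// Hypothetical half-open tracker: close only after N successful probes in a row.
class ProbeTracker {
  constructor(requiredSuccesses = 3) {
    this.requiredSuccesses = requiredSuccesses;
    this.successes = 0;
  }

  // Returns the action the breaker should take after one probe result.
  record(failed) {
    if (failed) {
      this.successes = 0;
      return "reopen";
    }
    this.successes += 1;
    if (this.successes >= this.requiredSuccesses) {
      this.successes = 0;
      return "close";
    }
    return "keep-probing";
  }
}

// Hypothetical linear ramp: fraction of traffic to admit after the breaker closes.
function rampUpFraction(msSinceClose, rampMs, startFraction = 0.1) {
  if (rampMs <= 0 || msSinceClose >= rampMs) {
    return 1;
  }
  const progress = Math.max(0, msSinceClose) / rampMs;
  return startFraction + (1 - startFraction) * progress;
}
```

A caller could admit each request with probability `rampUpFraction(...)` during the ramp and send the rest to the fallback, so a dependency with cold caches is not hit with full load the moment the breaker closes.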
Observe the Breaker as a Product Surface
When a circuit breaker opens, users may see missing recommendations, delayed exports, failed checkout steps, or partial search results. Treat breaker telemetry as part of the product's reliability surface, not only as internal library metrics.
At minimum, record:
- Current state by breaker name.
- State transitions with reason and duration.
- Calls allowed, rejected, timed out, failed, and succeeded.
- Fallback count and fallback type.
- Half-open probe count and result.
- Dependency latency distribution.
- Caller route, tenant, or job type when relevant.
A simple event hook makes state changes visible:
// logger and metrics stand for the service's existing logging and metrics clients.
function emitBreakerEvent({ breaker, from, to, reason }) {
  logger.info({
    event: "circuit_breaker_transition",
    breaker,
    from,
    to,
    reason,
    changedAt: new Date().toISOString(),
  });
  metrics.increment("circuit_breaker.transition", {
    breaker,
    from,
    to,
    reason,
  });
}
Dashboards should answer operational questions quickly:
- Which dependency is causing the most open circuits?
- Are callers receiving fallback responses or hard failures?
- Did the breaker open before or after latency rose?
- Are retries increasing after breaker rejections?
- Is one tenant or route driving the failures?
Alerting on every open event can be too noisy. Alert on user impact and sustained behavior: a critical breaker open for several minutes, fallback rate above a threshold, half-open probes failing repeatedly, or a sudden increase in breaker rejections for a high-value path.
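Alert rules of this kind can be expressed as a small function over a breaker snapshot. The sketch below is illustrative: the snapshot fields (`critical`, `state`, `unhealthySince`, `calls`, `fallbacks`), the rule names, and the default thresholds are assumptions, not recommendations.

```javascript
// Hypothetical alert evaluation over a periodic breaker snapshot.
// unhealthySince is assumed to be the time the breaker first left "closed",
// and is not reset by failed half-open probes.
function evaluateAlerts(snapshot, now, rules = {}) {
  const { sustainedOpenMs = 3 * 60_000, maxFallbackRate = 0.2 } = rules;
  const alerts = [];

  // Sustained trouble on a critical breaker, not a single open event.
  if (
    snapshot.critical &&
    snapshot.state !== "closed" &&
    now - snapshot.unhealthySince >= sustainedOpenMs
  ) {
    alerts.push("sustained_open");
  }

  // User impact: too many calls answered by a fallback.
  if (snapshot.calls > 0 && snapshot.fallbacks / snapshot.calls > maxFallbackRate) {
    alerts.push("fallback_rate");
  }

  return alerts;
}
```

Tracking the time since the breaker last left the closed state, rather than the most recent open event, keeps a breaker that keeps failing its probes from looking freshly opened each time.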
Test the uncomfortable paths
Circuit breakers are easy to demo and easy to misconfigure. Test the paths that only happen during dependency trouble:
- The dependency times out and the breaker opens after the expected sample count.
- Open circuits fail fast without starting outbound calls.
- Half-open allows only the configured number of probes.
- A successful probe closes the breaker, or several probes close it when required.
- A failed probe reopens the breaker.
- Fallback responses are correct, safe, and observable.
- Retries do not keep hammering an open circuit.
These tests should run below the route layer when possible. A dependency wrapper with deterministic fake clocks and fake transport errors is easier to test than a full integration environment, and it protects every caller that uses the wrapper.
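A test with a fake clock might look like the sketch below. To stay self-contained it uses `TinyBreaker`, a deliberately minimal breaker with an injectable `now` function written only for this example; `fakeClock`, `check`, and `runBreakerTests` are likewise hypothetical names.

```javascript
// Minimal breaker with an injectable clock, written only for this test sketch.
class TinyBreaker {
  constructor({ failureLimit, resetAfterMs, now }) {
    this.failureLimit = failureLimit;
    this.resetAfterMs = resetAfterMs;
    this.now = now;
    this.state = "closed";
    this.failures = 0;
    this.openedAt = 0;
  }

  // Returns true if a call may proceed; moves open -> half-open after cool-down.
  allow() {
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.resetAfterMs) return false;
      this.state = "half-open";
    }
    return true;
  }

  record(failed) {
    if (!failed) {
      this.state = "closed";
      this.failures = 0;
      return;
    }
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.failureLimit) {
      this.state = "open";
      this.openedAt = this.now();
      this.failures = 0;
    }
  }
}

// Fake clock: tests advance time explicitly instead of sleeping.
function fakeClock(start = 0) {
  let current = start;
  return { now: () => current, advance: (ms) => { current += ms; } };
}

function check(condition, message) {
  if (!condition) throw new Error(message);
}

function runBreakerTests() {
  const clock = fakeClock();
  const breaker = new TinyBreaker({ failureLimit: 3, resetAfterMs: 10_000, now: clock.now });

  // Three failures in a row open the breaker.
  for (let i = 0; i < 3; i += 1) {
    check(breaker.allow(), "closed breaker should allow calls");
    breaker.record(true);
  }
  check(breaker.state === "open", "breaker should open after the failure limit");
  check(breaker.allow() === false, "open breaker should reject without calling out");

  // After the cool-down, one probe is allowed; a failed probe reopens.
  clock.advance(10_000);
  check(breaker.allow() === true, "probe should be allowed after cool-down");
  check(breaker.state === "half-open", "breaker should be half-open while probing");
  breaker.record(true);
  check(breaker.state === "open", "failed probe should reopen the breaker");

  // A successful probe closes the breaker.
  clock.advance(10_000);
  check(breaker.allow() === true, "second probe should be allowed");
  breaker.record(false);
  check(breaker.state === "closed", "successful probe should close the breaker");
  return "ok";
}
```

Because time only moves when the test calls `advance`, the cool-down boundary can be checked exactly, with no sleeps and no timing flakiness.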
Conclusion
Circuit breakers keep dependency failures from becoming service-wide failures. They stop wasteful calls when a dependency is already unhealthy, give downstream systems time to recover, and let callers choose explicit fallback behavior instead of hanging behind slow operations.
Start with one dependency that has caused incidents or slowdowns before. Add timeouts first, define which failures count, wrap calls behind a named breaker, expose metrics for state transitions and fallbacks, and test open and half-open behavior with fake dependency failures. Once the breaker is predictable, combine it with retries, bulkheads, and backpressure so the whole service has a coherent overload strategy.