Articles tagged reliability

July 4, 2026 · 1 min read

Taming Tail Latency with Hedged Requests

Learn how to cut p99 latency in APIs and AI agents with hedged requests: tuning the hedge delay, hedging only idempotent work, and capping the extra load.

July 1, 2026 · 1 min read

Resource Budgets for Tool-Using AI Agents

Learn how to stop runaway AI agents with token budgets, cost ceilings, step limits, wall-clock deadlines, loop detection, and graceful degradation.

June 19, 2026 · 1 min read

Compensating Actions for Tool-Using AI Agents

Learn how to make AI agent side effects safer with compensating actions, recovery policies, idempotent undo steps, and operator-ready audit trails.

June 13, 2026 · 1 min read

Agent Command Ledgers for Reliable AI Workflows

Learn how to make AI agent side effects recoverable with command ledgers, fenced execution, reconciliation jobs, and replay-safe workflows.

June 12, 2026 · 1 min read

Sandboxing Tool-Using AI Agents

Learn how to run tool-using AI agents behind capability manifests, policy gates, sandboxes, audit logs, and recovery controls.

June 11, 2026 · 1 min read

Designing Durable Agentic Workflows

Learn how to design agentic workflows that survive retries, crashes, tool failures, human approvals, and partial progress without losing intent.

June 6, 2026 · 1 min read

Designing Circuit Breakers for Distributed Services

Learn how to stop cascading failures with circuit breakers that open on real dependency pain, probe recovery safely, and expose clear fallbacks.

June 6, 2026 · 1 min read

Designing Bulkheads for Resilient Services

Learn how to isolate service capacity with bulkheads so one slow dependency, tenant, queue, or feature cannot exhaust the whole system.

June 5, 2026 · 1 min read

Designing Load Shedding and Backpressure for APIs

Learn how to protect APIs during overload with admission control, bounded queues, backpressure signals, and clear degradation rules.

June 4, 2026 · 1 min read

Designing Retry Strategies with Backoff and Jitter

Learn how to retry transient failures without amplifying outages by combining timeouts, backoff, jitter, budgets, and observability.

June 3, 2026 · 1 min read

Graceful Shutdown for Node.js Services

Learn how to drain HTTP requests, stop background work, close dependencies, and make Node.js deployments terminate safely.

June 2, 2026 · 1 min read

Optimistic Concurrency Control for APIs and Databases

Learn how to prevent lost updates with version columns, ETags, compare-and-swap writes, and useful conflict responses.

June 1, 2026 · 1 min read

Practical API Rate Limiting with Token Buckets

Learn how to design token-bucket API rate limits that protect services without punishing normal users.

May 31, 2026 · 1 min read

Zero-Downtime Database Migrations with Expand and Contract

Learn how to ship database schema changes safely with expand-contract migrations, batched backfills, compatible application deploys, and clear rollback points.

May 28, 2026 · 1 min read

Implementing the Transactional Outbox Pattern for Reliable Events

Learn how the transactional outbox pattern keeps database writes and event publication consistent without distributed transactions.

May 27, 2026 · 1 min read

Designing Dead-Letter Queues That Help You Recover Events

Learn how to design dead-letter queues with useful metadata, triage workflows, safe replay tools, and clear ownership so failed events can be recovered instead of ignored.

May 26, 2026 · 1 min read

Implementing Idempotency Keys in APIs to Prevent Duplicate Actions

Learn how idempotency keys prevent duplicate side effects in retry-heavy clients by combining request fingerprinting, state tracking, and careful concurrency handling.

March 18, 2023 · 1 min read

Building Resilient Distributed Systems with Chaos Engineering

Learn how to use chaos engineering to make your distributed systems more resilient.