Designing Durable Agentic Workflows

Learn how to design agentic workflows that survive retries, crashes, tool failures, human approvals, and partial progress without losing intent.

Introduction

Agentic systems are easy to prototype as a loop: ask a model what to do, run a tool, append the result to the conversation, and repeat until the model says it is done. That loop is useful for a demo, but it is too fragile for production work that changes tickets, deploys code, sends email, moves money, or touches customer data.

The hard part is not making an agent call a tool. The hard part is preserving intent when the process crashes halfway through, a tool times out after performing the side effect, the model proposes an unsafe action, a human approval arrives hours later, or the workflow needs to resume after a deploy. Durable agentic workflows treat the agent as a stateful process with checkpoints, explicit transitions, idempotent side effects, and auditable recovery paths.

This article walks through a practical design for durable agent workflows. The examples use TypeScript, SQL, and queue-style execution, but the same principles apply whether you run agents in a job worker, workflow engine, serverless function, or long-lived service.

Model the Agent as a Workflow, Not a Chat Loop

A durable agent needs an explicit workflow model. That workflow model does not need to be complicated, but it should separate intent, state, tool effects, and completion criteria. If all of that exists only in a prompt transcript, resuming after a failure becomes guesswork.

Start by naming the states the workflow can occupy:

type AgentRunStatus =
  | "queued"
  | "planning"
  | "waiting_for_tool"
  | "waiting_for_approval"
  | "recovering"
  | "completed"
  | "failed"
  | "cancelled";

type AgentStepKind =
  | "model_decision"
  | "tool_call"
  | "approval_request"
  | "approval_response"
  | "checkpoint";

type AgentRun = {
  id: string;
  objective: string;
  status: AgentRunStatus;
  currentStepId: string | null;
  version: number;
  createdAt: string;
  updatedAt: string;
};

This small type definition changes how the system behaves. A worker can resume a run by loading currentStepId. A dashboard can show whether the run is waiting on a tool or a person. A deploy can stop workers without losing which state the run was in. Support can inspect the state without replaying every model message by hand.
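Naming the states is most useful when the allowed moves between them are also written down. The following is a minimal sketch of a transition guard. The specific edges in ALLOWED_TRANSITIONS and the helper names canTransition and assertTransition are assumptions made for illustration, not part of the design above; adjust them to match your own workflow.

```typescript
// Same status union as the article's AgentRunStatus type.
type AgentRunStatus =
  | "queued"
  | "planning"
  | "waiting_for_tool"
  | "waiting_for_approval"
  | "recovering"
  | "completed"
  | "failed"
  | "cancelled";

// Hypothetical transition table: which statuses a run may move to from each status.
// Terminal statuses have no outgoing edges.
const ALLOWED_TRANSITIONS: Record<AgentRunStatus, readonly AgentRunStatus[]> = {
  queued: ["planning", "cancelled"],
  planning: ["waiting_for_tool", "waiting_for_approval", "completed", "failed", "cancelled"],
  waiting_for_tool: ["planning", "recovering", "failed", "cancelled"],
  waiting_for_approval: ["planning", "waiting_for_tool", "failed", "cancelled"],
  recovering: ["planning", "waiting_for_tool", "failed", "cancelled"],
  completed: [],
  failed: [],
  cancelled: [],
};

// Returns true when the table lists `to` as a legal next status for `from`.
function canTransition(from: AgentRunStatus, to: AgentRunStatus): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}

// Throws instead of silently writing an illegal status change.
function assertTransition(from: AgentRunStatus, to: AgentRunStatus): void {
  if (!canTransition(from, to)) {
    throw new Error(`Illegal transition: ${from} -> ${to}`);
  }
}
```

A worker would call assertTransition before persisting a new status, so a bug or a stale message cannot move a completed run back into planning.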

Keep the model response advisory

The model can propose the next action, but your workflow should decide whether that action is allowed. Treat model output as a request to transition state, not as the state transition itself.

For example, the model may propose:

{
  "action": "send_customer_email",
  "reason": "The customer asked for a summary of the incident.",
  "args": {
    "template": "incident-summary",
    "ticketId": "TCK-1024"
  }
}

The workflow still checks policy before running the tool:

  • Is this action allowed for this workflow type?
  • Does the tool require human approval?
  • Are the arguments complete and valid?
  • Has this exact side effect already been attempted?
  • Is the run still on the same version the worker loaded?

That boundary keeps autonomy useful without letting a prompt bypass application rules.
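The checklist above can be sketched as a single pure function that turns a model proposal into a workflow decision. Everything named here (TOOL_POLICIES, evaluateProposal, the search_tickets tool, the reason strings) is hypothetical and exists only to illustrate the shape of the check.

```typescript
type ProposedAction = {
  action: string;
  reason: string;
  args: Record<string, unknown>;
};

type ToolPolicy = {
  requiresApproval: boolean;
  requiredArgs: string[];
};

type PolicyDecision =
  | { outcome: "execute" }
  | { outcome: "needs_approval" }
  | { outcome: "reject"; reason: string };

// Hypothetical allowlist for one workflow type. Tools not listed here are rejected.
const TOOL_POLICIES: Record<string, ToolPolicy> = {
  search_tickets: { requiresApproval: false, requiredArgs: ["query"] },
  send_customer_email: { requiresApproval: true, requiredArgs: ["template", "ticketId"] },
};

function evaluateProposal(input: {
  proposal: ProposedAction;
  loadedVersion: number; // run version the worker read before calling the model
  currentVersion: number; // run version in storage right now
  alreadyAttempted: boolean; // a tool call record exists for this side effect
}): PolicyDecision {
  const { proposal, loadedVersion, currentVersion, alreadyAttempted } = input;

  // Another worker advanced the run while this one was waiting on the model.
  if (loadedVersion !== currentVersion) {
    return { outcome: "reject", reason: "stale_run_version" };
  }

  // Only tools on the allowlist may run.
  if (!Object.prototype.hasOwnProperty.call(TOOL_POLICIES, proposal.action)) {
    return { outcome: "reject", reason: "action_not_allowed" };
  }
  const policy = TOOL_POLICIES[proposal.action];

  // Arguments must be present before anything is executed.
  const missing = policy.requiredArgs.filter((argName) => {
    const value = proposal.args[argName];
    return value === undefined || value === null || value === "";
  });
  if (missing.length > 0) {
    return { outcome: "reject", reason: `missing_args:${missing.join(",")}` };
  }

  // A prior attempt means recovery code should reconcile, not re-run.
  if (alreadyAttempted) {
    return { outcome: "reject", reason: "side_effect_already_attempted" };
  }

  return policy.requiresApproval ? { outcome: "needs_approval" } : { outcome: "execute" };
}
```

With this table, the send_customer_email proposal shown earlier evaluates to needs_approval, while a proposal for an unlisted tool is rejected before any side effect can occur.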

Persist Checkpoints and Side Effects

Durability starts with storing enough information to resume the workflow safely. At minimum, persist the run record, each step, each tool invocation, and the idempotency key for every external side effect.

A relational schema can be straightforward:

CREATE TABLE agent_runs (
  id uuid PRIMARY KEY,
  objective text NOT NULL,
  status text NOT NULL,
  current_step_id uuid,
  version integer NOT NULL DEFAULT 1,
  created_at timestamptz NOT NULL DEFAULT now(),
  updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE agent_steps (
  id uuid PRIMARY KEY,
  run_id uuid NOT NULL REFERENCES agent_runs(id),
  kind text NOT NULL,
  status text NOT NULL,
  input jsonb NOT NULL,
  output jsonb,
  error jsonb,
  created_at timestamptz NOT NULL DEFAULT now(),
  completed_at timestamptz
);

CREATE TABLE agent_tool_calls (
  id uuid PRIMARY KEY,
  run_id uuid NOT NULL REFERENCES agent_runs(id),
  step_id uuid NOT NULL REFERENCES agent_steps(id),
  tool_name text NOT NULL,
  idempotency_key text NOT NULL UNIQUE,
  request jsonb NOT NULL,
  response jsonb,
  error jsonb,
  status text NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now(),
  completed_at timestamptz
);

The unique idempotency_key is the difference between safe recovery and duplicate side effects. If a worker crashes after creating a GitHub issue but before marking the step complete, the retried worker should discover the prior tool call and reconcile it instead of opening another issue.

Checkpoint after decisions and effects

Checkpoint before an expensive or risky action so the system knows what it intended to do. Checkpoint after the action so the system knows what happened. That gives recovery code a clear path:

  • If there is a planned tool call with no execution record, it can execute the tool.
  • If there is an execution record with no response, it can query the external system or retry with the same idempotency key.
  • If there is a response but the run state was not advanced, it can advance the workflow without repeating the side effect.

This is the same reliability habit used in payment systems, event relays, and deployment tools: write intent, perform work, record the result, then advance state.
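The three recovery cases above map naturally onto a small decision function. This is a sketch under stated assumptions: the ToolCallRecord shape, the RecoveryAction names, and the choice to treat a failed record like a started one (because a timeout may have performed the side effect) are all illustrative.

```typescript
// Hypothetical view of a stored tool call, loosely mirroring agent_tool_calls.
type ToolCallRecord = {
  idempotencyKey: string;
  status: "started" | "completed" | "failed";
  response: unknown;
};

type RecoveryAction =
  | { kind: "execute_tool" }
  | { kind: "reconcile_or_retry_same_key"; idempotencyKey: string }
  | { kind: "advance_run"; response: unknown }
  | { kind: "nothing_to_do" };

// Decides what a resuming worker should do for a planned tool step.
// `toolCall` is null when intent was checkpointed but no execution record exists.
// `runAdvanced` is true when the run already moved past this step.
function planRecovery(toolCall: ToolCallRecord | null, runAdvanced: boolean): RecoveryAction {
  // Planned but never started: executing for the first time is safe.
  if (toolCall === null) {
    return { kind: "execute_tool" };
  }

  // Started or failed without a stored response: the outcome is unknown,
  // so query the external system or retry with the same key.
  if (toolCall.status !== "completed") {
    return { kind: "reconcile_or_retry_same_key", idempotencyKey: toolCall.idempotencyKey };
  }

  // Response stored but the run was not advanced: move on without repeating the call.
  if (!runAdvanced) {
    return { kind: "advance_run", response: toolCall.response };
  }

  return { kind: "nothing_to_do" };
}
```

Keeping this decision free of I/O makes it easy to unit test every crash point without a database or a real tool.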

Make Tool Calls Recoverable

Tool calls are where agent workflows usually become unsafe. A read-only search tool can be retried freely. A write tool that comments on a pull request, deploys a service, or emails a user cannot.

Wrap every write tool in an execution boundary that records intent before the external call and records the result after it:

async function executeToolCall({
  db,
  runId,
  stepId,
  toolName,
  request,
  callTool,
}: {
  db: Database;
  runId: string;
  stepId: string;
  toolName: string;
  request: unknown;
  callTool: (request: unknown, idempotencyKey: string) => Promise<unknown>;
}) {
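  // Derive the key only from stable identifiers so that a retry of the same
  // step produces the same key.
  const idempotencyKey = `${runId}:${stepId}:${toolName}`;

  // If an earlier attempt already finished, return its stored response
  // instead of repeating the side effect.
  const existing = await db.toolCalls.findByIdempotencyKey(idempotencyKey);
  if (existing?.status === "completed") {
    return existing.response;
  }

  // Record intent before the external call so recovery code can see the attempt.
  await db.toolCalls.upsertStarted({
    runId,
    stepId,
    toolName,
    idempotencyKey,
    request,
  });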

  try {
    const response = await callTool(request, idempotencyKey);

    await db.toolCalls.markCompleted({
      idempotencyKey,
      response,
    });

    return response;
  } catch (error) {
    await db.toolCalls.markFailed({
      idempotencyKey,
      error: serializeError(error),
    });

    throw error;
  }
}

The exact database wrapper matters less than the behavior: retries use the same idempotency key and consult stored tool state before repeating work. Database and serializeError stand in for whatever persistence layer and error serializer your application provides.

Reconcile uncertain outcomes

Not every tool supports idempotency natively. Some APIs time out after they already performed the work. Some systems accept an idempotency key only for a short window. Some tools have no query endpoint that can prove whether the side effect happened.

For those tools, add a reconciliation strategy:

  • Store a natural external identifier, such as a ticket key, deployment ID, or message ID.
  • Query the external system before retrying a write.
  • Prefer deterministic resource names when creating external records.
  • Move the run to recovering when the outcome cannot be proven automatically.
  • Require human review before repeating a high-impact side effect.

The workflow should have a first-class "uncertain" path. Pretending an uncertain write is a normal transient failure is how duplicate comments, duplicate emails, duplicate refunds, and conflicting deployments appear.
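One way to picture the reconciliation strategy is a write helper that looks for a deterministic external reference before creating anything. The sketch below uses an in-memory FakeTicketSystem as a stand-in for a real API. The class, the reconcileWrite helper, and the rule that low-impact writes may proceed when the outcome cannot be proven are assumptions for illustration only.

```typescript
type Ticket = { key: string; externalRef: string; title: string };

// Hypothetical in-memory stand-in for an external ticket system.
class FakeTicketSystem {
  private tickets: Ticket[] = [];

  findByExternalRef(externalRef: string): Ticket | null {
    for (const ticket of this.tickets) {
      if (ticket.externalRef === externalRef) return ticket;
    }
    return null;
  }

  create(externalRef: string, title: string): Ticket {
    const ticket: Ticket = { key: `TCK-${this.tickets.length + 1}`, externalRef, title };
    this.tickets.push(ticket);
    return ticket;
  }

  count(): number {
    return this.tickets.length;
  }
}

type ReconcileResult =
  | { outcome: "found_existing"; ticket: Ticket }
  | { outcome: "created"; ticket: Ticket }
  | { outcome: "needs_human_review" };

function reconcileWrite(input: {
  ticketSystem: FakeTicketSystem;
  externalRef: string; // deterministic, for example derived from run and step IDs
  title: string;
  priorAttemptRecorded: boolean; // a started record exists with no response
  canQueryExternalSystem: boolean;
  highImpact: boolean;
}): ReconcileResult {
  if (input.canQueryExternalSystem) {
    // Query before writing: a hit proves the earlier attempt succeeded.
    const existing = input.ticketSystem.findByExternalRef(input.externalRef);
    if (existing !== null) {
      return { outcome: "found_existing", ticket: existing };
    }
    return { outcome: "created", ticket: input.ticketSystem.create(input.externalRef, input.title) };
  }

  // The outcome cannot be proven. Do not blindly repeat a high-impact write.
  if (input.priorAttemptRecorded && input.highImpact) {
    return { outcome: "needs_human_review" };
  }

  // Assumption: a possible duplicate is acceptable for low-impact writes.
  return { outcome: "created", ticket: input.ticketSystem.create(input.externalRef, input.title) };
}
```

In this sketch, a retry after a crash finds the ticket created by the first attempt, and an unprovable high-impact write is routed to a person instead of being repeated.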

Add Human Gates and Policy Boundaries

Durable does not mean fully autonomous. Many useful agent workflows should pause for human approval before crossing a boundary. Examples include sending external communication, deleting data, merging a pull request, changing infrastructure, granting access, or spending budget.

An approval gate should be a workflow state, not a chat message hidden in context. Store who can approve, what exactly they are approving, and what will happen next:

type ApprovalRequest = {
  id: string;
  runId: string;
  stepId: string;
  requestedAction: string;
  summary: string;
  riskLevel: "low" | "medium" | "high";
  approvers: string[];
  status: "pending" | "approved" | "rejected" | "expired";
  expiresAt: string;
};

When the approval arrives, resume from the stored workflow state instead of asking the model to infer what happened. The approval response becomes another event in the run history. That creates a clear audit trail:

  • The model proposed an action.
  • The workflow classified the action as approval-required.
  • A specific person approved or rejected a specific request.
  • The worker resumed and executed the next step.

Policy checks should run both before and after model decisions. Before the decision, they constrain available tools and data. After the decision, they validate the proposed action. This double check helps when context changes while the run is paused, such as a ticket being closed, a deploy freeze starting, or a user's permissions changing.
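Resolving an approval can be sketched as a pure function over the stored request, so the worker never has to ask the model what a human decided. The ApprovalDecision type, the resolveApproval helper, and the failure reasons are hypothetical. Timestamps are assumed to be ISO 8601 strings.

```typescript
type ApprovalStatus = "pending" | "approved" | "rejected" | "expired";

// Same fields as the article's ApprovalRequest type.
type ApprovalRequest = {
  id: string;
  runId: string;
  stepId: string;
  requestedAction: string;
  summary: string;
  riskLevel: "low" | "medium" | "high";
  approvers: string[];
  status: ApprovalStatus;
  expiresAt: string;
};

// Hypothetical shape of an incoming human decision.
type ApprovalDecision = {
  approver: string;
  approve: boolean;
  decidedAt: string;
};

type ApprovalOutcome =
  | { ok: true; request: ApprovalRequest }
  | { ok: false; reason: "not_pending" | "not_authorized" | "expired"; request: ApprovalRequest };

function resolveApproval(request: ApprovalRequest, decision: ApprovalDecision): ApprovalOutcome {
  // A request can be resolved only once.
  if (request.status !== "pending") {
    return { ok: false, reason: "not_pending", request };
  }

  // Only the stored approvers may decide.
  if (request.approvers.indexOf(decision.approver) === -1) {
    return { ok: false, reason: "not_authorized", request };
  }

  // A late decision expires the request instead of resuming the run.
  if (Date.parse(decision.decidedAt) > Date.parse(request.expiresAt)) {
    return { ok: false, reason: "expired", request: { ...request, status: "expired" } };
  }

  const status: ApprovalStatus = decision.approve ? "approved" : "rejected";
  return { ok: true, request: { ...request, status } };
}
```

The returned request would be persisted as an approval_response step. After that, the worker would re-run its policy checks before executing the approved action, since context may have changed while the run was paused.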

Test Failure Modes Before You Trust Autonomy

An agent workflow is not durable because it has a database table. It is durable when the failure tests prove that retries, resumes, and cancellations preserve intent.

Write tests that kill the worker at awkward points:

it("does not repeat a completed write tool after worker restart", async () => {
  const run = await createRun({
    objective: "Open a follow-up ticket for incident INC-42",
  });

  const tool = createFakeTicketTool();

  await runUntilAfterToolResponse({
    runId: run.id,
    tool,
    crashBeforeCheckpoint: true,
  });

  await resumeRun({
    runId: run.id,
    tool,
  });

  expect(tool.createdTickets).toHaveLength(1);

  const completed = await db.toolCalls.findCompletedByRun(run.id);
  expect(completed).toHaveLength(1);
  expect(completed[0].response.ticketKey).toBe(tool.createdTickets[0].key);
});

Add tests for the cases operators actually fear:

  • A model proposes a disallowed tool.
  • A tool times out after performing the side effect.
  • A worker crashes after storing intent but before calling the tool.
  • A worker crashes after the tool response but before advancing the run.
  • A human approval arrives after the approval request expires.
  • Two workers try to resume the same run at the same time.
  • A cancellation arrives while a tool call is in progress.

The concurrency case deserves special attention. Use optimistic concurrency on the agent_runs.version field so only one worker can advance a run from a specific state. If a second worker loses the update race, it should reload the run and decide whether any work remains.
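The version check can be illustrated with an in-memory store whose update succeeds only when the caller's version still matches. This is a sketch: InMemoryRunStore and advanceIfVersionMatches are hypothetical names, and in a relational database the same idea is usually expressed as an UPDATE whose WHERE clause includes the expected version, followed by a check of the affected row count.

```typescript
type RunRow = { id: string; status: string; version: number };

// Hypothetical in-memory stand-in for the agent_runs table.
class InMemoryRunStore {
  private rows: { [id: string]: RunRow } = {};

  insert(row: RunRow): void {
    this.rows[row.id] = { id: row.id, status: row.status, version: row.version };
  }

  // Returns a copy so callers hold a snapshot, as they would after a SELECT.
  load(id: string): RunRow | null {
    const row = this.rows[id];
    return row ? { id: row.id, status: row.status, version: row.version } : null;
  }

  // Compare-and-swap: advance only if nobody else advanced the run first.
  advanceIfVersionMatches(id: string, expectedVersion: number, nextStatus: string): boolean {
    const row = this.rows[id];
    if (!row || row.version !== expectedVersion) {
      return false;
    }
    row.status = nextStatus;
    row.version += 1;
    return true;
  }
}
```

If two workers load the same run at version 1, the first call to advanceIfVersionMatches succeeds and bumps the version to 2. The second call returns false, which is the signal to reload the run and check whether any work remains.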

Conclusion and Next Steps

Durable agentic workflows are less about exotic AI behavior and more about ordinary distributed-systems discipline. Persist intent, checkpoint progress, make tool calls idempotent, model human approvals explicitly, and test recovery paths before users depend on the workflow.

Start with one workflow that has real value and real risk. Draw its states, list its write tools, assign idempotency keys, add an approval gate for high-impact actions, and write restart tests around the scariest failure points. Once that workflow can resume cleanly after crashes and timeouts, the agent becomes a reliable worker instead of a fragile chat loop.