At re:Invent 2025, AWS announced Lambda Durable Functions — the ability to write multi-step, stateful workflows directly in Lambda code with automatic checkpointing, suspension for up to one year, and failure recovery. No Step Functions state machine. No SQS queues stitching Lambdas together. Just Python or Node.js code that can pause and resume.

This is the biggest Lambda feature since custom runtimes. It directly addresses the most common reason teams reach for Step Functions: coordinating multi-step workflows where each step depends on the previous one. But it doesn't replace Step Functions entirely. The two tools solve overlapping but different problems.

Here's how Durable Functions work, when to use them, and when Step Functions is still the right call.

What Durable Functions actually do

A durable function is a regular Lambda function that can call ctx.checkpoint() to save its execution state. When the function hits a checkpoint, Lambda serializes the current state. If the function then has to wait (for a timer, an event, or an API call), Lambda suspends it and frees the compute. When the wait is over, Lambda restores the state and resumes from exactly where it left off.

Here's a minimal example in Python:

    import aws_lambda_durable as durable

    @durable.function
    def process_order(ctx, event):
        order_id = event['order_id']

        # Step 1: Validate the order
        validation = validate_order(order_id)
        ctx.checkpoint()

        # Step 2: Charge payment (might take seconds or days for approval)
        payment = charge_payment(order_id, validation['total'])
        ctx.checkpoint()

        # Step 3: Fulfill the order
        fulfillment = fulfill_order(order_id, payment['transaction_id'])
        ctx.checkpoint()

        # Step 4: Send confirmation
        send_confirmation(order_id, fulfillment['tracking_number'])

        return {'status': 'completed', 'order_id': order_id}

That's a four-step workflow in a single function. Each ctx.checkpoint() saves the state. If the function fails at step 3, it resumes from the last checkpoint — step 2's result is already saved, so it doesn't re-charge the payment. If step 2 needs to wait for a manual approval, the function suspends for hours or days without consuming compute.
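
To make the "doesn't re-charge the payment" behavior concrete, here is a toy simulation in plain Python. It is not the Lambda API and not how AWS implements the feature. It uses record-and-replay, one common way to build durable execution: each step's result is stored under its position, and a restarted run returns stored results instead of calling the step again. ToyContext, its step() method, the store dict, and the stubbed order helpers are all hypothetical names invented for this sketch.

```python
import json


class ToyContext:
    """Hypothetical stand-in for a durable context (not the AWS implementation)."""

    def __init__(self, store):
        self.store = store    # dict that outlives a single "invocation"
        self.position = 0     # index of the next step within this invocation

    def step(self, fn, *args):
        """Run fn once; on a later invocation, return the saved result instead."""
        key = str(self.position)
        self.position += 1
        if key in self.store:
            return json.loads(self.store[key])    # replay: the side effect is skipped
        result = fn(*args)
        self.store[key] = json.dumps(result)      # checkpoint the step's result
        return result


calls = []  # records every real side effect so we can see what re-runs


def validate_order(order_id):
    calls.append("validate")
    return {"total": 42}


def charge_payment(order_id, total):
    calls.append("charge")
    return {"transaction_id": "txn-1"}


def fulfill_order(order_id, transaction_id, fail):
    calls.append("fulfill")
    if fail:
        raise RuntimeError("warehouse API unavailable")
    return {"tracking_number": "trk-1"}


def process_order(ctx, order_id, fail_fulfillment=False):
    validation = ctx.step(validate_order, order_id)
    payment = ctx.step(charge_payment, order_id, validation["total"])
    fulfillment = ctx.step(
        fulfill_order, order_id, payment["transaction_id"], fail_fulfillment
    )
    return {"status": "completed", "tracking": fulfillment["tracking_number"]}


store = {}  # plays the role of checkpoint storage

try:
    process_order(ToyContext(store), "o-1", fail_fulfillment=True)
except RuntimeError:
    pass  # the first invocation dies in step 3

result = process_order(ToyContext(store), "o-1")  # "resume" against the same store
print(calls)
```

The printed list is ['validate', 'charge', 'fulfill', 'fulfill']: only the failed step runs again, and the charge happens exactly once. The same idea explains why work between checkpoints should be safe to repeat, since anything after the last saved point runs again on resume.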

Key capabilities

Supported runtimes

Durable Functions are currently available for Python 3.13, Python 3.14, Node.js 22, and Node.js 24, and are GA in US East (N. Virginia and Ohio). Global rollout is expected in Q2 2026.

The AI agent use case

The killer use case that AWS highlighted — and the one that makes Durable Functions genuinely new rather than just convenient — is AI agent orchestration.

An AI agent workflow typically looks like this: receive a user request, call an LLM, parse the response, call a tool or API based on the response, call the LLM again with the tool result, repeat until done. Each LLM call might take 5-30 seconds. The total workflow might involve 10-20 steps. Some steps require human approval.

Before Durable Functions, you had two options: keep a Lambda running for the entire duration (expensive, hits the 15-minute timeout), or orchestrate with Step Functions (works but requires translating your agent logic into a state machine DSL). Durable Functions let you write the agent loop as normal code:

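    import aws_lambda_durable as durable

    @durable.function
    def ai_agent(ctx, event):
        # call_llm and execute_tool are application helpers defined elsewhere
        messages = [{'role': 'user', 'content': event['prompt']}]
        max_iterations = 20

        for i in range(max_iterations):
            # Call the LLM
            response = call_llm(messages)
            ctx.checkpoint()

            # Check if the agent wants to use a tool
            if response.get('tool_call'):
                # Record the assistant turn so the next LLM call sees what it asked for
                messages.append({'role': 'assistant',
                                 'content': response.get('content'),
                                 'tool_call': response['tool_call']})
                tool_result = execute_tool(response['tool_call'])
                ctx.checkpoint()
                messages.append({'role': 'tool', 'content': tool_result})
            else:
                # Agent is done
                return {'result': response['content']}

        return {'result': 'max iterations reached'}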

Each LLM call and tool execution gets checkpointed. If the function fails or times out, it resumes from the last successful step. If a tool call requires human approval, the function suspends until the approval comes in. The total execution might span minutes or hours, but you only pay for the seconds of actual compute.
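
A rough back-of-the-envelope calculation shows why "pay only for active compute" matters for agents. This is a sketch under stated assumptions: it uses the published on-demand x86 Lambda duration rate of $0.0000166667 per GB-second, ignores request charges and any checkpoint or storage charges, and invents a run of 15 steps averaging 10 seconds each with a 6-hour approval wait. The billed-wait figure is a counterfactual for comparison only, since a regular Lambda cannot run past 15 minutes anyway.

```python
# Assumption: on-demand x86 Lambda duration rate; check current pricing for your region.
PRICE_PER_GB_SECOND = 0.0000166667


def compute_cost(memory_mb, billed_seconds):
    """Cost of billed duration only; ignores request, checkpoint, and storage charges."""
    return (memory_mb / 1024) * billed_seconds * PRICE_PER_GB_SECOND


# Hypothetical agent run: 15 steps averaging 10 s each, plus a 6-hour approval wait.
active_seconds = 15 * 10
suspended_seconds = 6 * 60 * 60

cost_active_only = compute_cost(256, active_seconds)
cost_if_wait_were_billed = compute_cost(256, active_seconds + suspended_seconds)

print(f"billed for active compute only: ${cost_active_only:.6f}")
print(f"if the wait were billed too:    ${cost_if_wait_were_billed:.6f}")
```

With these numbers the suspended time dominates the wall-clock duration by more than two orders of magnitude, so whether it is billed decides almost the entire cost of the run.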

How to deploy it

Durable Functions are deployed like regular Lambda functions with a few additional configuration properties. Here's the SAM template (CloudFormation with the Serverless transform):

    AWSTemplateFormatVersion: '2010-09-09'
    Transform: AWS::Serverless-2016-10-31

    Resources:
      OrderProcessor:
        Type: AWS::Serverless::Function
        Properties:
          Runtime: python3.14
          Handler: handler.process_order
          CodeUri: ./src
          Timeout: 900
          MemorySize: 256
          DurableFunction:
            Enabled: true
            MaxSuspensionSeconds: 86400
            CheckpointStorage:
              Type: S3
              BucketArn: !GetAtt CheckpointBucket.Arn
          Policies:
            - S3CrudPolicy:
                BucketName: !Ref CheckpointBucket

      CheckpointBucket:
        Type: AWS::S3::Bucket
        Properties:
          LifecycleConfiguration:
            Rules:
              - ExpirationInDays: 30
                Status: Enabled

The key additions: DurableFunction.Enabled: true, a MaxSuspensionSeconds limit, and checkpoint storage configuration. AWS manages the serialization; you manage the storage bucket and its lifecycle rules.

When to use Durable Functions vs. Step Functions

They're not interchangeable. Here's the decision framework:

| Requirement                    | Durable Functions  | Step Functions             |
|--------------------------------|--------------------|----------------------------|
| Sequential multi-step workflow | Yes                | Yes                        |
| Logic lives in code (not DSL)  | Yes                | No (ASL/JSON)              |
| Parallel branches              | Limited            | Yes (native)               |
| Visual workflow editor         | No                 | Yes                        |
| Wait for external event        | Yes (up to 1 year) | Yes (up to 1 year)         |
| Complex branching/choice logic | Code (if/else)     | ASL Choice state           |
| Map over array (fan-out)       | Manual             | Native Map state           |
| Error handling with retries    | Code (try/except)  | Built-in Retry/Catch       |
| AI agent loops                 | Natural fit        | Possible but awkward       |
| Cross-service orchestration    | Via SDK calls      | Native AWS integrations    |
| Audit/compliance visibility    | CloudWatch logs    | Built-in execution history |
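
The "Code (try/except)" entry in the retries row means you write the retry policy yourself. Here is a minimal sketch in plain Python of what that might look like: exponential backoff with jitter, roughly the role that the MaxAttempts, IntervalSeconds, and BackoffRate fields play in a Step Functions Retry block. The with_retries helper and the flaky_charge stub are hypothetical names for this example and are not part of any AWS SDK.

```python
import random
import time


def with_retries(fn, *, max_attempts=3, base_delay=0.5,
                 retry_on=(Exception,), sleep=time.sleep):
    """Call fn(), retrying on the given exceptions with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            delay = base_delay * 2 ** (attempt - 1)       # 0.5 s, 1 s, 2 s, ...
            sleep(delay + random.uniform(0, delay / 2))   # jitter spreads out retries


attempts = []


def flaky_charge():
    """Stub that fails twice with a transient error, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("payment gateway timeout")
    return {"transaction_id": "txn-1"}


# sleep is injected so the demo runs instantly; omit it to really wait between tries.
payment = with_retries(flaky_charge, retry_on=(ConnectionError,),
                       sleep=lambda seconds: None)
print(len(attempts), payment)
```

Restricting retry_on to transient errors matters here: retrying a validation error or a declined card just repeats the failure, and retrying a non-idempotent call can duplicate its side effect.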

Use Durable Functions when:

Use Step Functions when:

Gotchas and limitations

Durable Functions are new, and the edges are rough in a few places:

The bottom line

Lambda Durable Functions don't replace Step Functions. They replace the pattern of using Step Functions when all you really needed was a Lambda function that could pause and resume. That pattern is extremely common — it covers most sequential workflows, most human-in-the-loop approvals, and most AI agent loops.

If you've ever built a Step Functions state machine that was essentially "call Lambda A, then call Lambda B, then call Lambda C" with some error handling, Durable Functions let you write that as a single function with checkpoints. The code is simpler, the deployment is simpler, and the debugging is simpler.

For anything with parallel execution, visual workflows, or deep AWS service integration, Step Functions remains the right tool. But for the 60-70% of workflows that are "do these things in order," Durable Functions are the better fit. And for AI agent orchestration specifically, they're not just better — they're the obvious choice.

Published by Yaw Labs.
