AWS just gave Lambda the ability to pause, checkpoint, and resume. Here's what that means for your workflow architecture.
At re:Invent 2025, AWS announced Lambda Durable Functions — the ability to write multi-step, stateful workflows directly in Lambda code with automatic checkpointing, suspension for up to one year, and failure recovery. No Step Functions state machine. No SQS queues stitching Lambdas together. Just Python or Node.js code that can pause and resume.
This is the biggest Lambda feature since custom runtimes. It directly addresses the most common reason teams reach for Step Functions: coordinating multi-step workflows where each step depends on the previous one. But it doesn't replace Step Functions entirely. The two tools solve overlapping but different problems.
Here's how Durable Functions work, when to use them, and when Step Functions is still the right call.
A durable function is a regular Lambda function that can call ctx.checkpoint() to save its execution state. When the function hits a checkpoint, Lambda serializes the current state, suspends the function, and frees the compute. When the function needs to continue — after a timer, an event, or an API call — Lambda restores the state and resumes from exactly where it left off.
Here's a minimal example in Python:
```python
import aws_lambda_durable as durable

@durable.function
def process_order(ctx, event):
    order_id = event['order_id']

    # Step 1: Validate the order
    validation = validate_order(order_id)
    ctx.checkpoint()

    # Step 2: Charge payment (might take seconds or days for approval)
    payment = charge_payment(order_id, validation['total'])
    ctx.checkpoint()

    # Step 3: Fulfill the order
    fulfillment = fulfill_order(order_id, payment['transaction_id'])
    ctx.checkpoint()

    # Step 4: Send confirmation
    send_confirmation(order_id, fulfillment['tracking_number'])
    return {'status': 'completed', 'order_id': order_id}
```
That's a four-step workflow in a single function. Each ctx.checkpoint() saves the state. If the function fails at step 3, it resumes from the last checkpoint — step 2's result is already saved, so it doesn't re-charge the payment. If step 2 needs to wait for a manual approval, the function suspends for hours or days without consuming compute.
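To make the resume behavior concrete, here is a minimal pure-Python sketch of checkpoint-and-replay. It does not use the Durable Functions SDK: `CheckpointStore`, `run_step`, and the stubbed order helpers are hypothetical stand-ins, and an in-memory dict stands in for whatever storage Lambda actually uses. The assumption is that each completed step's result is recorded under a name, so a retry skips steps that already finished.

```python
# Illustrative simulation only: none of these names come from the AWS SDK.
class CheckpointStore:
    """In-memory stand-in for durable checkpoint storage."""

    def __init__(self):
        self.results = {}

    def run_step(self, name, fn, *args):
        # Replay: a step that already completed returns its saved result.
        if name in self.results:
            return self.results[name]
        result = fn(*args)
        self.results[name] = result  # "checkpoint" after the step succeeds
        return result


charges = []              # records every real call to the payment stub
fail_fulfillment = True   # makes step 3 fail on the first attempt only


def validate_order(order_id):
    return {"total": 42.0}


def charge_payment(order_id, total):
    charges.append((order_id, total))
    return {"transaction_id": f"txn-{order_id}"}


def fulfill_order(order_id, transaction_id):
    global fail_fulfillment
    if fail_fulfillment:
        fail_fulfillment = False
        raise RuntimeError("warehouse API unavailable")
    return {"tracking_number": f"trk-{order_id}"}


def process_order(store, order_id):
    validation = store.run_step("validate", validate_order, order_id)
    payment = store.run_step("charge", charge_payment, order_id, validation["total"])
    fulfillment = store.run_step(
        "fulfill", fulfill_order, order_id, payment["transaction_id"]
    )
    return {"status": "completed", "tracking": fulfillment["tracking_number"]}


store = CheckpointStore()
try:
    process_order(store, "A1")       # first attempt fails at step 3
except RuntimeError:
    pass
result = process_order(store, "A1")  # retry replays steps 1 and 2 from the store
```

After the retry, `charges` holds a single entry: the payment stub ran once even though `process_order` ran twice.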
Durable Functions are currently available for Python 3.13, Python 3.14, Node.js 22, and Node.js 24. They are generally available in US East (N. Virginia and Ohio), with a global rollout expected in Q2 2026.
The killer use case that AWS highlighted — and the one that makes Durable Functions genuinely new rather than just convenient — is AI agent orchestration.
An AI agent workflow typically looks like this: receive a user request, call an LLM, parse the response, call a tool or API based on the response, call the LLM again with the tool result, repeat until done. Each LLM call might take 5-30 seconds. The total workflow might involve 10-20 steps. Some steps require human approval.
Before Durable Functions, you had two options: keep a Lambda running for the entire duration (expensive, hits the 15-minute timeout), or orchestrate with Step Functions (works but requires translating your agent logic into a state machine DSL). Durable Functions let you write the agent loop as normal code:
```python
import aws_lambda_durable as durable

@durable.function
def ai_agent(ctx, event):
    messages = [{'role': 'user', 'content': event['prompt']}]
    max_iterations = 20

    for _ in range(max_iterations):
        # Call the LLM
        response = call_llm(messages)
        ctx.checkpoint()

        # Check if the agent wants to use a tool
        if response.get('tool_call'):
            # Keep the assistant turn so the next LLM call sees its own tool request
            messages.append({'role': 'assistant', 'tool_call': response['tool_call']})
            tool_result = execute_tool(response['tool_call'])
            ctx.checkpoint()
            messages.append({'role': 'tool', 'content': tool_result})
        else:
            # Agent is done
            return {'result': response['content']}

    return {'result': 'max iterations reached'}
```
Each LLM call and tool execution gets checkpointed. If the function fails or times out, it resumes from the last successful step. If a tool call requires human approval, the function suspends until the approval comes in. The total execution might span minutes or hours, but you only pay for the seconds of actual compute.
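The human-approval case can be sketched in a few lines. This is a hedged pure-Python simulation, not the SDK: `Suspended`, `AgentRun`, `wait_for_approval`, and `fake_llm` are hypothetical names, and the assumption is that a suspended run keeps its recorded step results and is invoked again once an external approval arrives.

```python
# Illustrative simulation only: these names are hypothetical, not the AWS SDK.
class Suspended(Exception):
    """Raised to stop executing until an external event arrives."""


class AgentRun:
    def __init__(self):
        self.journal = {}    # results of steps that already completed
        self.approvals = {}  # approval decisions delivered from outside
        self.llm_calls = 0   # counts real (non-replayed) LLM invocations

    def step(self, name, fn):
        # Completed steps are replayed from the journal, not re-executed.
        if name not in self.journal:
            self.journal[name] = fn()
        return self.journal[name]

    def wait_for_approval(self, name):
        # No decision yet: stop here and wait to be invoked again.
        if name not in self.approvals:
            raise Suspended(name)
        return self.approvals[name]


def fake_llm(run):
    run.llm_calls += 1
    return {"tool_call": {"name": "refund", "amount": 500}}


def agent(run):
    response = run.step("llm-0", lambda: fake_llm(run))
    approved = run.wait_for_approval("approve-refund")
    if not approved:
        return {"result": "refund rejected"}
    return {"result": f"refunded {response['tool_call']['amount']}"}


run = AgentRun()
try:
    agent(run)            # suspends: nobody has approved yet
    suspended = False
except Suspended:
    suspended = True

run.approvals["approve-refund"] = True  # the approval event arrives later
result = agent(run)                     # resume: the LLM step is replayed
```

In this sketch the LLM stub runs once across both invocations, which mirrors the claim that waiting for approval does not repeat or bill the work already done.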
Durable Functions are deployed like regular Lambda functions with a few additional configuration properties. Here's the template, written with the AWS SAM transform on top of CloudFormation:
```yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Resources:
  OrderProcessor:
    Type: AWS::Serverless::Function
    Properties:
      Runtime: python3.14
      Handler: handler.process_order
      CodeUri: ./src
      Timeout: 900
      MemorySize: 256
      DurableFunction:
        Enabled: true
        MaxSuspensionSeconds: 86400
        CheckpointStorage:
          Type: S3
          BucketArn: !GetAtt CheckpointBucket.Arn
      Policies:
        - S3CrudPolicy:
            BucketName: !Ref CheckpointBucket
  CheckpointBucket:
    Type: AWS::S3::Bucket
    Properties:
      LifecycleConfiguration:
        Rules:
          - ExpirationInDays: 30
            Status: Enabled
```
The key additions: DurableFunction.Enabled: true, a MaxSuspensionSeconds limit, and checkpoint storage configuration. AWS manages the serialization; you manage the storage bucket and its lifecycle rules.
Durable Functions and Step Functions are not interchangeable. Here's the decision framework:
| Requirement | Durable Functions | Step Functions |
|---|---|---|
| Sequential multi-step workflow | Yes | Yes |
| Logic lives in code (not DSL) | Yes | No — ASL/JSON |
| Parallel branches | Limited | Yes — native |
| Visual workflow editor | No | Yes |
| Wait for external event | Yes (up to 1 year) | Yes (up to 1 year) |
| Complex branching/choice logic | Code — if/else | ASL Choice state |
| Map over array (fan-out) | Manual | Native Map state |
| Error handling with retries | Code — try/except | Built-in Retry/Catch |
| AI agent loops | Natural fit | Possible but awkward |
| Cross-service orchestration | Via SDK calls | Native AWS integrations |
| Audit/compliance visibility | CloudWatch logs | Built-in execution history |
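
The "Code" and "Manual" entries in the Durable Functions column mean you write the retry and fan-out logic yourself. Here is a minimal sketch in plain Python that assumes nothing about the Durable Functions SDK: `with_retries`, `fan_out`, and `flaky_resize` are hypothetical helpers, and the backoff values are illustrative only.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def with_retries(fn, *args, attempts=3, base_delay=0.01):
    """Call fn, retrying with exponential backoff. This is roughly what a
    Step Functions Retry block expresses declaratively."""
    for attempt in range(attempts):
        try:
            return fn(*args)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)


def fan_out(fn, items, max_workers=4):
    """Apply fn to every item concurrently, preserving input order.
    A hand-rolled analogue of a Map state."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, items))


calls = {"count": 0}


def flaky_resize(image_id):
    # Fails on the first two calls overall, then succeeds.
    calls["count"] += 1
    if calls["count"] <= 2:
        raise ConnectionError("transient failure")
    return f"{image_id}-resized"


first = with_retries(flaky_resize, "img-0")
rest = fan_out(lambda i: with_retries(flaky_resize, i), ["img-1", "img-2"])
```

The trade-off is visible even at this size: the logic is ordinary, testable code, but retry policy and concurrency limits live in your function rather than in a declarative workflow definition.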
Durable Functions are new, and the edges are rough in a few places:
Lambda Durable Functions don't replace Step Functions. They replace the pattern of using Step Functions when all you really needed was a Lambda function that could pause and resume. That pattern is extremely common — it covers most sequential workflows, most human-in-the-loop approvals, and most AI agent loops.
If you've ever built a Step Functions state machine that was essentially "call Lambda A, then call Lambda B, then call Lambda C" with some error handling, Durable Functions let you write that as a single function with checkpoints. The code is simpler, the deployment is simpler, and the debugging is simpler.
For anything with parallel execution, visual workflows, or deep AWS service integration, Step Functions remains the right tool. But for the 60-70% of workflows that are "do these things in order," Durable Functions are the better fit. And for AI agent orchestration specifically, they're not just better — they're the obvious choice.
Published by Yaw Labs.