When us‑east‑1 Sneezes: A Field Guide to the Oct 19–20 AWS Outage

- Date of event: Oct 19–20, 2025 (Pacific Time)
- Primary trigger: DNS automation defect for DynamoDB’s regional endpoint
- Blast radius: DynamoDB → EC2 launches → Network Manager → NLB → (Lambda, STS, IAM sign‑in, Redshift, ECS/EKS/Fargate, Connect) in us‑east‑1
TL;DR (for busy humans)
- What kicked it off (11:48 PM Oct 19): A race condition in DynamoDB’s automated DNS management flipped the regional endpoint dynamodb.us-east-1.amazonaws.com to an empty record. Anything trying to open a new connection to DynamoDB in us‑east‑1 failed DNS resolution.
- Primary DynamoDB recovery (2:25–2:40 AM Oct 20): AWS restored correct DNS; customers recovered as cached DNS entries expired.
- Why it didn’t end there: EC2’s internal control plane depends on DynamoDB for host (“droplet”) leases. Those leases expired during the DNS window, leading to congestive collapse in the droplet workflow manager (DWFM) and a backlog in Network Manager. New EC2 instance launches errored or came up without network state.
- Load balancer turbulence (5:30 AM–2:09 PM): Network Load Balancer (NLB) health checks saw flapping due to delayed network propagation, causing intermittent capacity removal and connection errors.
- Full steady‑state return: Most services stabilized by early afternoon Oct 20; Redshift stragglers (due to EC2 replacements) finished by 4:05 AM Oct 21.
The “what happened,” in one diagram-like timeline (PDT)
| Time window (Oct 19–20) | What customers saw | What actually broke under the hood |
|---|---|---|
| 11:48 PM → 2:40 AM | DynamoDB API errors in us‑east‑1; anything starting new connections failed | DNS Planner/Enactor race produced an empty Route 53 record for DynamoDB’s regional endpoint; manual fix restored records at 2:25 AM; clients recovered as caches expired by 2:40 AM |
| 2:25 AM → 1:50 PM | EC2: launch errors (“insufficient capacity” / “request limit exceeded”), elevated API latency | DWFM leases to physical hosts (“droplets”) had lapsed; recovery created massive re‑lease work, causing queue collapse. Throttling + selective restarts (from 4:14 AM) enabled catch‑up; by 5:28 AM leases restored; throttles relaxed 11:23 AM; full recovery 1:50 PM |
| 5:30 AM → 2:09 PM | NLB: increased connection errors for some load balancers | NLB health checks flapped because Network Manager hadn’t fully propagated network state to new instances; auto AZ failover removed capacity; 9:36 AM: disabled auto failover to stop capacity churn; re‑enabled 2:09 PM |
| Service-by-service echoes | Lambda (to 2:15 PM), STS (to 9:59 AM), Console IAM sign‑in (to 1:25 AM), ECS/EKS/Fargate (to 2:20 PM), Connect (to 1:20 PM), Redshift query/control plane (most by 2:21 AM, some compute until Oct 21 4:05 AM) | Each had a different coupling to DynamoDB, EC2 launch capacity, NLB health, or IAM/STS, so the symptoms arrived and cleared on different clocks |
Two minutes on the core concepts (for newer readers)
DNS (Domain Name System) Think of DNS as the Internet’s phone book. You ask for a name; it returns where to call (IP addresses). At hyperscale, providers use DNS records and health checks to steer traffic across huge fleets and Availability Zones. It’s fast and ubiquitous, but not a transactional database; consistency and change‑ordering are hard problems that must be engineered around.
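To make the phone‑book analogy concrete, here is a minimal resolution probe using the third‑party dnspython library. It checks that DynamoDB’s regional endpoint actually answers with A records and reports the TTL; an empty or missing answer is roughly what new connections hit during the outage window. The alerting behavior and error handling are illustrative assumptions, not a prescribed monitoring setup.

```python
# Minimal DNS probe (pip install dnspython). Illustrative only.
import dns.resolver
import dns.exception

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

def probe(name: str) -> None:
    try:
        answer = dns.resolver.resolve(name, "A")
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
        # An empty/missing record set is what broke new connections on Oct 19-20.
        print(f"ALERT: {name} returned no A records")
        return
    except dns.exception.DNSException as exc:
        print(f"ALERT: resolution failed for {name}: {exc}")
        return
    ips = [rr.address for rr in answer]
    # answer.rrset.ttl is how long resolvers may cache this answer;
    # short TTLs speed failover but raise query volume.
    print(f"{name} -> {ips} (TTL {answer.rrset.ttl}s)")

if __name__ == "__main__":
    probe(ENDPOINT)
```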
DynamoDB AWS’s fully managed key‑value/NoSQL database. Super low‑latency, elastic, and used everywhere—including by AWS itself for control planes and metadata. Global Tables replicate across Regions. If your app or an AWS subsystem needs to read/write control state and can’t reach DynamoDB, bad things stack up.
EC2 & the control plane EC2 runs your instances atop “droplets” (physical hosts). Internal services manage leases to those hosts and propagate network state (routes, security groups, etc.). Launching a new instance is easy—unless the internal coordination is unhealthy or backlogged.
NLB (Network Load Balancer) A layer‑4 load balancer. It does health checks and can route across AZs. If health checks flap or DNS‑based failover is over‑eager, you can momentarily remove too much capacity—precisely when you need more.
What actually failed: the DynamoDB DNS automation
AWS splits DynamoDB DNS management into two roles:
- DNS Planner: computes the desired record sets (which load balancers, with which weights) for each endpoint (regional, FIPS, IPv6, account‑specific).
- DNS Enactor: three redundant workers in different AZs that apply those plans to Route 53, each doing transactional updates.
Under unusual contention, one Enactor progressed slowly while another raced ahead with newer plans and then garbage‑collected older generations. The slow Enactor’s stale “older” plan then overwrote the regional endpoint immediately before the fast Enactor’s cleanup deleted that older plan—atomically removing all IPs for the endpoint and leaving the system in a state that blocked further automatic updates. Humans intervened to repair.
Why this mattered: the regional endpoint vanished. Anything that needed a new DNS resolution to DynamoDB in us‑east‑1 promptly failed. Existing connections generally kept working.
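AWS hasn’t published the Enactor’s code, but the two ingredients of the failure (an empty record set, and a stale plan applied out of order) suggest the shape of a guard. Below is a rough sketch using boto3’s Route 53 API; the hosted zone ID, the in‑memory generation counter, and the plan format are hypothetical, and this is emphatically not AWS’s actual implementation.

```python
# Hypothetical safeguard sketch, not AWS's DNS Enactor.
# Refuses to apply an empty plan or one older than what is already applied.
import boto3

route53 = boto3.client("route53")
HOSTED_ZONE_ID = "Z0EXAMPLE"        # placeholder
last_applied_generation = 0         # in a real system this would be durable state

def apply_plan(name: str, generation: int, ips: list) -> None:
    global last_applied_generation
    if not ips:
        # Guard 1: never push an empty record set for a live endpoint.
        raise ValueError(f"refusing to apply empty plan for {name}")
    if generation <= last_applied_generation:
        # Guard 2: a slow writer must lose, not overwrite a newer plan.
        raise ValueError(f"stale plan {generation} <= {last_applied_generation}")
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": f"plan generation {generation}",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": name,
                    "Type": "A",
                    "TTL": 5,
                    "ResourceRecords": [{"Value": ip} for ip in ips],
                },
            }],
        },
    )
    last_applied_generation = generation
```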
Why it cascaded: tight couplings and time’s arrow
- Control-plane coupling (EC2 ⇄ DynamoDB). EC2’s Droplet Workflow Manager (DWFM) keeps leases on physical hosts, and those lease checks depend on DynamoDB. During the DNS event, leases timed out, making droplets ineligible for new instance placement. When DynamoDB recovered, DWFM had to re‑establish leases across a huge fleet, hit queue timeouts, and fell into congestive collapse (a toy lease sketch at the end of this section shows the shape of that stampede).
- Delayed network propagation. Even when instances began launching again, Network Manager had a backlog publishing their network state. Instances came up without full connectivity until propagations caught up (~10:36 AM).
- Health-check feedback loops (NLB). New capacity with incomplete network state looked unhealthy, so NLB withdrew nodes or AZs from DNS, temporarily shrinking capacity, which made connection errors worse. Operators paused automatic health‑check failover to stop the oscillation, then re‑enabled it after EC2 stabilized.
- Downstream service echo. Lambda throttled certain paths to protect synchronous invocations; STS/IAM sign‑in faltered, then recovered; container control planes (ECS/EKS/Fargate) and Connect felt the combined EC2/NLB effects; Redshift had an additional twist: IAM‑backed query auth briefly failed globally, and some clusters waited for EC2 host replacements, dragging into Oct 21.
If you’re thinking “Swiss cheese model”—many thin layers of defense aligned just so—you’re reading this right.
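To feel why expired leases turn into a recovery stampede, here is a toy lease model; it is not DWFM, and every name and number in it is an assumption. The point is the shape: when the backing store comes back after an outage longer than the lease TTL, every host needs a new lease at once, and an unthrottled manager buries itself.

```python
# Toy lease model; nothing here reflects DWFM's real design.
import time

LEASE_TTL = 30.0             # seconds a lease stays valid without renewal
RELEASE_RATE_PER_SEC = 200   # throttle: re-lease attempts allowed per second

class Host:
    def __init__(self, host_id: str):
        self.host_id = host_id
        self.lease_expires_at = time.time() + LEASE_TTL

    def lease_valid(self, now: float) -> bool:
        return now < self.lease_expires_at

def renew_all(hosts, store_reachable: bool) -> None:
    """Steady state: cheap, incremental renewals."""
    if not store_reachable:
        return  # renewals silently fail; leases drift toward expiry
    now = time.time()
    for h in hosts:
        h.lease_expires_at = now + LEASE_TTL

def recover(hosts) -> None:
    """After an outage longer than LEASE_TTL, *every* lease has lapsed.
    Without the pacing below, this loop is the stampede."""
    now = time.time()
    expired = [h for h in hosts if not h.lease_valid(now)]
    print(f"{len(expired)} hosts need re-leasing")
    for i, h in enumerate(expired):
        h.lease_expires_at = time.time() + LEASE_TTL
        if (i + 1) % RELEASE_RATE_PER_SEC == 0:
            time.sleep(1.0)  # pace the work so downstream queues don't time out

if __name__ == "__main__":
    fleet = [Host(f"host-{i}") for i in range(1000)]
    renew_all(fleet, store_reachable=True)
    # Simulate a dependency outage that outlasted LEASE_TTL: all leases lapse.
    for h in fleet:
        h.lease_expires_at = time.time() - 1
    recover(fleet)
```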
A complex‑systems lens (why “root cause” isn’t enough)
Several practitioners on HN invoked Richard Cook’s How Complex Systems Fail and Perrow’s Normal Accidents. That framing fits eerily well here:
- No single cause. The DNS race lit the match, but the long tail came from EC2’s metastable recovery mode and NLB’s health‑check oscillation—each a locally rational design that interacted poorly under stress.
- Systems usually run degraded. Planner/Enactor retries, lease expirations, backlogs—normal friction the system tolerates—aligned in unlucky ways.
- People create safety. Automation stalled; operators improvised repairs (manual DNS restore, DWFM throttling/restarts, disabling NLB auto‑failover) to re‑stitch the fabric.
Takeaway: Fix the bug, yes—but also hunt for metastable recovery traps and feedback loops that make small failures snowball.
What AWS says they’re changing
- DynamoDB DNS automation: disabled globally while they fix the race; add safeguards to prevent applying or deleting incorrect plans.
- NLB: add a velocity control so health‑check failover can’t remove too much capacity too quickly (sketched below).
- EC2: new recovery test suites for DWFM; better queue‑aware throttling across data propagation systems.
These are the right classes of mitigations: prevent bad writes, pace self‑healing, and exercise recovery paths under load.
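“Velocity control” and “damping” are easy to hand‑wave, so here is what they can look like in code. The sketch below uses two made‑up knobs: a consecutive‑failure threshold before a target becomes removable, and a cap on the fraction of capacity that may be removed per window. The class, numbers, and window are assumptions, not NLB’s implementation.

```python
# Illustrative health-check damper; thresholds and window are assumptions.
import time

UNHEALTHY_AFTER = 3          # consecutive failures before a target is removable
MAX_REMOVAL_FRACTION = 0.2   # never remove more than 20% of capacity per window
WINDOW_SECONDS = 60.0

class Damper:
    def __init__(self, targets):
        self.targets = set(targets)
        self.fail_streak = {t: 0 for t in self.targets}
        self.removed = set()
        self.window_start = time.time()
        self.removed_this_window = 0

    def record(self, target: str, healthy: bool) -> None:
        # Hysteresis: one bad probe never removes a target by itself.
        self.fail_streak[target] = 0 if healthy else self.fail_streak[target] + 1
        if healthy:
            self.removed.discard(target)
        elif self.fail_streak[target] >= UNHEALTHY_AFTER:
            self._maybe_remove(target)

    def _maybe_remove(self, target: str) -> None:
        now = time.time()
        if now - self.window_start > WINDOW_SECONDS:
            self.window_start, self.removed_this_window = now, 0
        budget = int(MAX_REMOVAL_FRACTION * len(self.targets))
        # Velocity control: once the per-window budget is spent, keep serving
        # through possibly-unhealthy targets instead of shrinking to nothing.
        if target not in self.removed and self.removed_this_window < budget:
            self.removed.add(target)
            self.removed_this_window += 1

    def in_service(self):
        return self.targets - self.removed
```

Feed it health‑check results and route only to in_service(); the worst case is serving through a sick target for a while, which beats serving through nothing.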
What you can do (Monday‑morning checklist)
Even if you can’t change AWS, you can soften the blow in your own stack.
- Prefer regional endpoints and remove hidden us‑east‑1 dependencies. Where services offer regionalized control planes (e.g., STS), use them. Don’t centralize auth or metadata in us‑east‑1 by habit.
- Design for multi‑Region data access—especially for critical control state. DynamoDB Global Tables can let you fail over reads/writes to a second Region. Your code must tolerate replication lag and reconcile later (see the fallback sketch after this checklist).
- Fail closed on health‑check oscillation. Gate automatic failover with damping: multiple consecutive failures, hysteresis, and per‑minute capacity removal caps. Consider a manual override circuit breaker to stop flapping.
- Make dependency graphs explicit. Document which services must be up to launch/scale (auth, queues, config stores). Practice “what if X is slow/unavailable for 2 hours?”
- Practice metastable recoveries. Chaos‑day for the control plane, not just the data plane:
- Simulate lease expiry on a fleet slice.
- Flood your network‑config propagator.
- Rehearse back‑pressure and throttle‑lift sequencing.
- Cache with intent. DNS caches hid the recovery lag (clients needed TTL expiry). For SDKs that open new connections aggressively, consider connection pooling and exponential backoff with jitter to avoid synchronized retries (the fallback sketch after this checklist includes jittered retries).
- Build “safe mode” runbooks.
- Reduced concurrency or rate limits under error spikes.
- Prefer draining backlogs before re‑enabling auto‑scalers.
- Use feature flags to disable expensive background jobs during control‑plane incidents.
- Guard your own “planner/enactor” patterns. If you generate desired state and apply it elsewhere (a minimal sketch follows this checklist), add:
- Versioned plans with compare‑and‑swap semantics.
- Cleanup that cannot delete the last known‑good plan.
- A single‑writer lease per endpoint or a small sequencer (vector clock / monotonic counter).
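For the planner/enactor item above, here is a minimal sketch of versioned plans with compare‑and‑swap, using a DynamoDB conditional write as the sequencer. The table name (“dns_plans”), key schema, and attributes are hypothetical; the point is that a stale writer loses the race instead of overwriting newer state, and that cleanup can never touch the current version.

```python
# Hypothetical CAS guard for a planner/enactor pipeline.
# Assumed table "dns_plans", keyed on "endpoint", with a numeric "version"
# attribute and a "plan" map holding the desired record set.
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("dns_plans")

def commit_plan(endpoint: str, version: int, plan: dict) -> bool:
    """Mark `version` of `plan` as the applied state for `endpoint`.
    The condition makes a stale writer fail instead of clobbering newer state."""
    try:
        table.put_item(
            Item={"endpoint": endpoint, "version": version, "plan": plan},
            ConditionExpression="attribute_not_exists(#ep) OR #ver < :new",
            ExpressionAttributeNames={"#ep": "endpoint", "#ver": "version"},
            ExpressionAttributeValues={":new": version},
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # a newer plan is already applied; leave it alone
        raise

def safe_to_delete(endpoint: str, old_versions: list) -> list:
    """Garbage-collect old plan artifacts, but never the last known-good one."""
    current = table.get_item(Key={"endpoint": endpoint})["Item"]["version"]
    return [v for v in old_versions if v < current]
```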
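And for the multi‑Region and “cache with intent” items, a sketch of reading critical control state with exponential backoff and full jitter in the primary Region before falling back to a Global Tables replica. The table name, Regions, key, and retry budget are assumptions; the real work is making your callers tolerate the replication lag the fallback path implies.

```python
# Hypothetical multi-Region read with jittered backoff.
# Assumes a Global Table named "control-state" replicated to both Regions.
import random
import time

import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

PRIMARY, FALLBACK = "us-east-1", "us-west-2"
TABLE = "control-state"
MAX_ATTEMPTS = 3

def _read(region: str, key: dict):
    table = boto3.resource("dynamodb", region_name=region).Table(TABLE)
    return table.get_item(Key=key).get("Item")

def get_control_state(key: dict):
    # Exponential backoff with full jitter, so a fleet of clients
    # doesn't retry in lockstep against a struggling Region.
    for attempt in range(MAX_ATTEMPTS):
        try:
            return _read(PRIMARY, key)
        except (ClientError, EndpointConnectionError):
            time.sleep(random.uniform(0, min(8, 2 ** attempt)))
    # Fall back to the replica; callers must tolerate replication lag.
    return _read(FALLBACK, key)

if __name__ == "__main__":
    print(get_control_state({"pk": "feature-flags"}))
```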
Was this “misusing DNS”?
Not exactly. At hyperscale, DNS is a service‑discovery and traffic‑steering tool because every client, library, and VPC resolver speaks it. The sin here wasn’t using DNS; it was allowing concurrent writers and a janitor to interact without iron‑clad ordering, then letting health‑check‑driven DNS remove capacity too quickly during a network propagation backlog. Those are fixable engineering choices.
Glossary (60‑second refreshers)
- Route 53: AWS’s DNS and health‑check service. Supports weighted/latency/health‑based routing and transactional record changes.
- TTL (time to live): How long a resolver may cache a DNS answer. Short TTLs speed failover and amplify query rate.
- DWFM (Droplet Workflow Manager): EC2 subsystem that manages leases to physical hosts.
- Network Manager: Propagates VPC/network configuration to instances and network appliances.
- Global Tables (DynamoDB): Multi‑Region replication for tables, with eventual consistency and conflict resolution.
Final thoughts
This outage wasn’t “just DNS.” It was DNS + control‑plane recovery + health‑check dynamics + human intervention—a reminder that availability is a system property, not a component feature. The encouraging part is how specific the mitigations are: serialize plan application, pace capacity removal, pressure‑test recovery. Copy those patterns into your own stack, and the next time us‑east‑1 sneezes, your app will need fewer tissues.