Chaos engineering is the practice of deliberately breaking things to learn how the system fails. The goal: find weaknesses before they cause outages, build confidence in your high-availability (HA) setup, and train the team to respond.
The principles
From the Principles of Chaos:
- Build a hypothesis around steady-state behavior
- Vary real-world events (kill pods, drop network, etc.)
- Run experiments in production (or staging that mirrors production)
- Automate experiments to run continuously
- Minimize blast radius (start small, expand)
Steady state is the key concept: what does “normal” look like? If you can’t define it, you can’t measure when chaos breaks it.
Hypothesis: "If 1 of 5 web pods is killed, the system stays healthy"
Steady state: 4+ pods running, error rate <0.1%, latency <500ms p99
Experiment: kill 1 pod, observe
Result: matches steady state? → no action needed
        breaks steady state? → fix something, then re-test
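The steady-state definition above can be written down as an executable check. The following is a minimal sketch, assuming you already collect the three numbers from your monitoring; the `Observation` type and `steady_state_violations` function are hypothetical names, and the thresholds are the ones from the example hypothesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    pods_running: int
    error_rate: float       # fraction of requests, 0.001 == 0.1%
    p99_latency_ms: float


def steady_state_violations(obs, min_pods=4, max_error_rate=0.001, max_p99_ms=500.0):
    """Return a list of human-readable violations; an empty list means steady state holds."""
    violations = []
    if obs.pods_running < min_pods:
        violations.append(f"pods_running {obs.pods_running} < {min_pods}")
    if obs.error_rate >= max_error_rate:
        violations.append(f"error_rate {obs.error_rate:.4f} >= {max_error_rate}")
    if obs.p99_latency_ms >= max_p99_ms:
        violations.append(f"p99 {obs.p99_latency_ms}ms >= {max_p99_ms}ms")
    return violations


# One pod killed, system still healthy: no violations.
healthy = steady_state_violations(Observation(4, 0.0005, 320.0))
# Two pods gone and errors rising: two violations, so fix something and re-test.
broken = steady_state_violations(Observation(3, 0.02, 320.0))
```

Returning the list of violations rather than a bare boolean makes the "what broke" write-up easier, because the failed condition is already named.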
The chaos maturity model
| Level | Practice |
|---|---|
| 0 — Manual | Ad-hoc, no automation |
| 1 — Scripted | Bash scripts, run on demand |
| 2 — Scheduled | Cron, runs at fixed times |
| 3 — Tooled | Chaos Mesh, Litmus, Gremlin, etc. |
| 4 — Continuous | Always running, in production |
| 5 — Game days | Quarterly team exercises |
Most teams sit at levels 1-3. Continuous chaos in production is a Netflix-grade practice. Start with game days and tooled experiments.
The experiments
Common k8s chaos experiments, ordered by impact:
Tier 1: Pod-level (start here)
Kill a pod
kubectl delete pod web-1
# or
kubectl exec web-1 -- kill 1
What it tests: Pod restarts, Service endpoint updates, readiness probes. (A direct delete bypasses PDBs; they are only enforced for evictions such as kubectl drain.)
Steady state: Other pods handle the load. No error rate spike.
Hypothesis: “If 1 of 5 web pods is killed, the Service still has 4 backends within 10s, error rate <0.1%.”
Crash a pod (OOM)
# stress memory
kubectl exec web-1 -- sh -c "tail /dev/zero | head -c 1G > /tmp/big"
# or
kubectl exec web-1 -- stress --vm 1 --vm-bytes 1G --timeout 60
What it tests: OOMKill, pod restart, memory limit enforcement.
Steady state: Pod restarts, comes back. Other pods unaffected.
Hypothesis: “If a pod OOMs, it’s restarted, memory limit is enforced, and steady-state latency is preserved.”
Throttle CPU
# use chaos-mesh or stress
kubectl exec web-1 -- stress --cpu 4 --timeout 60
What it tests: CPU limits, throttling behavior, scheduler responses.
Tier 2: Network-level
Drop traffic to a Service
# using chaos-mesh
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: drop-traffic
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - my-app
    labelSelectors:
      app: web
  direction: to
  duration: 5m
EOF
What it tests: NetworkPolicy, mTLS, retry logic, circuit breakers.
Steady state: App degrades gracefully, errors have a clear cause, no silent failure.
Add latency
# chaos-mesh
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: add-latency
spec:
  action: delay
  mode: all
  delay:
    latency: 500ms
    correlation: "100"
    jitter: 50ms
  selector:
    namespaces: [my-app]
  duration: 5m
What it tests: Timeouts, retries, slow downstream handling.
Steady state: App has timeouts. Doesn’t pile up requests. Returns errors instead of hanging.
DNS failure
# chaos-mesh (DNS faults use the DNSChaos kind and need the Chaos Mesh DNS server enabled)
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: dns-failure
spec:
  action: error
  mode: all
  # patterns omitted: per the Chaos Mesh docs, the fault then applies to all domains
  selector:
    namespaces: [my-app]
    labelSelectors:
      app: web
  duration: 5m
What it tests: DNS retry, caching, fallback to IPs.
Tier 3: Node-level
Drain a node
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data
What it tests: PDBs, graceful shutdown, pod rescheduling, anti-affinity.
Steady state: Pods reschedule to other nodes. No data loss. Brief latency increase.
Hypothesis: “If a node is drained, pods move to other nodes within 2 minutes, no pods are stuck terminating.”
Kill a kubelet
ssh node-1
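sudo systemctl stop kubelet
What it tests: Node controller timeout, PLEG, pod rescheduling.
Steady state: Node marked NotReady within about a minute (the default node-monitor-grace-period is under a minute). Pods are evicted and rescheduled roughly 5 minutes after that (the default not-ready toleration is 300s).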
Network partition a node
# using iptables on the node
sudo iptables -A INPUT -s <other-nodes> -j DROP
# or using chaos-mesh NetworkChaos with action: partition
What it tests: Network resilience, what happens when a node can’t reach the apiserver.
Tier 4: Cluster-level
Kill the apiserver
The hardest one. Disrupts all cluster operations.
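# on a control-plane node (on kubeadm clusters the apiserver is a static pod, not a systemd unit;
# move its manifest out of /etc/kubernetes/manifests instead)
sudo systemctl stop kube-apiserver
What it tests: Existing pods keep running. New pods can’t be scheduled.
Steady state: Existing pods continue serving. Anything that needs the API stalls until the apiserver returns: scheduling, scaling, endpoint updates, kubectl. Node status cannot be written either, so nodes are not marked NotReady while the apiserver is down.
⚠️ Don’t do this in production until you have rehearsed it in a controlled test environment.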
etcd failure
# on an etcd node
sudo systemctl stop etcd
What it tests: etcd quorum loss, apiserver behavior, data consistency.
Steady state: With 3 etcd nodes, losing 1 keeps quorum. With 5, losing 2 still works.
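The quorum arithmetic behind that statement is short enough to write out. This is a sketch of the standard majority rule (quorum is more than half the members); the function names are illustrative.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority remains."""
    return members - quorum(members)


for n in (1, 3, 4, 5):
    print(f"{n} members: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Note that 4 members tolerate only 1 failure, the same as 3, which is why etcd clusters are usually sized with an odd member count.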
Tier 5: Application-level
Corrupt the database
Drop a table, kill a transaction, replicate a bad row.
What it tests: App’s error handling, data validation, recovery.
Hypothesis: “If the database returns an error, the app returns 5xx to the user, doesn’t crash, recovers when DB is back.”
Slow down the database
Add latency to DB queries.
What it tests: App’s DB timeouts, connection pool behavior, query timeouts.
Steady state: App times out DB calls, doesn’t pile up requests, returns errors fast.
The tools
Chaos Mesh
CNCF project. Most common in k8s.
# install
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh --create-namespace
# a stresschaos experiment
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: pod-cpu-stress
  namespace: chaos-mesh
spec:
  mode: all
  selector:
    namespaces: [my-app]
    labelSelectors:
      app: web
  stressors:
    cpu:
      workers: 2
      load: 80
  duration: 5m
EOF
Capabilities: pod failure, network partition, latency, DNS, IO, time skew, stress (CPU/memory/IO), kernel-level (kill, panic).
Litmus
Another CNCF chaos engineering tool.
# install
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v3.0.0.yaml
# or via Helm
# run an experiment
# quote the URL so the shell does not try to glob the "?"
kubectl apply -f "https://hub.litmuschaos.io/api/chaos/3.0.0?file=charts/generic/pod_delete/experiment.yaml"
Capabilities: pre-built experiments, observability, chaos workflow.
Gremlin (commercial)
Hosted chaos engineering platform. More polished UX.
# install the Gremlin agent
helm install gremlin gremlin/gremlin \
  --set gremlin.secret.managed=true \
  --set gremlin.teamID=<your-team-id>
Capabilities: all of the above, plus stateful attacks, network conditions, host-level.
Litmus vs Chaos Mesh vs Gremlin
| Tool | Cost | Pre-built exp | UI | Production use |
|---|---|---|---|---|
| Chaos Mesh | Free | Many | Web UI | Yes |
| Litmus | Free | Many | Web UI | Yes |
| Gremlin | $$$ | Many | Web UI | Yes |
| Steadybit | $$$ | Many | Web UI | Yes |
For most teams, Chaos Mesh is the right balance. Free, comprehensive, well-maintained.
Game days
A game day is a planned chaos exercise. The team gathers, runs experiments, observes, and learns.
Run quarterly. Set aside 4 hours. Have an agenda:
- Hypothesis presentation — “we believe X” (15 min)
- Baseline measurement — confirm steady state (15 min)
- Experiments — run 3-5 chaos scenarios (90 min)
- Findings — what broke, what surprised us (45 min)
- Action items — fix the things we found (30 min)
- Post-mortem — what did we learn (15 min)
Run in staging first if you have one. Production if you don’t, with a controlled blast radius.
Have a “stop” button. The person running the experiment should have the ability to abort everything if it goes wrong.
Continuous chaos
After a few game days, automate the experiments. Chaos experiments should run continuously in production, with safe defaults.
# example: a scheduled experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: weekly-pod-kill
spec:
  schedule: "0 14 * * 1" # every Monday 2pm
  type: PodChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces: [staging]
      labelSelectors:
        app: web
Steady state monitoring:
# Prometheus alert
- alert: ChaosExperimentBrokeSteadyState
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[5m])) /
      sum(rate(http_requests_total[5m]))
    ) > 0.01
  for: 5m
  annotations:
    summary: "Error rate spiked during chaos experiment"
If the experiment breaks the steady state, the alert fires and the experiment should be aborted, either by the operator or automatically by the tooling.
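The abort decision follows the same logic as the alert: compute the 5xx ratio and act only when it stays above the threshold for a sustained period. A minimal sketch, assuming one ratio sample per minute and treating "no traffic" as "no errors" (both are assumptions, and `error_ratio` and `should_abort` are hypothetical helpers, not part of any chaos tool):

```python
def error_ratio(errors_5xx: float, total: float) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic (an assumption)."""
    return 0.0 if total <= 0 else errors_5xx / total


def should_abort(ratios, threshold=0.01, sustained=5):
    """True when the last `sustained` samples all exceed the threshold.

    With one sample per minute, sustained=5 roughly mirrors `for: 5m` in the alert.
    """
    if len(ratios) < sustained:
        return False
    return all(r > threshold for r in ratios[-sustained:])


# A single bad minute in the middle does not trigger an abort.
blip = [0.02, 0.02, 0.0, 0.02, 0.02]
# Five consecutive bad minutes do.
sustained_failure = [0.0, 0.02, 0.02, 0.03, 0.02, 0.05]
```

Requiring a sustained breach avoids aborting on a one-sample blip, at the cost of letting a real problem run for a few minutes; shorten `sustained` for experiments with a larger blast radius.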
The blast radius
Start small. Expand as you gain confidence.
| Phase | Experiment | Blast radius |
|---|---|---|
| 1 | Kill 1 pod of non-critical service | 1 pod, 1 service |
| 2 | Kill 1 pod of critical service | 1 pod, but monitoring on it |
| 3 | Drain 1 node | 1 node, but only in staging |
| 4 | Network partition between 2 services | 2 services, in staging |
| 5 | Kill all pods in a Deployment (one at a time) | 1 Deployment, in staging |
| 6 | Drain a node in production | 1 node, with monitoring |
| 7 | Zone failure simulation | 1 zone, with traffic shifting |
| 8 | Region failure simulation | 1 region, with DR invocation |
Always have a rollback plan. If the experiment goes sideways, how do you recover in 5 minutes?
Steady state hypotheses
Some good starting hypotheses:
| Experiment | Hypothesis |
|---|---|
| Kill 1 pod | “Replicas reduce to 4, no error rate spike, no data loss.” |
| Kill all pods of a service | “Service comes back within 2 min, error rate spike <5%, no cascading failures.” |
| Add 500ms latency between services | “P99 latency increases by 500ms, error rate stays the same, timeouts trigger correctly.” |
| Drain a node | “Pods reschedule to other nodes within 5 min, no PDB violations.” |
| Network partition between 2 services | “Circuit breaker triggers, calls fail fast, system recovers when partition heals.” |
| OOMKill a pod | “Pod restarts with the same memory limit, OOM kill is recorded, no cascading OOMs.” |
| CPU stress to 100% | “Pod is throttled, no others affected (CPU limits work), throttling visible in metrics.” |
Each hypothesis should be specific and measurable.
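"Specific and measurable" means each hypothesis can be turned into a comparison against numbers. As an illustration, here is the latency hypothesis from the table as a check. The tolerances are assumptions you would tune for your own service, and the function name is hypothetical.

```python
def latency_hypothesis_holds(
    baseline_p99_ms: float,
    chaos_p99_ms: float,
    baseline_error_rate: float,
    chaos_error_rate: float,
    injected_ms: float = 500.0,
    latency_tolerance_ms: float = 50.0,   # assumed acceptable deviation
    error_tolerance: float = 0.0005,      # assumed acceptable drift (0.05 percentage points)
) -> bool:
    """P99 should rise by about the injected latency while the error rate stays flat."""
    added = chaos_p99_ms - baseline_p99_ms
    latency_ok = abs(added - injected_ms) <= latency_tolerance_ms
    errors_ok = abs(chaos_error_rate - baseline_error_rate) <= error_tolerance
    return latency_ok and errors_ok


# P99 went from 120ms to 630ms and errors barely moved: hypothesis holds.
holds = latency_hypothesis_holds(120.0, 630.0, 0.0005, 0.0006)
# P99 went to 2100ms: requests are piling up, hypothesis is falsified.
piled_up = latency_hypothesis_holds(120.0, 2100.0, 0.0005, 0.0006)
```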
The “is this chaos or just a real outage?” question
In production, chaos experiments blend with real failures. Three ways to tell:
- Mark experiments clearly. Use labels, tags, or chaos-specific namespaces.
- Log experiments in the chaos platform. Gremlin, Chaos Mesh, etc. all log experiments.
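- Run experiments in pairs. One person runs the experiment while a second watches the dashboards and the experiment log. If the experiment broke something, the team knows immediately; if something else broke, the log shows no experiment was running.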
Don’t hide chaos experiments. If you’re running them, the team should know.
Common gotchas
- Chaos in production requires buy-in. SRE, engineering, leadership. Don’t surprise people.
- Steady state must be measurable. If you can’t measure it, chaos is theater.
- The first experiment always reveals something. Be ready to learn and fix.
- Don’t run experiments on critical services without monitoring. Always have observability before breaking things.
- Network partitions are hard to clean up. Some chaos tools leave iptables rules behind. Verify the cleanup.
- The “stop” button is critical. Have someone who can abort all experiments in seconds.
- Game days require prep. Walk through the experiments mentally before running them.
- Some experiments require permissions you don’t have. Killing kubelet, restarting etcd — you might not have SSH access in production. Use chaos tools that work via the k8s API.
- Chaos Mesh / Litmus / Gremlin have their own blast radius. Their pods have permissions to do harmful things. Restrict with RBAC.
- Continuous chaos in production is advanced. Start with game days, automate slowly.
A worked example
Goal: Validate that the checkout service can survive a downstream (payment gateway) latency spike.
Hypothesis: “If the payment gateway returns in 2s instead of 100ms, the checkout service times out after 1s, returns a 503 to the user, and recovers when the latency is removed.”
Setup:
- Define steady state: P99 latency <500ms, error rate <0.1%, no failed transactions
- Install chaos-mesh in staging
- Run NetworkChaos to add 2s latency to payment-gateway traffic
- Run for 10 minutes
- Monitor: latency, error rate, transaction success rate
- Compare against steady state
Run the experiment:
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-latency
  namespace: chaos-mesh
spec:
  action: delay
  mode: all
  delay:
    latency: 2s
    correlation: "100"
    jitter: 0ms
  selector:
    namespaces: [staging]
    labelSelectors:
      app: payment-gateway
  direction: to
  duration: 10m
Observe:
T+0min: Latency 100ms, errors 0.05%
T+2min: Latency 200ms, errors 0.08% (system starting to feel pressure)
T+5min: Latency 800ms, errors 0.5% (some timeouts triggering)
T+8min: Latency 1100ms, errors 2% (circuit breaker should trigger)
T+10min: Chaos ends
T+12min: Latency 100ms, errors 0.05% (recovered)
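The comparison against steady state can be done mechanically on the observations above. A small sketch using the timeline from this example (the T+10 row carries no measurements, so it is left out); the helper names are illustrative.

```python
STEADY_P99_MS = 500.0      # steady state: P99 latency < 500ms
STEADY_ERROR_PCT = 0.1     # steady state: error rate < 0.1%
CHAOS_END_MIN = 10

# (minute, latency in ms, error rate in percent), copied from the observations
timeline = [
    (0, 100.0, 0.05),
    (2, 200.0, 0.08),
    (5, 800.0, 0.5),
    (8, 1100.0, 2.0),
    (12, 100.0, 0.05),
]


def in_steady_state(latency_ms: float, error_pct: float) -> bool:
    return latency_ms < STEADY_P99_MS and error_pct < STEADY_ERROR_PCT


def recovery_minutes(samples, chaos_end_min):
    """Minutes from the end of chaos to the first sample back in steady state, or None."""
    for minute, latency_ms, error_pct in samples:
        if minute >= chaos_end_min and in_steady_state(latency_ms, error_pct):
            return minute - chaos_end_min
    return None


breaches = [m for m, lat, err in timeline if not in_steady_state(lat, err)]
recovery = recovery_minutes(timeline, CHAOS_END_MIN)
print(f"steady state broken at minutes {breaches}, recovered {recovery} min after chaos ended")
```

Steady state is broken at T+5 and T+8, and the first healthy sample arrives 2 minutes after the chaos ends.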
Findings:
- ✅ Timeouts trigger correctly
- ✅ Circuit breaker engaged at 1s (config working)
- ✅ Recovery is fast (within 2 min of chaos ending)
- ❌ Some users saw 503s during the experiment (acceptable for partial outage, but not great)
- ❌ No graceful degradation — users had to retry
Actions:
- Improve user-facing error message (“Payment processing slow, please try again”)
- Add a circuit breaker at the frontend (BFF pattern)
- Increase timeout to 1.5s for better tolerance
Re-test in 2 weeks to validate fixes.
The “blast radius” for chaos
Always start small. A chaos experiment that takes down a critical service is not a learning experience — it’s an outage.
| Blast radius | When to use |
|---|---|
| 1 pod, non-critical | First day of chaos |
| 1 pod, critical | After 5+ non-critical experiments |
| 1 node, dev | After 10+ pod experiments |
| 1 node, prod | After 5+ node experiments in dev |
| All pods of a service | After 20+ pod experiments |
| Network partition | After node experiments |
| Zone failure | Game day, multiple teams observing |
| Region failure | Once a year, full team |
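The numeric thresholds in the table can be enforced as a gate in whatever tooling launches experiments. A sketch under the assumption that you track three counters; `allowed_blast_radii` and its parameters are hypothetical, and the rows without a numeric threshold (network partition, zone, region) are left to human judgment.

```python
def allowed_blast_radii(noncritical_pod_runs: int, pod_runs: int, dev_node_runs: int):
    """Return the blast radii the team has earned, per the thresholds in the table."""
    allowed = ["1 pod, non-critical"]          # always allowed: first day of chaos
    if noncritical_pod_runs >= 5:
        allowed.append("1 pod, critical")
    if pod_runs >= 10:
        allowed.append("1 node, dev")
    if dev_node_runs >= 5:
        allowed.append("1 node, prod")
    if pod_runs >= 20:
        allowed.append("All pods of a service")
    return allowed


day_one = allowed_blast_radii(0, 0, 0)
after_a_month = allowed_blast_radii(noncritical_pod_runs=6, pod_runs=12, dev_node_runs=1)
```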
The chaos experiment template
Every chaos experiment should have:
# a ChaosExperiment manifest
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: web-pod-kill
  namespace: chaos-mesh
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces: [my-app]
    labelSelectors:
      app: web
  duration: 1m
# what to verify
# - service stays healthy (4+ backends)
# - error rate <0.1%
# - no data loss
Hypothesis (write this down before running):
“If 1 of 5 web pods is killed, the Service will have 4 backends within 10s. Error rate will briefly spike to <1%, returning to <0.1% within 30s. No requests will be lost (load balancer retries to healthy backends).”
Run the experiment. Compare against hypothesis. Document findings.
Steady state metrics
The metrics you watch during chaos:
- Pod restarts (kube_pod_container_status_restarts_total)
- Endpoint availability (kube_endpoint_address_available)
- Service backend count (custom, scrape from the Service)
- HTTP error rate (5xx, 4xx, by status code)
- Request latency (p50, p95, p99)
- Saturation (CPU, memory, network, IO)
- Custom business metrics (orders/sec, queue depth)
# Prometheus alerts for chaos experiments
- alert: ErrorRateDuringChaos
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[1m])) /
    sum(rate(http_requests_total[1m]))
    > 0.05
  for: 1m
  annotations:
    summary: "Error rate above 5% during chaos experiment"
- alert: LatencyDuringChaos
  expr: |
    # aggregate buckets across pods before taking the quantile
    histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[1m])))
    > 1.0
  for: 1m
  annotations:
    summary: "P99 latency above 1s during chaos"
If any of these fire during chaos, you found a problem.
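It helps to know how the P99 in the latency alert is derived, because the result is an estimate bounded by your bucket layout. The sketch below is a simplified version of the linear interpolation that `histogram_quantile` is documented to perform; it ignores several edge cases the real implementation handles, and the bucket values are made up for illustration.

```python
import math


def histogram_quantile(q: float, buckets):
    """Estimate a quantile from cumulative buckets.

    `buckets` is a list of (upper_bound_seconds, cumulative_count) sorted by bound,
    ending with the +Inf bucket. Assumes 0 < q <= 1.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical bucket counts: 900 requests <= 0.1s, 980 <= 0.5s, 995 <= 1s, 1000 total.
example = [(0.1, 900), (0.5, 980), (1.0, 995), (math.inf, 1000)]
p99 = histogram_quantile(0.99, example)
```

If the quantile lands in the `+Inf` bucket, the estimate is capped at the largest finite bound, so an alert threshold of 1s only works when there are buckets above 1s.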
Common chaos scenarios (organized by attack vector)
Compute
- Pod kill — the owning controller (ReplicaSet, StatefulSet) recreates the pod
- Pod failure (OOM, segfault) — test memory limits
- CPU stress — test CPU limits and QoS
- Memory stress — test OOM handling
- IO stress — test disk pressure handling
- Process kill (specific PIDs) — test error handling
Network
- Network partition — split brain scenarios
- Latency — slow downstream
- Packet loss — degraded network
- DNS failure — name resolution issues
- Bandwidth limit — saturated network
- Corrupt packets — bad network paths
Storage
- Disk fill — full disk
- IO error injection — bad blocks
- PVC delay — slow attach
- Snapshot failure — backup issues
Time
- Clock skew — NTP issues, certificate failures
- Time travel — leap seconds, daylight saving
State
- Database crash — failover testing
- Cache eviction — cold cache
- Connection pool exhaustion — DB connection issues
- Queue depth spike — sudden load
External
- Cloud API failure — IAM, EC2, S3 unreachable
- DNS provider down — public DNS issues
- Registry unavailable — can’t pull images
Chaos as code
Write chaos experiments as code, version controlled:
chaos/
├── README.md
├── pod-level/
│   ├── kill-pod.yaml
│   ├── oom-pod.yaml
│   └── cpu-stress.yaml
├── network/
│   ├── partition.yaml
│   ├── latency.yaml
│   └── packet-loss.yaml
├── storage/
│   ├── disk-fill.yaml
│   └── io-error.yaml
└── advanced/
    ├── zone-failure.yaml
    └── region-failure.yaml
Each file is a Chaos Mesh (or Litmus) experiment. Reuse, version, share.
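Once experiments live in a repository, a pre-merge check can catch manifests that forget to scope themselves. The following is a deliberately crude sketch: it does a text scan rather than parsing YAML (to stay within the standard library), and the required keys are an assumption about what your team wants every experiment to declare.

```python
from pathlib import Path

# Keys every experiment file is expected to contain (an assumption, adjust to taste).
REQUIRED_KEYS = ("kind:", "selector:", "namespaces:")


def missing_keys(manifest_text: str):
    """Return the required keys that no line of the manifest starts with."""
    lines = [line.strip() for line in manifest_text.splitlines()]
    return [key for key in REQUIRED_KEYS if not any(line.startswith(key) for line in lines)]


def lint_chaos_dir(root):
    """Map each offending *.yaml file (path relative to root) to its missing keys."""
    problems = {}
    for path in sorted(Path(root).rglob("*.yaml")):
        missing = missing_keys(path.read_text())
        if missing:
            problems[str(path.relative_to(root))] = missing
    return problems


if __name__ == "__main__":
    for name, missing in lint_chaos_dir("chaos").items():
        print(f"{name}: missing {', '.join(missing)}")
```

A real check would parse the YAML and validate against the tool's CRD schema; the point here is only that an unscoped experiment should fail review before it can run.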
The 5 phases of a chaos program
Phase 1: Foundations (months 1-2)
- Set up the chaos tool (Chaos Mesh, Litmus, Gremlin)
- Run experiments in dev only
- Document findings
- Build the team’s confidence
Phase 2: Game days (months 3-4)
- Quarterly game days
- Multiple teams involved
- Documented runbooks
- Tabletop exercises
Phase 3: Scheduled experiments (months 5-6)
- Move experiments to cron
- Run in staging weekly
- Production experiments manually approved
Phase 4: Continuous chaos (months 7-9)
- Experiments run automatically in production
- Auto-abort on steady-state violations
- Real-time observability
Phase 5: Chaos-driven development (months 10+)
- Every new feature gets a chaos test
- Pre-merge validation
- Production-readiness reviews include chaos
The “what broke” report
After every experiment, write up what you learned:
# Experiment: Kill 1 web pod
Date: 2024-01-15
Hypothesis: Service stays healthy with 4 backends
## What we observed
- 4 backends within 8s (expected 10s) ✓
- Error rate: 0.5% spike for 5s, returned to 0.05% (expected <0.1%) ✓
- No data loss ✓
## What surprised us
- Readiness probe was 30s, not the 10s we thought
- PDB was effectively disabled (minAvailable left at 0 from a previous test)
- HPA was at minReplicas=2, but only because we set it manually
## Action items
- [ ] Fix readiness probe interval (target: 10s)
- [ ] Set PDB minAvailable=2 (was 0)
- [ ] Verify HPA minReplicas in production manifest
This is the most valuable output of chaos engineering. The findings drive improvements.
The 3am chaos experiment
Once you’re running continuous chaos, you’ll have an experiment fire at 3am. Make sure:
- The experiment is labeled clearly as chaos (not real outage)
- The team knows how to distinguish chaos from a real outage
- The auto-abort works (or the on-call knows to abort)
- The on-call rotation is aware of the chaos schedule
Surprise 3am experiments are bad. Communicate clearly.
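One way to avoid the surprise is to review schedules before they are merged. This sketch flags cron expressions that can fire outside staffed hours. It only understands a plain integer in the hour field and conservatively flags anything else, and the 09:00-17:00 window is an assumption.

```python
def fires_outside_hours(cron_expr: str, start_hour: int = 9, end_hour: int = 17) -> bool:
    """True if the schedule may fire outside [start_hour, end_hour)."""
    fields = cron_expr.split()
    if len(fields) != 5:
        raise ValueError("expected a 5-field cron expression")
    hour_field = fields[1]
    if not hour_field.isdigit():
        # Wildcards, ranges, steps and lists are not parsed here; flag them for human review.
        return True
    return not (start_hour <= int(hour_field) < end_hour)


monday_afternoon = fires_outside_hours("0 14 * * 1")   # the weekly-pod-kill schedule
three_am = fires_outside_hours("0 3 * * *")
every_six_hours = fires_outside_hours("0 */6 * * *")
```

The check ignores time zones: the hour is interpreted in whatever zone the scheduler uses, so confirm that matches the on-call team's clock.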
See also
- high-availability — what to test
- disaster-recovery — broader failure modes
- node-not-ready — real-world failure
- Principles of Chaos