High Availability
“https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/ha-topology/”
HA in Kubernetes means no single point of failure in the control plane and redundancy for workloads. The control plane runs on multiple nodes; workloads rely on replicas, PodDisruptionBudgets, and spreading across failure domains.
Control plane HA
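The kube-apiserver is the piece that must not go down. The scheduler and controller-manager each run one active instance behind leader election, so losing an instance does not break workloads. Without an apiserver there is no API: Pods that are already running keep running, but nothing can be scheduled, scaled, or repaired.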
┌─────────────────────────────────────────┐
│              load balancer              │
│   (cloud LB or keepalived + haproxy)    │
└─────┬──────────────┬──────────────┬─────┘
      │              │              │
┌─────▼─────┐  ┌─────▼─────┐  ┌─────▼─────┐
│   kube-   │  │   kube-   │  │   kube-   │
│ apiserver │  │ apiserver │  │ apiserver │
│ (node 1)  │  │ (node 2)  │  │ (node 3)  │
└─────┬─────┘  └─────┬─────┘  └─────┬─────┘
      │              │              │
      ▼              ▼              ▼
┌─────────────────────────────────────────┐
│                  etcd                   │
│          (3 or 5 node cluster)          │
│         ┌─────┐ ┌─────┐ ┌─────┐         │
│         │  0  │ │  1  │ │  2  │         │
│         └─────┘ └─────┘ └─────┘         │
└─────────────────────────────────────────┘
Two HA topologies
Stacked etcd (easier, more common):
- etcd runs on the same nodes as the control plane
- Fewer machines, but correlated failures — if a control plane node goes down, you lose both an apiserver and an etcd member
- Suitable for: 3-node control planes (this is the topology kubeadm sets up by default)
External etcd (more resilient):
- etcd runs on separate nodes from the control plane
- More machines, but independent failure domains
- Suitable for: 5+ node control planes, ultra-high-availability requirements
What the load balancer does
The LB sits in front of the apiserver instances:
# how additional control plane nodes join the cluster, through the LB address
kubeadm join lb.example.com:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane \
  --certificate-key <key>
# the LB health check against each apiserver:
#   https://<apiserver>:6443/healthz
# If an apiserver fails the health check, the LB stops routing to it

The LB must:
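- Pass TLS through at the TCP layer (the usual setup; the apiserver certificates must then list the LB address in their SANs). Terminating TLS at the LB breaks client-certificate authentication
- Health-check each apiserver on port 6443 (a plain TCP check, or HTTPS against /healthz)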
- Distribute traffic to all healthy apiservers
- Support long-lived connections (apiserver watches are long-lived HTTP/2)
In cloud environments, use the cloud’s managed LB (AWS NLB, GCP TCP LB, Azure LB). On-prem, use keepalived + haproxy or a hardware load balancer.
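The TCP health check in the list above can be sketched in a few lines. This is a minimal illustration, not how any particular load balancer implements it; the function names `tcp_healthy` and `healthy_backends` and the 2-second timeout are assumptions made for the example.

```python
import socket


def tcp_healthy(host: str, port: int = 6443, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def healthy_backends(backends: list[str], port: int = 6443) -> list[str]:
    """Keep only the apiserver addresses that currently accept connections."""
    return [host for host in backends if tcp_healthy(host, port)]
```

A plain TCP check only proves the port is open. An HTTPS check against /healthz also proves the apiserver is answering requests, at the cost of the LB having to speak TLS to the backend.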
etcd quorum
etcd requires a quorum to work: a majority of members must be available.
| etcd members | Quorum | Tolerates failures |
|---|---|---|
| 1 | 1 | 0 |
| 3 | 2 | 1 |
| 5 | 3 | 2 |
| 7 | 4 | 3 |
With 3 members, you can lose 1. With 5, you can lose 2. Beyond 5, adding more members makes cluster changes slower without meaningfully improving availability.
Always use 3 or 5 etcd members. Never 2 (you can’t tolerate any failure with 2 — losing one member leaves you with 1 of 2, which is not a majority).
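The quorum table follows from simple majority arithmetic. As a small sketch (the function names are made up for illustration):

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority is still available."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5, 7):
        print(f"{n} members: quorum {quorum(n)}, tolerates {tolerated_failures(n)}")
```

Even member counts buy nothing: 2 members tolerate as many failures as 1, and 4 tolerate as many as 3, which is why the recommended sizes are odd.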
Workload HA
HA for workloads means no single point of failure at the application layer.
PodDisruptionBudgets
A PDB limits how many Pods of a given app can be down simultaneously due to voluntary disruption:
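apiVersion: policy/v1
kind: PodDisruptionBudget
metadata: { name: web-pdb }
spec:
  minAvailable: 2       # always keep at least 2 Pods
  # OR (a PDB accepts only one of the two fields):
  # maxUnavailable: 1   # never have more than 1 Pod down
  selector:
    matchLabels:
      app: web

During a kubectl drain, evictions respect the PDB. With minAvailable: 2 and 3 replicas, only 1 Pod is evicted at a time. A Deployment rollout does not go through the eviction API, so it is bounded by the Deployment's own maxUnavailable and maxSurge settings rather than by the PDB.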
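The 3-replica example can be checked with a little arithmetic. The sketch below assumes integer budgets only (percentages are not handled), and the function names are hypothetical; it mirrors the idea that the budget is whatever is left between the healthy Pod count and the floor the PDB sets.

```python
def allowed_by_min_available(healthy: int, min_available: int) -> int:
    """Evictions permitted right now by a PDB that sets minAvailable."""
    return max(0, healthy - min_available)


def allowed_by_max_unavailable(desired: int, healthy: int, max_unavailable: int) -> int:
    """Evictions permitted right now by a PDB that sets maxUnavailable."""
    currently_unavailable = desired - healthy
    return max(0, max_unavailable - currently_unavailable)


if __name__ == "__main__":
    # 3 healthy replicas, minAvailable: 2 -> one eviction at a time
    print(allowed_by_min_available(healthy=3, min_available=2))
    # after that eviction, nothing more is allowed until a Pod comes back
    print(allowed_by_min_available(healthy=2, min_available=2))
```

Note that a budget of 0 blocks a drain entirely, which is what happens with minAvailable equal to the replica count.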
Replicas
The simplest HA: run multiple replicas.
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # zero downtime
      maxSurge: 1         # one extra during rollout

With maxUnavailable: 0, a rollout never drops below 3 available Pods: each new Pod must become Ready before an old one is removed. This guarantee covers rollouts only; if a node failure takes a Pod away, you run on 2 until a replacement starts.
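To see what a given strategy allows, it can help to compute the bounds a rolling update stays within. This is a sketch with integer settings only; `rollout_bounds` is a name invented for the example.

```python
def rollout_bounds(replicas: int, max_unavailable: int, max_surge: int) -> tuple[int, int]:
    """Return (minimum available Pods, maximum total Pods) during a rolling update."""
    if max_unavailable == 0 and max_surge == 0:
        # With both at 0 the rollout could never make progress.
        raise ValueError("maxUnavailable and maxSurge cannot both be 0")
    return max(0, replicas - max_unavailable), replicas + max_surge


if __name__ == "__main__":
    print(rollout_bounds(replicas=3, max_unavailable=0, max_surge=1))
    print(rollout_bounds(replicas=1, max_unavailable=1, max_surge=0))
```

The second call shows why a single-replica Deployment with maxUnavailable: 1 is risky: the lower bound is 0 available Pods.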
Multi-zone/node spread
Spread Pods across failure domains so a single zone or node failure doesn’t take down the app.
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web

This says: spread across zones (maxSkew: 1 means no zone may hold more than 1 Pod above the least-loaded zone), and also across nodes. If a zone goes down, only Pods in that zone are affected.
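The maxSkew rule is easier to reason about with numbers. The sketch below is a simplification of the DoNotSchedule check: it assumes every eligible domain appears in the input (including those with zero matching Pods) and ignores refinements such as minDomains and node affinity. The name `can_place` is hypothetical.

```python
def can_place(pods_per_domain: dict[str, int], domain: str, max_skew: int = 1) -> bool:
    """Would one more matching Pod in `domain` keep the skew within max_skew?

    Skew is measured as the Pod count in the candidate domain after placement
    minus the count in the least-loaded domain.
    """
    global_min = min(pods_per_domain.values())
    return pods_per_domain[domain] + 1 - global_min <= max_skew


if __name__ == "__main__":
    zones = {"zone-a": 2, "zone-b": 1, "zone-c": 1}
    for zone in zones:
        print(zone, can_place(zones, zone))
```

With 2/1/1 Pods across three zones, the next Pod may go to zone-b or zone-c but not to zone-a, because that would put zone-a 2 Pods above the minimum.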
Anti-affinity
Keep Pods of the same app away from each other (podAffinity, the counterpart, co-locates them):
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: web
          topologyKey: kubernetes.io/hostname

This prefers, but does not require, scheduling Pods on different nodes. For a hard requirement, use requiredDuringSchedulingIgnoredDuringExecution.
Node HA
Node lifecycle management
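If a node is unresponsive:
- The node controller marks it NotReady after node-monitor-grace-period (default 40s)
- Pods on that node are evicted after pod-eviction-timeout (default 5m; in current releases the delay comes from the default 300s toleration for the not-ready taint rather than from this flag)
- New Pods are not scheduled to it
The window between node failure and Pods being rescheduled is an involuntary disruption. PDBs do not apply to it: the Pods on the dead node serve no traffic, and the app runs degraded on its remaining replicas until replacements start on healthy nodes.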
cordon / drain / delete
# cordon: stop scheduling new Pods to this node
kubectl cordon node-3
# drain: cordon the node, then evict all Pods (respecting PDBs)
kubectl drain node-3 --ignore-daemonsets --delete-emptydir-data
# delete: remove the node object from the API
# (the node is already gone, just clean up)
kubectl delete node node-3

For routine maintenance (kernel update, kubelet restart), cordon + drain is the right sequence; run kubectl uncordon node-3 afterwards to make the node schedulable again. For a dead node, just delete it.
Taints and tolerations for critical nodes
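# taint a node so only Pods that tolerate the taint schedule there
kubectl taint nodes node-3 dedicated=true:NoSchedule
# Pods need a matching toleration in their spec to land on this node
# (key: dedicated, operator: Equal, value: "true", effect: NoSchedule);
# labeling a namespace does not add one
# remove the taint again (note the trailing "-")
kubectl taint nodes node-3 dedicated=true:NoSchedule-

The node.kubernetes.io/not-ready:NoExecute taint is automatically applied to nodes that are NotReady. Kubernetes adds a default toleration for this taint with tolerationSeconds: 300 to every Pod that does not set its own, so Pods are evicted about 5 minutes after the node goes NotReady.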
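How a toleration is matched against a taint can be sketched as follows. The `Taint` and `Toleration` classes here are simplified stand-ins defined for this example, not the Kubernetes API types, and tolerationSeconds is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key with operator Exists matches every key
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    return tol.operator in ("", "Equal") and tol.value == taint.value


if __name__ == "__main__":
    taint = Taint("dedicated", "true", "NoSchedule")
    print(tolerates(Toleration("dedicated", "Equal", "true", "NoSchedule"), taint))
    print(tolerates(Toleration("dedicated", "Equal", "false", "NoSchedule"), taint))
```

A Pod can be scheduled onto a tainted node only if each of the node's NoSchedule taints is matched by at least one of the Pod's tolerations.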
Application-level HA
The cluster can keep your Pods running, but your app needs to handle:
- Graceful shutdown: catch SIGTERM, stop accepting new connections, finish in-flight requests, exit within terminationGracePeriodSeconds
- Connection draining: if you're behind a Service, the Service stops routing to a Pod that's terminating, but existing connections still need to drain
- Readiness probing: don’t receive traffic until you’re ready (otherwise you get 503s during startup)
- Idempotency: retries are safe (GET is idempotent; POST might not be — design for it)
- Leader election: if you have a single active process (not a replicated app), use leader election so another instance takes over when the leader dies
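The graceful-shutdown item can be sketched as a signal handler that only sets a flag, with the main loop finishing its current request before exiting. This is a minimal outline, not a production server; `serve`, `install_handler`, and the `handle_one_request` callback are hypothetical names.

```python
import signal
import threading

shutting_down = threading.Event()


def handle_sigterm(signum, frame):
    """Signal handler: only flips a flag; the main loop does the real work."""
    shutting_down.set()


def install_handler():
    # Must be called from the main thread of the process.
    signal.signal(signal.SIGTERM, handle_sigterm)


def serve(handle_one_request):
    """Process requests until SIGTERM is seen, then return the number handled.

    The request in progress is always finished before the loop exits.
    """
    handled = 0
    while not shutting_down.is_set():
        handle_one_request()
        handled += 1
    return handled
```

The whole sequence, including any cleanup after `serve` returns, has to fit inside terminationGracePeriodSeconds; otherwise the kubelet sends SIGKILL.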
Managed k8s HA
If you’re on EKS, GKE, or AKS, the control plane is managed for you:
| | EKS | GKE | AKS |
|---|---|---|---|
| Control plane | Managed (multi-AZ) | Managed (multi-AZ) | Managed (single-region) |
| etcd | Managed | Managed | Managed |
| API server | Managed | Managed | Managed |
| Node pools | Your responsibility | Your responsibility | Your responsibility |
| Add-ons | EKS addons | GKE addons | Azure addons |
On managed k8s, the main HA work is:
- Multi-AZ node pools (spread nodes across AZs)
- PodDisruptionBudgets on every production app
- topologySpreadConstraints for zone-level resilience
- Cluster autoscaler so nodes scale with demand
- HPA so Pods scale with demand
What “five 9s” means in practice
99.999% uptime = ~5 minutes of downtime per year. For most applications, this is unrealistic. The realistic targets:
| Target | Downtime/year | What it requires |
|---|---|---|
| 99% | 3.7 days | Basic HA, no single points of failure |
| 99.9% | 8.7 hours | Multi-AZ, PDBs, HPA, decent monitoring |
| 99.99% | 52 minutes | All of the above + runbooks, alerts, fast recovery |
| 99.999% | 5 minutes | Everything + automated failover, chaos engineering, very fast MTTR |
For infrastructure like k8s control planes, 99.99% is achievable. For applications, 99.9% is the typical target.
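The downtime column is straightforward to derive from the availability target. A small sketch, assuming a 365-day year (the function name is made up for the example):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of allowed downtime per year for a given availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}%: {minutes:.1f} minutes ({minutes / 60:.2f} hours)")
```

Each extra nine cuts the budget by a factor of ten, which is why the required practices in the table escalate so quickly.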
Common HA mistakes
- No PodDisruptionBudgets: a kubectl drain can evict every replica on the node at once
- Single-replica Deployments: one node failure takes down the app
- All Pods in one zone: a zone outage takes down the app
- maxUnavailable: 1 on a Deployment with 1 replica: this means 0 Pods during rollout
- No readiness probe: traffic is sent to Pods that aren't ready, causing 503s
- terminationGracePeriodSeconds: 30 with a shutdown that takes 60s: the Pod is killed before it finishes draining
- Not testing chaos: an HA setup that hasn't been tested is not HA. Run chaos experiments (kill nodes, kill Pods, kill API servers).
- etcd on 2 nodes: you can't tolerate any failure with 2 etcd members. Use 3 or 5.
See also
- PodDisruptionBudget: protecting against voluntary disruptions
- Resource Requests & Limits: for proper scheduling
- Deployments: rolling updates
- StatefulSets: for stateful HA