StatefulSets
“https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/”
A StatefulSet is a controller that manages a set of Pods with stable, unique identities and persistent per-replica storage. It’s the right choice for stateful workloads — databases, message queues, key-value stores — where each replica has a role that’s different from the others.
The problem with Deployments for stateful apps
A Deployment’s Pods are interchangeable — they’re all replicas of the same service. When a Pod is replaced, it gets a new random name, a new random IP, and a new random identity.
For most apps, that’s fine. The Service load-balances to whatever Pod is ready; clients don’t care which one they hit.
For stateful apps, it’s not fine:
- A Postgres primary is different from a replica. You don’t load-balance to both; you route writes to the primary, reads to replicas.
- A Kafka broker 0 is different from broker 1. The cluster needs to know which broker is which.
- A ZooKeeper ensemble member is different from another. The leader vs follower role matters.
These apps need:
- Stable, predictable names (postgres-0, postgres-1, postgres-2)
- Stable, predictable network identities (DNS that points to the right Pod)
- Stable, per-replica storage (data that survives Pod replacement)
- Ordered deployment and scaling (start 0, then 1, then 2)
- Ordered termination (kill 2, then 1, then 0)
- Ordered rolling updates (update 2, then 1, then 0, waiting for readiness at each step)
StatefulSets provide all of this. Deployments don’t.
Basic example
apiVersion: apps/v1
kind: StatefulSet
metadata: { name: postgres }
spec:
  serviceName: postgres   # the headless Service that gives the Pods DNS
  replicas: 3
  selector:
    matchLabels: { app: postgres }
  template:
    metadata:
      labels: { app: postgres }
    spec:
      containers:
        - name: postgres
          image: postgres:15
          ports:
            - containerPort: 5432
              name: postgres
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:   # per-replica PVC, created automatically
    - metadata: { name: data }
      spec:
        accessModes: [ReadWriteOnce]
        storageClassName: gp3
        resources:
          requests: { storage: 100Gi }
This creates:
- A StatefulSet postgres with 3 replicas
- 3 Pods: postgres-0, postgres-1, postgres-2
- 3 PVCs: data-postgres-0, data-postgres-1, data-postgres-2
- 3 PVs (dynamically provisioned by the gp3 StorageClass)
The PVCs persist even if the StatefulSet is deleted. The Pods are recreated with the same names and mount the same PVCs.
Stable identity
Each Pod in a StatefulSet has:
- Stable name: postgres-0, postgres-1, postgres-2 (zero-indexed)
- Stable hostname: set to the Pod name (postgres-0)
- Stable DNS: a headless Service (see below) provides DNS records
The Pod name is deterministic — postgres-0 is always Pod 0 in the StatefulSet, no matter how many times it’s recreated.
Headless Service
A StatefulSet needs a headless Service to provide DNS for its Pods. The clusterIP: None setting means the Service has no virtual IP — DNS returns the Pods’ IPs directly.
apiVersion: v1
kind: Service
metadata: { name: postgres }
spec:
  clusterIP: None   # headless
  selector: { app: postgres }
  ports:
    - port: 5432
      name: postgres
The StatefulSet spec says serviceName: postgres. This is the headless Service.
DNS records for the StatefulSet’s Pods:
postgres-0.postgres.default.svc.cluster.local → 10.244.1.5
postgres-1.postgres.default.svc.cluster.local → 10.244.2.7
postgres-2.postgres.default.svc.cluster.local → 10.244.3.9
postgres.default.svc.cluster.local → all 3 IPs (in random order)
Clients can address individual Pods by name (postgres-0.postgres.default.svc.cluster.local) or the headless Service for all of them.
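The per-Pod names follow a fixed pattern, so they can be built without asking the cluster. A minimal sketch, assuming the default cluster domain cluster.local; pod_fqdn is a hypothetical helper, not a kubectl feature:

```shell
# pod_fqdn STATEFULSET ORDINAL SERVICE NAMESPACE [CLUSTER_DOMAIN]
# Prints the DNS name the headless Service publishes for one Pod.
# CLUSTER_DOMAIN defaults to cluster.local (an assumption; clusters can override it).
pod_fqdn() {
  echo "$1-$2.$3.$4.svc.${5:-cluster.local}"
}

pod_fqdn postgres 0 postgres default
# prints postgres-0.postgres.default.svc.cluster.local
```

From inside the same namespace the short form postgres-0.postgres is normally enough, because the Pod’s DNS search path fills in the rest.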
Per-replica persistent storage
volumeClaimTemplates create a PVC per Pod. Each Pod mounts its own PVC.
volumeClaimTemplates:
  - metadata: { name: data }
    spec:
      accessModes: [ReadWriteOnce]
      storageClassName: gp3
      resources: { requests: { storage: 100Gi } }
For 3 replicas, Kubernetes creates 3 PVCs:
- data-postgres-0
- data-postgres-1
- data-postgres-2
Each Pod mounts the PVC with its index. When the Pod is replaced, the new Pod mounts the same PVC, and the data is there.
The PVCs are not deleted when the StatefulSet is deleted (the StatefulSet’s persistentVolumeClaimRetentionPolicy defaults to Retain). You have to delete them manually if you want to free the storage.
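Because PVC names follow the <template>-<statefulset>-<ordinal> pattern, the full list can be generated when it is time to clean up. A small sketch; pvc_names is a hypothetical helper introduced here for illustration:

```shell
# pvc_names TEMPLATE STATEFULSET REPLICAS
# Prints the PVC names created from one volumeClaimTemplate, one per line.
# Assumes ordinals start at 0 (the default).
pvc_names() {
  template=$1
  sts=$2
  replicas=$3
  i=0
  while [ "$i" -lt "$replicas" ]; do
    echo "${template}-${sts}-${i}"
    i=$((i + 1))
  done
}

pvc_names data postgres 3
```

When you really do want the storage gone, the output can be piped to kubectl, for example pvc_names data postgres 3 | xargs kubectl delete pvc.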
Ordered operations
StatefulSets have strict ordering for several operations:
Scaling up
When scaling from 3 to 5 replicas:
- postgres-3 is created
- Wait for postgres-3 to be Ready
- postgres-4 is created
- Wait for postgres-4 to be Ready
postgres-3 must be Ready before postgres-4 starts. If postgres-3 fails, postgres-4 is never created.
Scaling down
When scaling from 5 to 3:
- postgres-4 is terminated (gracefully)
- Wait for postgres-4 to be fully terminated
- postgres-3 is terminated
- Wait for postgres-3 to be fully terminated
postgres-4 is killed before postgres-3. The reverse of scale-up.
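The two scaling orders can be written down as a tiny model. This is only an illustration of the ordering under the default OrderedReady policy, not how the controller is implemented; scale_order is a hypothetical name:

```shell
# scale_order NAME FROM TO
# Prints the Pods touched when scaling from FROM to TO replicas, in the order
# the default OrderedReady policy handles them.
scale_order() {
  name=$1
  from=$2
  to=$3
  if [ "$to" -gt "$from" ]; then
    # scale up: lowest new ordinal first
    i=$from
    while [ "$i" -lt "$to" ]; do
      echo "create ${name}-${i}"
      i=$((i + 1))
    done
  else
    # scale down: highest ordinal first
    i=$((from - 1))
    while [ "$i" -ge "$to" ]; do
      echo "delete ${name}-${i}"
      i=$((i - 1))
    done
  fi
}

scale_order postgres 3 5   # create postgres-3, then create postgres-4
scale_order postgres 5 3   # delete postgres-4, then delete postgres-3
```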
Rolling update
When you change the Pod template:
- postgres-2 (the highest ordinal) is updated
- Wait for postgres-2 to be Ready
- postgres-1 is updated
- Wait for postgres-1 to be Ready
- … and so on down to postgres-0
Each replica is updated in reverse ordinal order (highest first), waiting for Readiness before moving on.
The default update strategy is RollingUpdate with partition: 0 (update all). You can set partition: 2 to only update Pods with index ≥ 2 (canary).
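To see which Pods a given partition touches, the rule can be modelled in a few lines. A sketch only: update_order is a hypothetical helper, and it encodes two facts about RollingUpdate, that only ordinals ≥ partition are updated and that the controller works from the highest ordinal down:

```shell
# update_order NAME REPLICAS PARTITION
# Prints the Pods a RollingUpdate would replace, in the order it replaces them.
update_order() {
  name=$1
  replicas=$2
  partition=$3
  i=$((replicas - 1))
  while [ "$i" -ge "$partition" ] && [ "$i" -ge 0 ]; do
    echo "${name}-${i}"
    i=$((i - 1))
  done
}

update_order postgres 3 2   # canary: prints postgres-2 only
update_order postgres 3 0   # full rollout: postgres-2, postgres-1, postgres-0
```

A partition larger than the highest ordinal prints nothing, which matches the behaviour of a StatefulSet whose template change is held back entirely.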
The “broken” Pod behavior
StatefulSets have a podManagementPolicy that controls whether a Pod that is “broken” (failed, can’t become Ready) blocks the creation or deletion of other Pods. It applies to scaling operations only; rolling updates stay sequential under either policy.
OrderedReady (default)
- Pods are created and deleted strictly in order
- A broken Pod blocks the next one: the StatefulSet doesn’t move on
- If postgres-1 never becomes Ready, postgres-2 is not created
This is the safe default. If something’s wrong, you find out fast.
Parallel
- All Pods are created or deleted in parallel
- No ordering, and no waiting for Readiness between Pods
- Faster, but you can have multiple broken Pods
Use this when:
- The app doesn’t depend on startup order
- You want fast scale-up and scale-down
- You’re OK with a brief inconsistency window
Update strategies
RollingUpdate (default)
Updates Pods one at a time, in order, waiting for Readiness.
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0        # update all Pods with index >= 0
      maxUnavailable: 1   # allow 1 Pod to be down during update; honored only where the MaxUnavailableStatefulSet feature is enabled
partition: N means: only update Pods with index ≥ N. Use for canary rollouts:
# canary: update only Pods with index >= 2
kubectl patch statefulset postgres -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'
# change the template
kubectl set image statefulset postgres postgres=postgres:16
# only postgres-2 gets updated; postgres-0 and postgres-1 stay on the old version
# if canary looks good, update everyone
kubectl patch statefulset postgres -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'
OnDelete
Old Pods are kept until you manually delete them. The StatefulSet doesn’t replace them on its own; when you delete one, the controller recreates it from the new template.
Useful for manual upgrade procedures (e.g. you have a custom script that drains a Postgres replica, runs the migration, then deletes the Pod so the StatefulSet recreates it).
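A manual procedure of that kind could be scripted roughly as follows. This is a sketch under assumptions: kubectl already points at the right cluster and namespace, the app-specific drain or migration step is left as a placeholder, and upgrade_on_delete is a hypothetical name. The retry loop has no overall deadline, so add one for unattended use.

```shell
# upgrade_on_delete NAME REPLICAS
# With updateStrategy.type: OnDelete, replaces Pods one at a time, highest
# ordinal first, waiting for each replacement to become Ready.
upgrade_on_delete() {
  name=$1
  replicas=$2
  i=$((replicas - 1))
  while [ "$i" -ge 0 ]; do
    pod="${name}-${i}"
    # app-specific drain/migration steps would go here (not shown)
    kubectl delete pod "$pod" || return 1
    # the replacement may not exist yet right after the delete, so retry
    until kubectl wait --for=condition=Ready "pod/${pod}" --timeout=300s; do
      sleep 2
    done
    i=$((i - 1))
  done
}

# usage (not run here): upgrade_on_delete postgres 3
```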
The Pod identity, in detail
A Pod in a StatefulSet has:
- Name: <statefulset-name>-<ordinal> (e.g. postgres-0)
- Hostname: same as the name
- DNS name: <pod-name>.<service-name>.<namespace>.svc.cluster.local (e.g. postgres-0.postgres.default.svc.cluster.local); the Pod’s subdomain is the headless Service name
- Stable storage: the PVC from volumeClaimTemplates
- Ordinal: 0, 1, 2, … (assigned by the StatefulSet)
The ordinal is part of the Pod’s name but also implicit in the StatefulSet’s tracking. If you delete postgres-0 and the StatefulSet recreates it, the new Pod is postgres-0 (same name, same ordinal, different UID).
The Pod’s name is also a label:
statefulset.kubernetes.io/pod-name: postgres-0
The headless Service DNS, in detail
For a StatefulSet postgres with 3 replicas in namespace default, the headless Service provides:
# for each Pod
postgres-0.postgres.default.svc.cluster.local. A 10.244.1.5
postgres-1.postgres.default.svc.cluster.local. A 10.244.2.7
postgres-2.postgres.default.svc.cluster.local. A 10.244.3.9
# for the Service
postgres.default.svc.cluster.local. A 10.244.1.5
postgres.default.svc.cluster.local. A 10.244.2.7
postgres.default.svc.cluster.local. A 10.244.3.9
(The Service’s A records are all the backing Pods. The headless service doesn’t load-balance — it just returns all the IPs.)
This DNS setup is what makes stable identities useful: a client can connect to postgres-0.postgres.default.svc.cluster.local:5432 and always reach the Pod with ordinal 0. If that Pod is rescheduled its IP changes, but the name follows it.
PVC lifecycle
PVCs created from volumeClaimTemplates follow this lifecycle:
- StatefulSet is created with 3 replicas → 3 PVCs (data-postgres-0, data-postgres-1, data-postgres-2) are created
- A Pod is replaced (rescheduled, restarted, etc.) → it mounts the same PVC, the data is there
- StatefulSet is scaled down (3 → 1) → the Pods postgres-2 and postgres-1 are deleted, but their PVCs are retained
- StatefulSet is scaled up again (1 → 3) → Pods postgres-1 and postgres-2 are recreated, mounting the same PVCs, so data is preserved
- StatefulSet is deleted → the Pods are deleted, but the PVCs are not (the StatefulSet’s persistentVolumeClaimRetentionPolicy defaults to Retain)
To delete the PVCs, you have to do it manually:
kubectl delete pvc data-postgres-0 data-postgres-1 data-postgres-2
This is a feature, not a bug. The data outlives the StatefulSet.
PVC retention policy
You can control this:
spec:
  persistentVolumeClaimRetentionPolicy:
    whenScaled: Retain    # PVCs survive scale-down
    whenDeleted: Retain   # PVCs survive StatefulSet deletion
whenScaled: Delete would delete PVCs when the StatefulSet is scaled down. Almost never what you want: you’d lose data.
whenDeleted: Delete is also rarely correct. The whole point of StatefulSets is that data outlives Pods.
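To make the effect of whenScaled concrete, here is a small model of which claims disappear on a scale-down. It only illustrates the rule described above; pvcs_removed_on_scale_down is a hypothetical helper and does not talk to a cluster:

```shell
# pvcs_removed_on_scale_down TEMPLATE NAME FROM TO WHEN_SCALED
# Prints the PVCs that would be deleted when scaling from FROM to TO replicas.
# With whenScaled: Retain (the default) nothing is printed.
pvcs_removed_on_scale_down() {
  template=$1
  name=$2
  from=$3
  to=$4
  when_scaled=$5
  [ "$when_scaled" = "Delete" ] || return 0
  i=$((from - 1))
  while [ "$i" -ge "$to" ]; do
    echo "${template}-${name}-${i}"
    i=$((i - 1))
  done
}

pvcs_removed_on_scale_down data postgres 3 1 Retain   # prints nothing
pvcs_removed_on_scale_down data postgres 3 1 Delete   # data-postgres-2, data-postgres-1
```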
Headless Service vs regular Service
You can have both a headless Service (for Pod-to-Pod addressing) and a regular Service (for client access):
# headless Service: for direct Pod addressing
apiVersion: v1
kind: Service
metadata: { name: postgres }
spec:
  clusterIP: None
  selector: { app: postgres }
  ports:
    - port: 5432
      name: postgres
---
# regular Service: for client load balancing
apiVersion: v1
kind: Service
metadata: { name: postgres-lb }
spec:
  selector: { app: postgres }
  ports:
    - port: 5432
      name: postgres
In-cluster clients can use:
- postgres-lb.default.svc.cluster.local:5432 for load-balanced access (random Pod)
- postgres-0.postgres.default.svc.cluster.local:5432 for Pod 0 specifically
Init containers in StatefulSets
Init containers run per-Pod, like in Deployments. They run before the main container starts.
spec:
  template:
    spec:
      initContainers:
        - name: init-replica
          image: postgres:15
          command:
            - sh
            - -c
            - |
              # POD_NAME is e.g. "postgres-2"; keep what follows the last "-"
              ordinal="${POD_NAME##*-}"
              # ordinal 0 has no predecessor to wait for
              if [ "$ordinal" -gt 0 ]; then
                # per-Pod DNS name, resolved through the headless Service
                prev="postgres-$((ordinal - 1)).postgres"
                until pg_isready -h "$prev" -p 5432; do
                  sleep 2
                done
              fi
              # initialize this replica
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name   # e.g. "postgres-0"
This is the kind of thing you need to set up replica chains for Postgres or MySQL: replica N waits for replica N-1 to be ready.
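The ordinal extraction is plain parameter expansion and can be tried outside a cluster. A minimal sketch; ordinal_of and prev_replica_host are hypothetical helper names, and the second argument is the headless Service name:

```shell
# ordinal_of POD_NAME
# Prints the ordinal: everything after the last "-" in the Pod name.
ordinal_of() {
  echo "${1##*-}"
}

# prev_replica_host POD_NAME SERVICE
# Prints the short DNS name of the previous replica, or fails for ordinal 0.
prev_replica_host() {
  ord=$(ordinal_of "$1")
  [ "$ord" -gt 0 ] || return 1
  base=${1%-*}   # Pod name without the trailing "-<ordinal>"
  echo "${base}-$((ord - 1)).$2"
}

ordinal_of postgres-0                  # prints 0
prev_replica_host postgres-2 postgres  # prints postgres-1.postgres
```

Inside a StatefulSet Pod the hostname equals the Pod name, so ordinal_of "$(hostname)" works without any extra environment variable.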
Use cases
StatefulSets are for stateful apps. Some examples:
- Databases: Postgres, MySQL, MongoDB, Cassandra, ScyllaDB, CockroachDB
- Message queues: Kafka, RabbitMQ, NATS
- Coordination: ZooKeeper, etcd
- Search: Elasticsearch, Solr, OpenSearch
- Custom: any app that has per-replica identity and state
For all of these, you also typically need an operator to handle the operational complexity (failover, backup, scaling, etc.). A bare StatefulSet doesn’t do that.
When NOT to use a StatefulSet
- Stateless apps — use a Deployment
- One Pod per node — use a DaemonSet
- Run-to-completion tasks — use a Job
- The app doesn’t need stable identity — use a Deployment
If you can replace a Pod with another random Pod and nothing breaks, you don’t need a StatefulSet.
The “Pod stuck Pending” debugging
A common StatefulSet issue: a Pod is Pending because its PVC is Pending. This usually means the storage class can’t provision a volume. The PVC is waiting for a PV that matches.
kubectl get pvc -l app=postgres
# NAME              STATUS    VOLUME   CAPACITY   ACCESS MODES
# data-postgres-0   Pending                                     # waiting for PV
kubectl describe pvc data-postgres-0
# Events:
#   ProvisioningFailed: storageclass "gp3" not found
#   (or similar)
The fix is usually:
- Make sure the storage class exists
- Make sure the cluster can provision volumes
- Check for quota issues
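With many replicas it helps to pull out just the claims that are stuck. A small sketch that filters the tabular output; pending_pvcs is a hypothetical helper and assumes the usual NAME, STATUS column order of kubectl get pvc:

```shell
# pending_pvcs
# Reads `kubectl get pvc --no-headers` output on stdin and prints the names of
# claims whose STATUS column is Pending.
pending_pvcs() {
  awk '$2 == "Pending" { print $1 }'
}

# usage against a live cluster (not run here):
#   kubectl get pvc -l app=postgres --no-headers | pending_pvcs
```

Each name it prints is a candidate for kubectl describe pvc, where the Events section usually names the cause.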
The “Pod stuck terminating” debugging
When you delete a StatefulSet, its Pods are terminated, but Kubernetes gives no guarantee about the order (scale to 0 first if you need ordered shutdown). If one or more are stuck terminating:
kubectl get pods -l app=postgres
# NAME         READY   STATUS        RESTARTS   AGE
# postgres-2   1/1     Terminating   0          5m
# postgres-1   1/1     Terminating   0          6m
# postgres-0   1/1     Terminating   0          7m
A Pod is usually “stuck terminating” because:
- The container isn’t responding to SIGTERM (no graceful shutdown)
- A preStop hook is hanging
- A volume is stuck unmounting
# force-delete (last resort)
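kubectl delete pod postgres-0 --force --grace-period=0
This skips graceful termination: the Pod object is removed from the API without waiting for the kubelet to confirm that the container has stopped. The old process may still be running and holding the volume when the replacement postgres-0 starts, so treat this as a last resort.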
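Separately from unsticking a single Pod: deleting the StatefulSet object itself comes with no ordering guarantee, so when ordered, graceful shutdown matters the usual approach is to scale to 0 first. A sketch, assuming kubectl is configured and the default OrderedReady policy is in use; delete_statefulset_ordered is a hypothetical name:

```shell
# delete_statefulset_ordered NAME
# Scales the StatefulSet to 0 so Pods terminate one at a time from the highest
# ordinal down, waits for the last Pod to go away, then deletes the StatefulSet.
delete_statefulset_ordered() {
  name=$1
  kubectl scale statefulset "$name" --replicas=0 || return 1
  # ordinal 0 is the last Pod removed during an ordered scale-down
  kubectl wait --for=delete "pod/${name}-0" --timeout=600s || return 1
  kubectl delete statefulset "$name"
}

# usage (not run here): delete_statefulset_ordered postgres
```

The PVCs are still there afterwards unless the retention policy says otherwise.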
The “init container in wrong order” issue
If you have multiple StatefulSets with init containers that depend on each other, ordering can be tricky. The StatefulSet doesn’t know about other StatefulSets.
For example, if postgres-0’s init container waits for kafka-0, but the StatefulSets are created in the wrong order, the init container times out.
Solutions:
- Use a Job to wait for dependencies
- Use an operator to manage ordering
- Use init containers with longer timeouts and good error messages
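The last option can be as simple as a retry loop with a deadline and a message that says what was being waited for. A minimal sketch; wait_for is a hypothetical helper, and the host name in the usage comment is an assumption about how the dependency is exposed:

```shell
# wait_for TIMEOUT_SECONDS COMMAND [ARGS...]
# Retries COMMAND once per second until it succeeds or TIMEOUT_SECONDS elapse.
# Returns 0 on success, 1 on timeout (with a message on stderr).
wait_for() {
  timeout=$1
  shift
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
}

# usage inside an init container (assumed host and port):
#   wait_for 300 nc -z kafka-0.kafka 9092
```

A non-zero exit makes the init container fail visibly, and the kubelet restarts it according to the Pod’s restart policy instead of leaving it hanging silently.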
Gotchas
- A StatefulSet without a headless Service is missing its DNS. serviceName should name a Service with clusterIP: None. If that Service doesn’t exist, the Pods are still created, but they get no per-Pod DNS records, so peers can’t find each other by name.
- volumeClaimTemplates cannot be changed once the StatefulSet is created. You can change spec.template.spec.containers, but adding a new volumeClaimTemplate requires a new StatefulSet.
- A StatefulSet with a headless Service and a regular Service is a common pattern. The two Services must have different names (for example postgres and postgres-lb).
- The partition field in updateStrategy is per-StatefulSet, not per-Pod. A value of 2 means “update Pods with index ≥ 2”, so postgres-0 and postgres-1 stay on the old version.
- Scaling down doesn’t delete PVCs (by default). If you re-scale up, the data is there. If you want a clean slate, delete the PVCs manually.
- A StatefulSet doesn’t create or manage its headless Service. You create the Service separately. The StatefulSet just uses the name in DNS.
- podManagementPolicy: Parallel is faster but riskier. With OrderedReady, a broken Pod blocks the creation of the next one. With Parallel, all Pods are created at once, even if some are broken. Rolling updates are not affected by this setting.
- A StatefulSet’s Pod template can use hostPath, but it shouldn’t for stateful data. The data stays on one node, so a Pod rescheduled elsewhere starts empty. Use PVCs.
- The StatefulSet’s replicas field is optional and defaults to 1, the same as for a Deployment. If you omit it, you get a single Pod.
- Rolling update of a StatefulSet is per-Pod and sequential. For 10 replicas, the update takes roughly 10 × (termination time + startup time + time to become Ready). For a slow-starting app, this can be hours.
- Init containers can’t use the PVCs of other ordinals. Each Pod only sees its own PVC. You have to coordinate across Pods via the headless Service.
- The serviceName is required, not optional. Some tools assume it can be derived. It’s not.
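The rollout-time gotcha is simple arithmetic, and worth running with your own numbers before scheduling a maintenance window. A rough estimate only, assuming every Pod takes about the same time to terminate, start, and become Ready; rollout_seconds is a hypothetical helper:

```shell
# rollout_seconds REPLICAS PER_POD_SECONDS
# Rough estimate for a sequential rolling update: each Pod must terminate,
# start, and become Ready before the next one is touched.
rollout_seconds() {
  echo $(( $1 * $2 ))
}

rollout_seconds 10 90   # 10 replicas at 90s each: prints 900
```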
StatefulSet + operator
For real production, don’t run a StatefulSet alone for databases. Use an operator:
- Postgres: CloudNativePG, Zalando
- MySQL: MySQL Operator (Oracle), Percona
- MongoDB: MongoDB Operator
- Kafka: Strimzi
- Cassandra: Cass Operator
- Elasticsearch: ECK (Elastic Cloud on Kubernetes)
These operators wrap a StatefulSet (or set of StatefulSets) and add:
- Automated failover (when a primary dies, promote a replica)
- Backup and restore
- Scaling (with the right ordering)
- Version upgrades (with the right procedure)
- Monitoring (Prometheus metrics)
A bare StatefulSet gives you stable identity and per-replica storage, but the operational stuff is up to you.
See also
- Deployments — the alternative for stateless apps
- PersistentVolumeClaim — what StatefulSets use for storage
- Operators — for real database management
- Services — headless Services