How to actually back up and restore a k8s cluster, day to day. This is the “when things break, here’s how you fix it” guide. The patterns work for self-managed and cloud-managed clusters (with adjustments for who manages etcd).
What to back up
┌────────────────────────────────────────────────────────────┐
│ Cluster data                                               │
│ ├── etcd (the source of truth for k8s API objects)         │
│ ├── Persistent volumes (PVC contents)                      │
│ ├── Application configs (usually in git)                   │
│ └── Secrets (in etcd, but also external stores)            │
│                                                            │
│ Add-on state                                               │
│ ├── Cert-manager certificates                              │
│ ├── Argo CD / Flux state (in git, plus Redis/Postgres)     │
│ ├── Cluster autoscaler state                               │
│ └── Custom operator state (Datadog, etc.)                  │
│                                                            │
│ Application state                                          │
│ ├── Database contents (managed separately, usually)        │
│ ├── Object storage (S3, GCS)                               │
│ └── Queues / caches (Redis, Kafka)                         │
└────────────────────────────────────────────────────────────┘
For a k8s cluster backup: etcd + PVs. The rest is in git or external systems.
etcd backup
etcd is the database for the k8s API. Backing it up is the most important thing.
For self-managed (kubeadm)
# 1. find the etcd endpoints
kubectl -n kube-system get pods -l component=etcd -o wide
# NAME READY STATUS RESTARTS AGE IP NODE
# etcd-master-1 1/1 Running 0 30d 10.0.0.1 master-1
# etcd-master-2 1/1 Running 0 30d 10.0.0.2 master-2
# etcd-master-3 1/1 Running 0 30d 10.0.0.3 master-3
# 2. find the certs
ls /etc/kubernetes/pki/etcd/
# ca.crt ca.key server.crt server.key
# 3. take a snapshot
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
For cloud-managed (EKS, GKE, AKS)
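The provider runs etcd for you and gives you no direct access to it, so you can neither take etcd snapshots nor restore one. Whatever backups the provider keeps are for recovering its own control plane; don't count on them to roll back your mistakes.
If you need cluster-state backup on a managed cluster, use Velero to back up the k8s API objects (the data that lives in etcd). Protecting those objects is your job, not the provider's.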
Automating etcd backups
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 2 * * *" # daily at 2am
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: etcd-backup-sa
          # etcd listens on the control plane node's loopback address, so the
          # pod has to run on a control plane node and share its network
          hostNetwork: true
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
              effect: NoSchedule
          containers:
            - name: etcd-backup
              # the image needs a shell, etcdctl, and the aws CLI; the stock
              # etcd image has no aws CLI, so build or pick one that has all three
              image: registry.k8s.io/etcd:3.5.7-0
              command:
                - /bin/sh
                - -c
                - |
                  set -e
                  # compute the file name once so save, upload, and cleanup agree
                  SNAPSHOT=/backup/etcd-$(date +%Y%m%d-%H%M).db
                  ETCDCTL_API=3 etcdctl snapshot save "$SNAPSHOT" \
                    --endpoints=https://127.0.0.1:2379 \
                    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                    --cert=/etc/kubernetes/pki/etcd/server.crt \
                    --key=/etc/kubernetes/pki/etcd/server.key
                  # upload to S3
                  aws s3 cp "$SNAPSHOT" s3://my-etcd-backups/
                  # cleanup local
                  rm -f "$SNAPSHOT"
              volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
                - name: backup
                  mountPath: /backup
          restartPolicy: OnFailure
          volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd
                type: Directory
            - name: backup
              hostPath:
                path: /var/backups/etcd
                type: DirectoryOrCreate
Important: schedule backups when the cluster is quiet (early morning).
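A standalone version of the backup script can be easier to test than one embedded in YAML. The following is a minimal sketch, not a drop-in replacement: BACKUP_DIR, S3_BUCKET, and PKI_DIR are hypothetical environment variables introduced here, and it assumes etcdctl and the aws CLI are both on the PATH. It computes the file name once, so save, upload, and cleanup always refer to the same file, and it refuses to upload an empty snapshot.

```shell
#!/bin/sh
# Sketch only: BACKUP_DIR, S3_BUCKET and PKI_DIR are hypothetical knobs,
# defaulting to the paths used elsewhere in this guide.
BACKUP_DIR="${BACKUP_DIR:-/backup}"
S3_BUCKET="${S3_BUCKET:-s3://my-etcd-backups}"
PKI_DIR="${PKI_DIR:-/etc/kubernetes/pki/etcd}"

backup_etcd() {
  # Compute the name once; calling date again later could yield a
  # different minute and therefore a different file name.
  snapshot="${BACKUP_DIR}/etcd-$(date +%Y%m%d-%H%M).db"

  ETCDCTL_API=3 etcdctl snapshot save "$snapshot" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert="${PKI_DIR}/ca.crt" \
    --cert="${PKI_DIR}/server.crt" \
    --key="${PKI_DIR}/server.key" || return 1

  # Refuse to upload a missing or empty file.
  [ -s "$snapshot" ] || { echo "snapshot is empty: $snapshot" >&2; return 1; }

  # Upload, then remove the local copy only if the upload succeeded.
  aws s3 cp "$snapshot" "${S3_BUCKET}/" || return 1
  rm -f "$snapshot"
}
```

Call backup_etcd from cron or from the CronJob's command; a non-zero return means the snapshot was not uploaded and the local file, if any, is left in place for inspection.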
etcd backup gotchas
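- etcd v3 vs v2. Always set ETCDCTL_API=3 (it is the default since etcdctl 3.4). The v2 API is deprecated.
- The etcd pod’s filesystem has the data dir. You can’t just cp it from a running member; you need a consistent snapshot.
- Encryption at rest. The API server can encrypt Secrets and other resources before they reach etcd, but the key lives in the EncryptionConfiguration file or a KMS, not in the snapshot. Back it up separately.
- Multi-AZ etcd (3 nodes in 3 AZs): back up from any one healthy member. A snapshot is taken from a single endpoint.
- Backup size. etcd snapshots are usually small (a few MB to a few hundred MB for most clusters).
- Backup duration. A snapshot usually takes seconds, even for large etcds.
- Backup verification. A snapshot that can’t be restored is useless. Check integrity with etcdctl snapshot status, and test a real restore periodically.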
Verify an etcd snapshot
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db \
  --write-out=table
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 1a2b3c4d |    12345 |       1234 |     4.2 MB |
+----------+----------+------------+------------+
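If the snapshot is corrupt, this fails. It only checks the file's integrity; a test restore is the real proof. On etcd 3.5 and later, etcdutl snapshot status is the preferred command and the etcdctl form is deprecated.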
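The status check is easy to wrap so that a backup job fails loudly instead of silently storing a bad file. A small sketch, assuming etcdctl is on the PATH; verify_snapshot is a hypothetical helper name, not part of etcd:

```shell
# Sketch: succeed only if the snapshot file exists, is non-empty,
# and passes etcdctl's integrity check.
verify_snapshot() {
  snapshot="$1"
  [ -s "$snapshot" ] || { echo "missing or empty: $snapshot" >&2; return 1; }
  # snapshot status reads the file directly; no endpoint or certs are needed.
  ETCDCTL_API=3 etcdctl snapshot status "$snapshot" --write-out=table
}
```

For example, `verify_snapshot /backup/etcd-snapshot.db || exit 1` before uploading keeps an obviously broken file out of the bucket.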
Velero (the standard k8s backup tool)
Velero backs up k8s API objects and PVs. It’s the right tool for most clusters.
Install Velero
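# for AWS
# (--use-restic is the pre-1.10 flag; on Velero 1.10+ use --use-node-agent instead)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:<version> \
  --bucket my-velero-backups \
  --prefix velero \
  --secret-file ./credentials-velero \
  --use-restic \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1
# for GCP
velero install \
  --provider gcp \
  --plugins velero/velero-plugin-for-gcp:<version> \
  --bucket my-velero-backups \
  --prefix velero \
  --secret-file ./credentials-gcp \
  --use-restic
# for Azure (also needs --backup-location-config with your resource group and storage account)
velero install \
  --provider azure \
  --plugins velero/velero-plugin-for-microsoft-azure:<version> \
  --bucket my-velero-backups \
  --prefix velero \
  --secret-file ./credentials-azure \
  --use-restic
# backup PVCs with restic (filesystem-based)
# or use CSI snapshots if your CSI driver supports them
Schedule daily backups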
velero schedule create daily-full \
  --schedule="0 2 * * *" \
  --include-namespaces '*' \
  --ttl 720h # 30 days
On-demand backup
# back up a single namespace
velero backup create pre-upgrade-myapp \
  --include-namespaces myapp
# back up specific resources
velero backup create config-backup \
  --include-resources configmaps,secrets \
  --include-namespaces myapp
# back up everything
velero backup create full-cluster \
  --include-namespaces '*'
List and inspect backups
velero backup get
# NAME STATUS CREATED EXPIRES STORAGE LOCATION SELECTOR
# daily-full-2024... Completed 2024-01-15 02:00:00 +0000 UTC 29d default <none>
# pre-upgrade-... Completed 2024-01-15 14:00:00 +0000 UTC 29d default <none>
velero backup describe daily-full-2024-01-15-020000
# details: namespaces, resources, hooks, errors
velero backup logs daily-full-2024-01-15-020000
# step-by-step log
Restore
# restore from a backup
velero restore create --from-backup daily-full-2024-01-15-020000
# restore to a different namespace
velero restore create --from-backup daily-full-2024-01-15-020000 \
  --namespace-mappings myapp:myapp-restore
# restore specific resources
velero restore create --from-backup daily-full-2024-01-15-020000 \
  --include-resources deployments,services
Velero gotchas
- CSI snapshots vs Restic. Restic does file-level, slower for big PVs. CSI snapshots are block-level, faster, but require CSI driver support.
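- CRDs and custom resources are backed up by default, but the operator that reconciles them is only included if its namespace is in the backup. Restoring custom resources without a running operator leaves them in a “stuck” state.
- Velero restores everything in the backup by default, into the original namespaces, and skips objects that already exist. Use --include-namespaces to restore only some namespaces.
- PVs are restored with the same StorageClass. If the StorageClass doesn’t exist in the target cluster, the restored volumes can’t be provisioned and the restore fails.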
- Velero doesn’t back up application data in external systems (RDS, S3, etc.). Those are separate.
- Velero’s metadata is in etcd (Backup, Restore objects). If you restore etcd, the Velero CRDs come back too.
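Two of the gotchas above can be checked before a restore starts: that the backup actually finished, and that the target cluster has the StorageClasses the backup expects. A minimal sketch using kubectl, assuming Velero runs in the velero namespace (override with the hypothetical VELERO_NAMESPACE variable); the helper names and the StorageClass names in the usage line are made up for illustration.

```shell
# Sketch: read the phase Velero recorded on the Backup object.
backup_phase() {
  kubectl -n "${VELERO_NAMESPACE:-velero}" get backups.velero.io "$1" \
    -o jsonpath='{.status.phase}'
}

# Succeeds only for a clean backup; PartiallyFailed and Failed do not count.
backup_completed() {
  [ "$(backup_phase "$1")" = "Completed" ]
}

# Prints every StorageClass name from the arguments that the current
# cluster does not have. Empty output means all of them exist.
missing_storage_classes() {
  for sc in "$@"; do
    kubectl get storageclass "$sc" >/dev/null 2>&1 || echo "$sc"
  done
}
```

For example, run `backup_completed <backup> && [ -z "$(missing_storage_classes gp3 fast-ssd)" ]` against the target cluster, and only then `velero restore create`.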
PV backups
Cloud-native snapshots (CSI)
Most cloud CSI drivers support snapshots:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: my-pv-snapshot
spec:
  volumeSnapshotClassName: csi-aws-vsc
  source:
    persistentVolumeClaimName: my-pvc
Then restore:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc-restored
spec:
  dataSource:
    name: my-pv-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 100Gi
Pros: fast, consistent, supported by most CSI drivers.
Cons: snapshot lives in the cloud, not portable. Need a corresponding restore target.
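A VolumeSnapshot is not usable the moment the object is created; the driver sets status.readyToUse once the snapshot has been cut. A sketch of a polling helper, assuming kubectl access to the namespace. wait_for_snapshot and POLL_SECONDS are hypothetical names introduced here.

```shell
# Sketch: poll a VolumeSnapshot until status.readyToUse is true.
# Usage: wait_for_snapshot <name> <namespace> [attempts]
wait_for_snapshot() {
  name="$1"; namespace="$2"; attempts="${3:-30}"
  while [ "$attempts" -gt 0 ]; do
    ready="$(kubectl -n "$namespace" get volumesnapshot "$name" \
      -o jsonpath='{.status.readyToUse}' 2>/dev/null)"
    [ "$ready" = "true" ] && return 0
    attempts=$((attempts - 1))
    sleep "${POLL_SECONDS:-10}"   # POLL_SECONDS is a made-up tuning knob
  done
  echo "snapshot $name not ready" >&2
  return 1
}
```

Running `wait_for_snapshot my-pv-snapshot default` before creating the restored PVC avoids a claim that sits Pending on a snapshot that is still being taken.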
Velero with CSI snapshots
# install with CSI snapshot support
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:<version> \
  --bucket my-velero-backups \
  --secret-file ./credentials-velero \
  --features=EnableCSI
# backup with CSI snapshots
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: myapp-with-csi
  namespace: velero # Backup objects live in Velero's own namespace
spec:
  includedNamespaces: [myapp]
  snapshotMoveData: false # leave the snapshot in the storage backend; true copies its data to the backup bucket
  csiSnapshotTimeout: 10m
Restic (filesystem backup)
For PVs that can’t use CSI snapshots:
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:<version> \
  --bucket my-velero-backups \
  --secret-file ./credentials-velero \
  --use-restic
Restic reads files from the PV’s filesystem and uploads them. Slower than CSI snapshots but works for any PV.
For large PVs: Restic is slow. Use CSI snapshots if you can.
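How slow is a matter of arithmetic: backup time is roughly volume size divided by sustained throughput. A rough estimator, where the throughput figure is something you have to measure yourself; the 100 MiB/s used in the example is an assumption, not a Restic benchmark.

```shell
# Sketch: estimated hours to copy a volume at a sustained rate.
# Usage: estimate_hours <size in GiB> <throughput in MiB/s>
estimate_hours() {
  awk -v size_gib="$1" -v rate_mib_s="$2" \
    'BEGIN { printf "%.1f\n", size_gib * 1024 / rate_mib_s / 3600 }'
}

estimate_hours 1024 100   # 1 TiB at an assumed 100 MiB/s → 2.9
```

The estimate ignores deduplication and incremental runs, so treat it as an upper bound for a first full backup.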
What Velero doesn’t back up
Velero covers k8s API objects and PVs. It does not back up:
- Application data in external systems (RDS, DynamoDB, S3, etc.) — back those up separately
- In-cluster caches (Redis, Memcached) — design for these to be disposable
- Caches inside apps (LRU caches, in-memory state) — design for these to be rebuildable
- Logs stored in cluster (if using in-cluster Loki, etc.) — back those up separately
- Metrics (Prometheus TSDB) — not critical, can be rebuilt
- Custom controller state (operators) — usually in etcd, so it’s covered, but some have external state
A complete backup strategy
1. etcd backup (self-managed clusters)
   - Daily, 30-day retention
   - Encrypted, cross-region S3
   - Test restore quarterly
2. Velero (cluster state)
   - Daily, 30-day retention
   - Encrypted, cross-region S3
   - Test restore monthly
3. PV snapshots (stateful workloads)
   - Daily for critical data
   - Hourly for very critical
   - 7-day retention
   - Cross-region replication
4. Database backups (managed)
   - Use the cloud's built-in
   - Daily full + WAL/transaction logs
   - 30-day retention
   - Cross-region
5. Object storage (S3)
   - Versioning enabled
   - Cross-region replication
   - Lifecycle policies
6. Git
   - Source of truth
   - Multiple remotes (mirror to GitHub, GitLab, etc.)
Restoring a cluster
Scenario 1: lost a single resource
# Velero restore
velero restore create --from-backup <backup> --include-resources deployments,services
Scenario 2: lost a namespace
velero restore create --from-backup <backup> --include-namespaces myapp
Scenario 3: lost a node
For a lost node, remove it from the cluster. Pods owned by a controller (Deployment, StatefulSet, DaemonSet) are rescheduled onto the remaining nodes:
kubectl delete node <lost-node>
# controller-managed pods on that node will be rescheduled
Scenario 4: lost a control plane node (self-managed)
# 1. SSH to a surviving control plane
ssh master-2
# 2. use etcdctl to remove the dead member
#    (pass the same --endpoints/--cacert/--cert/--key flags as for the snapshot)
etcdctl member list
etcdctl member remove <dead-member-id>
# 3. if no survivors, restore from snapshot
# (see etcd restore below)
Scenario 5: lost the entire cluster
For self-managed:
- Re-provision the cluster (kubeadm init)
- Restore etcd from snapshot
- Reinstall add-ons (CNI, ingress, cert-manager, etc.)
- Velero restore for namespace contents
- Verify everything
For cloud-managed:
- Re-create the cluster (terraform, eksctl, gcloud, etc.)
- Velero restore for namespace contents
- Reinstall add-ons
- Verify
The etcd restore (kubeadm)
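# 1. stop the apiserver and etcd (only on the control plane you're restoring)
#    on kubeadm both are static pods, not systemd units, so move their manifests away
sudo mkdir -p /etc/kubernetes/manifests-stopped
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml \
  /etc/kubernetes/manifests/etcd.yaml \
  /etc/kubernetes/manifests-stopped/
# 2. move the existing data dir
sudo mv /var/lib/etcd /var/lib/etcd-old
# 3. restore the snapshot
#    (on etcd 3.5+, etcdutl snapshot restore is the preferred form)
sudo env ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --data-dir=/var/lib/etcd
# 4. the snapshot was restored to the original path, so etcd.yaml needs no change
#    (if you restore to a different dir, update the hostPath for the etcd-data volume)
# 5. put the manifests back; kubelet watches /etc/kubernetes/manifests/
#    and restarts etcd and the apiserver
sudo mv /etc/kubernetes/manifests-stopped/*.yaml /etc/kubernetes/manifests/
# 6. verify
kubectl get nodes
# should show all nodes
For multi-node etcd: restore to one node, then have the others rejoin.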
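An alternative to restore-and-rejoin, described in the etcd disaster recovery documentation, is to restore the same snapshot on every member, giving each one its own name and peer URL plus the full member list. The sketch below assumes kubeadm conventions (member name equals node hostname, peers on port 2380 over HTTPS). The helper names, the SNAPSHOT variable, and the etcd-restore cluster token are hypothetical.

```shell
# Sketch: build the --initial-cluster value from "name=ip" pairs.
initial_cluster() {
  out=""
  for pair in "$@"; do
    member="${pair%%=*}"; addr="${pair#*=}"
    out="${out:+$out,}${member}=https://${addr}:2380"
  done
  echo "$out"
}

# Run on each control plane node with that node's own name and IP,
# followed by the full member list.
# Usage: restore_member <this-name> <this-ip> <name=ip>...
restore_member() {
  this_name="$1"; this_ip="$2"; shift 2
  ETCDCTL_API=3 etcdctl snapshot restore "${SNAPSHOT:-/backup/etcd-snapshot.db}" \
    --name "$this_name" \
    --initial-cluster "$(initial_cluster "$@")" \
    --initial-cluster-token etcd-restore \
    --initial-advertise-peer-urls "https://${this_ip}:2380" \
    --data-dir=/var/lib/etcd
}
```

On master-2 from the earlier listing this would be `restore_member master-2 10.0.0.2 master-1=10.0.0.1 master-2=10.0.0.2 master-3=10.0.0.3`, run while the etcd static pod is stopped on every node.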
Backup verification
The most important step. An unverified backup is just a wish.
Test the etcd snapshot
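# run on a scratch machine with etcd and etcdctl installed
# (same version as the cluster's etcd)
# 1. restore the snapshot into a throwaway data dir
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --data-dir=/tmp/etcd-test
# 2. start a temporary etcd on the restored data (plain HTTP, localhost only)
etcd --data-dir=/tmp/etcd-test \
  --listen-client-urls=http://127.0.0.1:12379 \
  --advertise-client-urls=http://127.0.0.1:12379 \
  --listen-peer-urls=http://127.0.0.1:12380 &
ETCD_PID=$!
sleep 5
# 3. verify the data: the restored keyspace should list the k8s objects
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:12379 get /registry --prefix --keys-only | head
# 4. stop the temporary etcd and clean up
kill "$ETCD_PID"
rm -rf /tmp/etcd-test
Test Velero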
# 1. create a test cluster
kind create cluster --name velero-test
# 2. install Velero in the test cluster, pointing at the same bucket and prefix as production
velero install ...
# 3. wait for the backups to sync from the bucket, then confirm they are visible
velero backup-location get
velero backup get
# 4. restore
velero restore create --from-backup <backup>
# 5. verify
kubectl get all -A
# should match the original cluster
# ("get all" skips ConfigMaps, Secrets, and custom resources; check those separately)
The “I lost everything” runbook
- Don’t panic. The cloud or backup has it.
- Identify what’s lost. Cluster? Namespace? Specific resource?
- Re-create infrastructure. New cluster (cloud-managed) or restore etcd (self-managed).
- Restore from backup. Velero for k8s objects, snapshots for PVs.
- Verify. Check critical workloads are running.
- Re-point DNS if cluster endpoint changed.
- Communicate. Internal: status. Customer: ETA.
- Post-mortem. Within a week, what failed, what worked, what to change.
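The Verify step in the runbook can be partly automated by comparing object counts between the original cluster (if it is still reachable) and the restored one. This is a rough smoke test, not proof of a correct restore. It assumes both clusters are available as kubeconfig contexts; the helper names and the context names prod and restore are made up for illustration.

```shell
# Sketch: count objects of one kind in a namespace of a given context.
# Usage: count_objects <context> <namespace> <kind>
count_objects() {
  kubectl --context "$1" -n "$2" get "$3" --no-headers 2>/dev/null | wc -l | tr -d ' '
}

# Compare counts for several kinds; returns non-zero if any differ.
# Usage: compare_namespace <src-context> <dst-context> <namespace> <kind>...
compare_namespace() {
  src="$1"; dst="$2"; ns="$3"; shift 3
  status=0
  for kind in "$@"; do
    before="$(count_objects "$src" "$ns" "$kind")"
    after="$(count_objects "$dst" "$ns" "$kind")"
    if [ "$before" = "$after" ]; then
      echo "ok   $kind: $before"
    else
      echo "DIFF $kind: $before in $src, $after in $dst"
      status=1
    fi
  done
  return "$status"
}
```

For example: `compare_namespace prod restore myapp deployments services configmaps secrets pvc`. Matching counts say nothing about whether the pods are healthy, so follow up with a check of the critical workloads.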
Common gotchas
- Uninstalling an operator can delete its CRDs, and deleting a CRD deletes every custom resource of that type. Take a backup before uninstalling, and reinstall the operator before restoring its custom resources.
- etcd snapshot doesn’t include the encryption key. Back up the key separately.
- Restic backups are slow for large PVs. A 1TB PV can take hours.
- CSI snapshots are bound to the cloud. Can’t restore to on-prem without conversion.
- Restoring to a different k8s version can break things. Test compatibility.
- Velero’s Backup objects are in etcd. If you restore etcd, they come back. Useful, but can clutter.
- The backup process is a workload. It needs resources, scheduling, monitoring. Not “set and forget.”
- Backups taken during heavy write load are slower, and file-level PV backups (Restic) can capture files that changed mid-copy. Schedule for off-peak.
- The restore target cluster needs the same IAM/cloud permissions. Velero can’t snapshot PVs without the right IAM.
- Schedule backup retention policies with care. Some teams lose data because lifecycle policies deleted backups.
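Several of these gotchas come down to nobody noticing that backups stopped. A freshness check that can feed an alert is sketched below. It assumes Velero's Schedule object exposes the time of its latest backup as status.lastBackup, that Velero runs in the velero namespace, and that GNU date is available for parsing the timestamp; the function names are hypothetical.

```shell
# Sketch: whole hours since the schedule's most recent backup.
# Requires GNU date (for -d). Usage: hours_since_last_backup <schedule>
hours_since_last_backup() {
  last="$(kubectl -n "${VELERO_NAMESPACE:-velero}" get schedules.velero.io "$1" \
    -o jsonpath='{.status.lastBackup}')"
  [ -n "$last" ] || return 1            # no backup recorded yet
  last_s="$(date -u -d "$last" +%s)" || return 1
  now_s="$(date -u +%s)"
  echo $(( (now_s - last_s) / 3600 ))
}

# Succeeds if the latest backup is at most <max-hours> old.
# Usage: backup_is_fresh <schedule> <max-hours>
backup_is_fresh() {
  age="$(hours_since_last_backup "$1")" || return 1
  [ "$age" -le "$2" ]
}
```

For the daily schedule above, `backup_is_fresh daily-full 26 || echo "ALERT: daily-full is stale"` leaves a two-hour margin before paging anyone.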
See also
- disaster-recovery — the bigger picture
- upgrade-strategy — backup before upgrade
- security-baseline — encrypting backups
- multi-tenancy — per-tenant restore