Troubleshooting (L08 Overview)
“https://kubernetes.io/docs/tasks/debug/”
A high-level overview of the troubleshooting flow in Kubernetes. Use this as a quick reference for “where do I look when something is broken”. The deeper notes are linked below.
The first question
When something doesn’t work, the first question is always:
Is the Pod actually running?
Because:
- If the Pod is Pending, the problem is scheduling (resources, affinity, volumes, etc.)
- If the Pod is ContainerCreating, the problem is image pull / volume mount / secret mount
- If the Pod is CrashLoopBackOff, the problem is the app or its config
- If the Pod is Running but the app is broken, the problem is the app, the network, or the Service
Knowing which of these states the Pod is in is half the diagnosis.
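The mapping above can be sketched as a small shell function. This is a minimal sketch, not an established tool: the name triage_hint is hypothetical, and the status strings are assumed to be the ones kubectl get pods prints in its STATUS column.

```shell
# Hypothetical helper: map a Pod's STATUS column to the area worth checking first.
# The mapping mirrors the list above; it is a starting point, not a diagnosis.
triage_hint() {
  case "$1" in
    Pending)           echo "scheduling" ;;
    ContainerCreating) echo "image pull / volume mount / secret mount" ;;
    CrashLoopBackOff)  echo "app or config" ;;
    Running)           echo "app, network, or Service" ;;
    *)                 echo "unknown: start with kubectl describe pod" ;;
  esac
}
```

One way to feed it, assuming the default column layout (NAME, READY, STATUS, ...): triage_hint "$(kubectl get pod <pod> --no-headers | awk '{print $3}')"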
The decision tree
Pod is Pending
├── Insufficient resources (CPU, memory)
├── Node taint the Pod doesn't tolerate
├── Affinity / anti-affinity can't be satisfied
├── NodeSelector doesn't match any node
├── PVC is Pending (storage provision failed)
├── RuntimeClass not available
└── Scheduler can't keep up
Pod is ContainerCreating
├── Image pull error
├── Volume mount error
├── Secret / ConfigMap not found
├── Init container stuck
└── CNI not ready
Pod is CrashLoopBackOff
├── App exits with error
├── App OOM-killed
├── Liveness probe kills the app
├── Config / secret missing
└── Dependency not available
Pod is Running but not in Service endpoints
├── Readiness probe failing
├── Selector mismatch
└── Port mismatch
Pod is Running, in Service, but requests fail
├── App returns 5xx
├── App returns 4xx (config issue)
├── NetworkPolicy blocking
├── DNS not resolving
└── Service is in a different namespace
The “kubectl describe” reflex
The single most useful command for troubleshooting:
kubectl describe pod <pod-name>
Output sections, in order:
- Name, Namespace, Node, Labels, Annotations — basic metadata
- Status — phase (Pending / Running / etc.) and conditions
- Conditions — PodScheduled, Initialized, ContainersReady, Ready (each with True/False/Unknown)
- Containers — image, state, ready, restart count, last state
- Volumes — what’s mounted, source
- Events — at the bottom, chronological, often the only useful part
When asking for help with a Pod, paste the full kubectl describe output. It has everything needed to diagnose most issues.
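Since Events sit at the bottom of the output, a small filter can jump straight to them. This is a sketch under one assumption: the section starts at a line beginning with "Events:" and runs to the end of the output. The name events_only is hypothetical.

```shell
# Hypothetical helper: print only the Events section of `kubectl describe` output.
# Reads stdin; prints from the first line starting with "Events:" to the end.
events_only() {
  sed -n '/^Events:/,$p'
}
```

Typical use would be: kubectl describe pod <pod> | events_only. When asking for help, still paste the full output; the filter is only for your own quick reading.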
The “kubectl logs” reflex
The second most useful command:
# current container
kubectl logs <pod>
# all containers
kubectl logs <pod> --all-containers
# previous instance (if the container restarted)
kubectl logs <pod> --previous
# with timestamps
kubectl logs <pod> --timestamps
# follow
kubectl logs -f <pod>
The container’s stdout/stderr is in here. If the app doesn’t write to stdout, you won’t see anything. That’s an app problem, not a k8s problem, but a common one.
The “kubectl get events” reflex
For cluster-wide issues, the events stream is invaluable:
# all events, sorted
kubectl get events -A --sort-by=.lastTimestamp
# events for a specific object
kubectl get events --field-selector involvedObject.name=<pod>
# only warnings
kubectl get events -A --field-selector type=Warning
Events have a 1-hour TTL by default. If you need history, ship events to a log aggregator.
The four useful commands
These four commands cover 90% of troubleshooting:
# what is it doing?
kubectl describe pod <pod>
# what is it saying?
kubectl logs <pod> --previous
# what does the cluster think?
kubectl get events --field-selector involvedObject.name=<pod>
# can I reach it?
kubectl exec <pod> -- curl -v <service>.<ns>:<port>
When the problem is at the cluster level
# are the nodes OK?
kubectl get nodes
kubectl describe node <node>
# are the system Pods OK?
kubectl get pods -n kube-system
# is the API server responding? (/healthz still works but is deprecated in favor of /livez and /readyz)
kubectl get --raw='/readyz?verbose'
# is the CNI working?
kubectl get pods -n kube-system -l k8s-app=<cni>
For etcd health:
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
The “I have no idea” checklist
When you have no clue, go through this list:
- Look at kubectl describe pod <pod>. Read the whole output, especially Events.
- Look at kubectl logs <pod> --previous. If it’s crashlooping, the previous instance’s logs are gold.
- Look at events for the namespace: kubectl get events -n <ns> --sort-by=.lastTimestamp
- Check the node: kubectl describe node <node>. Look for MemoryPressure, DiskPressure, PIDPressure, NotReady.
- Check the system Pods: kubectl get pods -n kube-system. If any are broken, the cluster is broken.
- Check the CNI: kubectl get pods -n kube-system -l k8s-app=<cni>. If these are down, nothing has networking.
- Check DNS: kubectl run -it --rm debug --image=busybox -- nslookup kubernetes.default. If this fails, suspect CoreDNS first.
- Increase kubectl verbosity: kubectl -v=8 describe pod .... Shows the raw API calls, which is useful for authn/authz issues.
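The read-only part of the checklist above can be strung together in one function. This is a sketch, not a complete triage tool: the name triage and its two arguments (Pod name, namespace) are hypothetical, and it skips the CNI, DNS, and verbosity steps because those need cluster-specific labels or an extra Pod.

```shell
# Hypothetical wrapper around the checklist: runs the read-only checks in order.
# Arguments: <pod> <namespace>. Variables are global (plain POSIX sh, no `local`).
triage() {
  pod="$1"; ns="$2"
  kubectl describe pod "$pod" -n "$ns"
  # --previous fails if the container never restarted, so fall back to current logs
  kubectl logs "$pod" -n "$ns" --previous || kubectl logs "$pod" -n "$ns"
  kubectl get events -n "$ns" --sort-by=.lastTimestamp
  # the node name is empty while the Pod is still unscheduled
  node="$(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.nodeName}')"
  if [ -n "$node" ]; then
    kubectl describe node "$node"
  fi
  kubectl get pods -n kube-system
}
```

Called as triage <pod> <ns>, it prints everything in checklist order, so the output can be piped to a file and attached to a help request.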
The “kubectl debug” toolbox
Modern k8s has kubectl debug, which creates ephemeral debug containers:
# run an ephemeral debug container inside a running Pod, sharing the target container's process namespace
kubectl debug -it <pod> --image=busybox --target=<container>
# create a copy of a Pod with a debug container
kubectl debug <pod> --image=ubuntu --copy-to=<new-pod> --share-processes
# debug a node
kubectl debug node/<node> -it --image=ubuntu
# the Pod is named node-debugger-<node>-<suffix>; /host is the node's root filesystem
These don’t require the Pod to have a shell, an image with shell tools, or any modification.
The “kubectl auth” toolbox
For RBAC issues:
# can I do this?
kubectl auth can-i create deployments
# can a SA do this?
kubectl auth can-i list pods --as=system:serviceaccount:default:app
# what can this user do?
kubectl auth can-i --list --as=alice@example.com -n production
If a request is failing with 403 Forbidden, this is the first place to look.
Common issues and quick fixes
ImagePullBackOff
kubectl describe pod <pod>
# Events:
# Failed to pull image "registry.example.com/myapp:1.0":
# rpc error: code = Unknown desc = Error response from daemon:
# pull access denied for registry.example.com/myapp,
# repository does not exist or may require 'docker login'
Causes:
- Wrong image name / tag
- Image doesn’t exist
- No credentials for the registry
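- Firewall / proxy blocking the nodes from reaching the registry
Fixes:
- Verify the image: docker pull <image> (locally, for dev)
- Add imagePullSecrets
- Allow the nodes to reach the registry. Images are pulled by the container runtime on the node, so a Kubernetes NetworkPolicy does not affect image pulls.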
CrashLoopBackOff
kubectl logs <pod> --previous
# (whatever the app logged before dying)
kubectl describe pod <pod>
# Last State: Terminated, Reason: Error, Exit Code: 1
Causes:
- App error (config bad, code bug)
- OOM-killed (exit code 137)
- Liveness probe too aggressive
- Missing config / secret / dependency
Fixes:
- Read the logs
- Check resources.limits.memory if exit 137
- Disable liveness probe temporarily to confirm
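Exit code 137 follows the usual shell convention that codes above 128 mean "killed by signal (code minus 128)", so 137 is signal 9 (SIGKILL). A minimal sketch of that arithmetic, with the hypothetical name explain_exit:

```shell
# Hypothetical helper: decode a container exit code.
# Codes above 128 conventionally mean "killed by signal (code - 128)".
explain_exit() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    echo "signal $((code - 128))"
  else
    echo "exit $code"
  fi
}
```

SIGKILL is not exclusive to the OOM killer, so confirm with the Last State in kubectl describe pod: an OOM kill shows Reason: OOMKilled there.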
Pending
kubectl describe pod <pod>
# Events:
# 0/3 nodes are available: 3 Insufficient memory.
Causes:
- Not enough cluster capacity
- Node taints not tolerated
- Affinity / anti-affinity can’t be satisfied
- PVC can’t be bound (storage class issues)
Fixes:
- Add nodes (Cluster Autoscaler / Karpenter)
- Add tolerations
- Relax affinity
- Check the PVC
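The scheduler's event message usually says which of these causes applies. The sketch below maps a message to a cause by keyword. The name pending_cause is hypothetical, and the keywords are assumptions about typical scheduler wording, which varies between Kubernetes versions; only the "Insufficient" case is taken from the example above.

```shell
# Hypothetical helper: map a FailedScheduling message to the likely cause.
# Keyword patterns are assumptions; the first matching pattern wins.
pending_cause() {
  case "$1" in
    *Insufficient*)          echo "capacity" ;;
    *taint*)                 echo "taints" ;;
    *affinity*)              echo "affinity" ;;
    *PersistentVolumeClaim*) echo "pvc" ;;
    *)                       echo "unknown" ;;
  esac
}
```

A real message can list several reasons at once (one per group of nodes), so treat the result as the first thing to check rather than the only one.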
Service has no endpoints
kubectl describe svc <service>
# Endpoints: <none>
kubectl get pods -l app=<service-selector>
# (all in CrashLoopBackOff or ImagePullBackOff)
The Service has no Pods matching its selector that are ready. Check:
- Selector matches the Pod labels
- Pods are Ready (readiness probe passing)
- Port matches
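To run the three checks above in one go, it helps to print the Service's selector, its endpoints, and the Pods' labels next to each other. A minimal sketch, assuming a hypothetical function name svc_check that takes a Service name and a namespace:

```shell
# Hypothetical helper: show a Service's selector, its endpoints, and the Pods' labels.
# Arguments: <service> <namespace>.
svc_check() {
  svc="$1"; ns="$2"
  # the selector the Service uses to pick Pods
  kubectl get svc "$svc" -n "$ns" -o jsonpath='{.spec.selector}'
  echo
  # the Pod IPs and ports currently behind the Service (empty if none are Ready)
  kubectl get endpoints "$svc" -n "$ns"
  # the labels and READY column to compare against the selector
  kubectl get pods -n "$ns" --show-labels
}
```

If the selector and the labels differ by even one key or value, the Pod will not appear in the endpoints, whatever its readiness state.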
DNS not resolving
kubectl run -it --rm debug --image=busybox -- nslookup kubernetes.default
# Server: 10.96.0.10
# Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
# nslookup: can't resolve 'kubernetes.default'
Causes:
- CoreDNS Pods aren’t running
- NetworkPolicy blocks DNS egress
- dnsPolicy: Default in the Pod (uses node’s resolv.conf, not cluster DNS)
Fixes:
- Restart CoreDNS
- Allow UDP/TCP port 53 to kube-system in NetworkPolicy
- Set dnsPolicy: ClusterFirst
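Because the dnsPolicy value decides which resolver a Pod uses, it is worth reading it off the Pod before touching CoreDNS. The sketch below maps each of the four policy values to where lookups go. The name dns_policy_hint is hypothetical.

```shell
# Hypothetical helper: say where a Pod's DNS lookups go for a given dnsPolicy.
dns_policy_hint() {
  case "$1" in
    ClusterFirst)            echo "cluster DNS" ;;
    ClusterFirstWithHostNet) echo "cluster DNS (hostNetwork Pods)" ;;
    Default)                 echo "node resolv.conf" ;;
    None)                    echo "dnsConfig only" ;;
    *)                       echo "unrecognized" ;;
  esac
}
```

One way to feed it: dns_policy_hint "$(kubectl get pod <pod> -o jsonpath='{.spec.dnsPolicy}')". Note that a hostNetwork Pod left on ClusterFirst falls back to the node's resolver, which is why ClusterFirstWithHostNet exists.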
The notes in this level
→ kubectl Debug Toolkit: the commands you reach for
→ Common Failure Modes: the full triage guide
→ Metrics Sources: where observability data comes from