L08 — Operations
Day-2: things are running, and now you have to keep them running. This level covers the troubleshooting flow and the hooks you need to operate a cluster at scale.
What you’ll understand after this level
- A systematic troubleshooting flow — from “my pod isn’t working” to “the cluster is down”
- The standard set of kubectl debug commands and when to use each
- Where logs come from (container stdout, kubelet, control plane)
- Where metrics come from (cAdvisor, kubelet, metrics-server, kube-state-metrics)
- The most common failure modes and how to recognize them
- When to drop down to the node (crictl, journalctl, /var/log)
Notes in this level
| Note | Status | What’s in it |
|------|--------|--------------|
| Troubleshooting | ✅ | Decision tree for “my pod isn’t working” — the quick reference |
| kubectl Debug Toolkit | ✅ | describe, logs, exec, debug, ephemeral containers — the commands you reach for |
| Common Failure Modes | ✅ | Stage-by-stage triage guide, exit codes, escalation checklists |
| Metrics Sources | ✅ | Where metrics come from — cAdvisor, kubelet, metrics-server, kube-state-metrics, full stack |
Suggested reading order
- Common Failure Modes — start here, it’s the full decision tree
- kubectl Debug Toolkit — the commands you’ll use while doing the decision tree
- Metrics Sources — once you’re past “is it running”, to “is it healthy”
- Troubleshooting — the quick-reference version
Troubleshooting flow (the 30-second version)
Pod not working?
├── Is it scheduled?
│   └── Pending → resources? taints? affinity? PVC?
├── Is it creating?
│   └── ContainerCreating → image pull? volume? CNI?
├── Is it running?
│   └── Running but app broken → logs, readiness probe, Service, NetworkPolicy
└── Is it crashing?
    └── CrashLoopBackOff → exit code, previous logs, OOM, probe too aggressive
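
The flow above can be sketched as a small lookup function. This is only an illustration of the four branches, not part of any Kubernetes tooling: `PodStatus`, `CREATING_REASONS`, and `triage` are hypothetical names, and grouping `ImagePullBackOff` and `ErrImagePull` with `ContainerCreating` is an assumption made here to cover the "image pull?" question. The kubectl commands in the returned strings are standard ones, but the function only returns them as text and never runs anything.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PodStatus:
    """Hypothetical, simplified view of a pod (not a Kubernetes API type)."""
    phase: str                            # pod phase, e.g. "Pending" or "Running"
    waiting_reason: Optional[str] = None  # container waiting reason, if any


# Waiting reasons treated as "stuck creating" in this sketch (an assumption).
CREATING_REASONS = {"ContainerCreating", "ImagePullBackOff", "ErrImagePull"}


def triage(status: PodStatus, pod: str = "<pod>") -> List[str]:
    """Return suggested next steps, following the four branches of the flow."""
    describe = f"kubectl describe pod {pod}"

    # Look at the container waiting reason before the phase: the phase alone
    # does not separate "unscheduled" from "creating", or "healthy" from
    # "crash-looping".
    if status.waiting_reason == "CrashLoopBackOff":
        return [
            f"kubectl logs {pod} --previous",
            f"{describe}  # last state: exit code, OOMKilled?",
            "review probe timing: is a probe too aggressive?",
        ]
    if status.waiting_reason in CREATING_REASONS:
        return [f"{describe}  # events: image pull, volume, CNI errors"]
    if status.phase == "Pending":
        return [
            f"{describe}  # events: why the pod could not be scheduled",
            "check resources, taints, affinity, PVC",
        ]
    if status.phase == "Running":
        return [
            f"kubectl logs {pod}",
            "check readiness probe, Service, NetworkPolicy",
        ]
    # Anything else: start from the pod's events.
    return [describe]


if __name__ == "__main__":
    for step in triage(PodStatus("Running", "CrashLoopBackOff"), pod="web-0"):
        print(step)
```

Checking the waiting reason first is a deliberate choice in this sketch: a pod stuck in ContainerCreating still reports the Pending phase, so testing the phase first would send it down the scheduling branch.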
Where to go next
→ L09 — Advanced: how Kubernetes itself is built — controllers, operators, etcd, internals.
Tooling for observability and log routing (Prometheus, Grafana, Loki, Fluent Bit) lives in Guides — this level is about understanding the data sources, not deploying the stack.