L08 — Operations

Day-2 operations: things are running, and now you have to keep them running. This level covers the troubleshooting flow and the data sources (logs, metrics, node-level tools) you need to operate a cluster at scale.

What you’ll understand after this level

  • A systematic troubleshooting flow — from “my pod isn’t working” to “the cluster is down”
  • The standard set of kubectl debug commands and when to use each
  • Where logs come from (container stdout/stderr, kubelet, control plane)
  • Where metrics come from (cAdvisor, kubelet, metrics-server, kube-state-metrics)
  • The most common failure modes and how to recognize them
  • When to drop down to the node (crictl, journalctl, /var/log)
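
As a rough orientation, the sketch below maps each data source in the list above to a command that reads it. The helper names (`pod_logs`, `pod_prev_logs`, `pod_usage`, `node_cadvisor`) are hypothetical wrappers invented for this example; the `kubectl`, `journalctl`, and `crictl` commands they wrap are standard. It assumes a working kubeconfig, metrics-server installed for `kubectl top`, and, for the node-level commands, a systemd-managed kubelet and a CRI runtime with `crictl` configured.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; namespace defaults to "default".

# Container stdout/stderr, fetched through the API server.
pod_logs() { kubectl logs "$1" -n "${2:-default}" --all-containers; }

# Logs from the previous container instance (useful after a crash).
pod_prev_logs() { kubectl logs "$1" -n "${2:-default}" --previous; }

# Resource usage as served by metrics-server (fails if it is not installed).
pod_usage() { kubectl top pod "$1" -n "${2:-default}"; }

# Raw cAdvisor metrics from a node's kubelet, proxied through the API server.
node_cadvisor() { kubectl get --raw "/api/v1/nodes/$1/proxy/metrics/cadvisor"; }

# On the node itself (assumes systemd and crictl; run over SSH or a node shell):
#   journalctl -u kubelet --since "1 hour ago"   # kubelet logs
#   crictl ps -a                                 # all containers, including exited
#   crictl logs <container-id>                   # container logs without the API server
#   ls /var/log/pods/                            # log files the kubelet manages on disk
```

The node-level commands matter mostly when the API server path is unavailable or the kubelet itself is the suspect; otherwise the `kubectl` variants are usually enough.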

Notes in this level

| Note | Status | What’s in it |
|------|--------|--------------|
| Troubleshooting | ✅ | Decision tree for “my pod isn’t working”: the quick reference |
| kubectl Debug Toolkit | ✅ | describe, logs, exec, debug, ephemeral containers: the commands you reach for |
| Common Failure Modes | ✅ | Stage-by-stage triage guide, exit codes, escalation checklists |
| Metrics Sources | ✅ | Where metrics come from: cAdvisor, kubelet, metrics-server, kube-state-metrics, full stack |

Suggested reading order

  1. Common Failure Modes: start here; it is the full, stage-by-stage triage guide
  2. kubectl Debug Toolkit: the commands you’ll use while working through that guide
  3. Metrics Sources: for when the question shifts from “is it running?” to “is it healthy?”
  4. Troubleshooting: the quick-reference decision tree, for once you know the flow

Troubleshooting flow (the 30-second version)

Pod not working?
  ├── Is it scheduled?
  │   └── Pending → resources? taints? affinity? PVC?
  ├── Is it creating?
  │   └── ContainerCreating → image pull? volume? CNI?
  ├── Is it running?
  │   └── Running but app broken → logs, readiness probe, Service, NetworkPolicy
  └── Is it crashing?
      └── CrashLoopBackOff → exit code, previous logs, OOM, probe too aggressive
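
The tree above can be sketched as a small shell helper. This is an illustration under assumptions, not a tool from this level: `pod_status`, `triage_hint`, and `last_exit_code` are hypothetical names, and the hint strings only restate the branches of the tree. `pod_status` reads the STATUS column of `kubectl get pod` rather than `.status.phase`, because the phase field does not show values such as `ContainerCreating` or `CrashLoopBackOff`.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical; namespace defaults to "default".

# STATUS column (third field) of `kubectl get pod` for a single pod.
pod_status() {
  kubectl get pod "$1" -n "${2:-default}" --no-headers | awk '{print $3}'
}

# Map a status to the first things to check, following the tree above.
triage_hint() {
  case "$1" in
    Pending)
      echo "scheduling: kubectl describe pod <pod> (Events: resources, taints, affinity, PVC)" ;;
    ContainerCreating)
      echo "creation: kubectl describe pod <pod> (Events: image pull, volume, CNI)" ;;
    Running)
      echo "app: kubectl logs <pod>; check readiness probe, Service, NetworkPolicy" ;;
    CrashLoopBackOff)
      echo "crash: kubectl logs <pod> --previous; check exit code, OOM, probe settings" ;;
    *)
      # Statuses outside the four branches of the tree land here.
      echo "other: kubectl describe pod <pod>" ;;
  esac
}

# Exit code of the first container's last terminated instance (empty if none).
last_exit_code() {
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
}
```

Usage would look like `triage_hint "$(pod_status my-pod my-namespace)"`, where `my-pod` and `my-namespace` are placeholders. An exit code of 137 from `last_exit_code` means the container was killed with SIGKILL (128 + 9), which is commonly, though not always, an OOM kill.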

Where to go next

L09 — Advanced: how Kubernetes itself is built — controllers, operators, etcd, internals.

Tooling for observability and log routing (Prometheus, Grafana, Loki, Fluent Bit) lives in Guides — this level is about understanding the data sources, not deploying the stack.