A slow cluster usually isn’t slow because of k8s — it’s slow because the apps are over-provisioned, under-provisioned, or fighting the scheduler. This note covers the practical levers to make workloads fast.
The 4 resources
Every container has four resources to tune:
- CPU — request (used for scheduling) and limit (throttling)
- Memory — request (used for scheduling) and limit (OOM kill)
- Ephemeral storage —
/tmp, container layers, working directory - Network — implicit, but can be a bottleneck
┌────────────────────────────────────────────────────┐
│ Container │
│ │
│ requests: "guaranteed" scheduling unit │
│ limits: "ceiling" runtime cap │
│ │
│ CPU: │
│ request = scheduler reserves this │
│ limit = throttled to this (or unlimited) │
│ │
│ Memory: │
│ request = scheduler reserves this │
│ limit = OOM kill if exceeded │
│ │
└────────────────────────────────────────────────────┘
The QoS classes
Kubernetes assigns pods to one of three QoS classes based on requests and limits. This affects scheduling and eviction behavior.
| QoS class | When assigned | Eviction order |
|---|---|---|
| Guaranteed | requests == limits (both CPU and memory) | Last |
| Burstable | requests < limits, or only one is set | Middle |
| BestEffort | No requests, no limits | First |
Guaranteed pods are the most stable but most expensive. They get scheduled onto dedicated resources and are last to be evicted under pressure.
Burstable is the default for most production apps. They get scheduled on shared resources and can burst to their limits.
BestEffort is rarely used in production — pods can use any available resource, but get killed first.
# Guaranteed
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 500m
memory: 512Mi
# Burstable
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 1
memory: 1Gi
# BestEffort
# no resources blockBest practice for production: requests are mandatory, limits are usually set but higher than requests (Burstable). Guaranteed is for latency-sensitive, predictable workloads.
CPU: the four configurations
Configuration 1: No requests, no limits (BestEffort)
# Don't do this in production
containers:
- name: web
image: myorg/web:v1The pod uses whatever’s free. Under load, it can use the whole node. Under pressure, it’s killed first.
Configuration 2: Requests, no limits
resources:
requests:
cpu: 500m
memory: 512Mi
# no limitsThe pod is scheduled based on 500m CPU. It can use more if available, no upper cap. Common for CPU-bound batch jobs.
Configuration 3: Requests < limits (Burstable, default)
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 1
memory: 1GiScheduled based on 500m, can burst to 1 CPU. Standard for most production apps.
Configuration 4: Requests == limits (Guaranteed)
resources:
requests:
cpu: 1
memory: 1Gi
limits:
cpu: 1
memory: 1GiNo throttling, no OOM risk. Most expensive, most predictable.
CPU throttling
The Linux kernel’s CFS (Completely Fair Scheduler) throttles containers that exceed their CPU limit. Throttling is silent — the app just runs slower.
Symptoms:
# high CPU throttling on the container
$ rate(container_cpu_cfs_throttled_seconds_total[5m])
# 0.45 <-- 45% of CPU periods are throttledCommon causes:
-
CPU limit too low. App legitimately needs more.
$ kubectl top pod web-1 # CPU: 980m (limit was 1, app is hitting it)Fix: increase the limit, or set
Burstable: requests < limits. -
Multi-threaded app with one core limit. A Java app with 16 threads on a 1-CPU pod thrashes. Fix: set CPU limit = expected concurrent threads.
-
GC or other periodic CPU spikes. The JVM or runtime has periodic CPU bursts. Fix: set requests/limits to accommodate bursts.
CPU limit anti-pattern: setting CPU limits too aggressively is a common cause of mysterious slowness. Some teams run without CPU limits entirely (Burstable with no CPU limit) and let the kernel CFS handle it without explicit throttling.
Memory: OOMKilled
Memory is different from CPU — there’s no throttling, just OOM kill. If a pod exceeds its memory limit, the kernel kills it (OOMKilled).
Symptoms:
$ kubectl describe pod web-1 | tail -10
Last State: Terminated
Reason: OOMKilled
Exit Code: 137Common causes:
- Memory limit too low. App needs more.
- Memory leak. App allocates more over time, never frees.
# monitor memory over time $ watch -n 5 'kubectl top pod web-1' # if memory keeps growing, leak - Spike during startup. App uses a lot of memory during init, settles down.
# check if startup memory is high $ kubectl top pod web-1 --containers # if startup memory > steady-state, request = steady-state, limit = startup - JVM heap misconfigured. JVM’s
-Xmxis bigger than the container limit.# JVM's max heap is 1/4 of host memory by default # if container limit is 1Gi but host is 64Gi, JVM tries to use 16Gi
Fix patterns:
- Set the limit above the steady-state usage, but below the OOM risk
- For JVMs, set
-Xmxexplicitly (e.g.,-Xmx400mfor a 512Mi limit) - For leaks, fix the leak
- For startup spikes, set limit to startup, request to steady-state
Memory QoS subtleties
Linux has two memory limits: cgroup limit (the k8s memory.limit_in_bytes) and cgroup request (memory.low or memory.min). K8s uses different sub-cgroups for different QoS classes:
- Guaranteed — memory.min is set (reserved)
- Burstable — memory.low is set (best-effort reservation)
- BestEffort — no reservation
What this means: BestEffort pods are the first to be killed under memory pressure. Burstable pods are second. Guaranteed pods are last.
For predictable performance: run critical pods as Guaranteed, or as Burstable with high requests.memory.
Ephemeral storage
The /tmp, container layers, and working directory on the node’s disk. If a pod fills this, the kubelet evicts it.
Symptoms:
$ kubectl describe pod web-1 | tail
Last State: Terminated
Reason: Error
Message: container exceeded its ephemeral storage limitCommon causes:
- Log files in
/tmpor stdout. Apps that write verbose logs. - Image layers. Old images not garbage collected.
- Working directory data. Apps that write to their CWD.
Fix:
resources:
requests:
ephemeral-storage: 1Gi
limits:
ephemeral-storage: 2GiOr set node-level ephemeral storage reservation. Or use a PVC for actual data, ephemeral storage for caches only.
Network performance
K8s itself doesn’t tune the network, but there are k8s-aware patterns:
- CNI choice matters. Cilium (eBPF) is generally faster than Flannel (VXLAN). Calico is in between.
- Service mesh overhead. Istio’s sidecar adds latency (~1-2ms per hop). Linkerd is lighter.
- Cross-AZ traffic. Pod in zone A talking to pod in zone B incurs inter-AZ latency (~1-2ms) and egress cost. Topology spread to keep traffic in-zone.
# topology spread for low-latency
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: webJVM-specific tuning
JVMs need explicit tuning for containers. The default behavior is wrong for k8s.
env:
- name: JAVA_OPTS
value: >-
-XX:+UseContainerSupport
-XX:MaxRAMPercentage=75.0
-XX:+ExitOnOutOfMemoryError
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/heapdump.hprof
-Xss512k
resources:
requests:
cpu: 500m
memory: 1Gi
limits:
cpu: 2
memory: 2GiMaxRAMPercentage=75— JVM uses 75% of container memory for heap. The other 25% is for non-heap (metaspace, threads, code cache).UseContainerSupport— JVM respects cgroup limits (default in Java 11+).ExitOnOutOfMemoryError— exit cleanly on OOM (lets the kubelet restart you).HeapDumpOnOutOfMemoryError— captures the heap state for postmortem.
Node-level tuning
The node’s kernel and container runtime have settings that affect performance.
Kernel parameters (sysctl)
# pod spec (privileged) or via node-level config
securityContext:
sysctls:
- name: net.core.somaxconn
value: "65535"
- name: net.ipv4.tcp_max_syn_backlog
value: "65535"
- name: vm.swappiness
value: "1"Or at the node level via kubelet config.
Container runtime
# /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "registry.k8s.io/pause:3.9"
max_container_log_line_size = 16384Disk performance
- Use local NVMe for high-IO workloads (caching, databases that need fast disk)
- Network-attached storage (EBS, GCE PD) has latency. gp3 has 125-1000 IOPS, 125-1000 MB/s. For higher, use io2.
- Avoid PVCs on busy nodes. EBS volumes have per-instance limits (28 by default on AWS).
Application-level performance
Beyond k8s, the application itself is usually the bottleneck.
Profiling
- CPU profiling —
perf, async-profiler (Java), pprof (Go), py-spy (Python) - Memory profiling — heap dumps, jemalloc stats, valgrind
- Distributed tracing — Jaeger, Zipkin, Tempo. Show which span is slow.
Common app-level issues
- Synchronous calls in hot path. A web request makes 5 database calls, each 100ms. Total 500ms. Cache or batch.
- Connection pool starvation. App has 10 DB connections, 100 concurrent requests, everyone waits.
- Lock contention. Shared lock across all requests. Move to per-key locks, or no locks.
- Garbage collection pauses. JVM can pause for seconds. Tune GC, use G1/ZGC/Shenandoah.
- Slow external dependencies. Calling out to 3rd party APIs that take 1s each.
The “is it the app or the platform?” test
# 1. is CPU saturated?
kubectl top pod web-1
# CPU: 1000m (maxed out)
# 2. is memory OK?
kubectl top pod web-1
# MEMORY: 500Mi (plenty of headroom)
# 3. is the node OK?
kubectl describe node node-1 | grep -A 5 "Allocated resources"
# CPU Requests: 30 cores (60% used)
# Memory Requests: 60Gi (50% used)
# 4. is the app slow even when not under load?
# (in staging, with no traffic, hit it directly)
kubectl port-forward pod/web-1 8080:8080
ab -n 1000 -c 10 http://localhost:8080/
# requests per second, latency
# 5. compare with another pod
# (do the same test on a different pod, see if it's specific)Latency tuning
For latency-sensitive services, every millisecond matters.
Levers:
-
CPU pinning (static CPU manager policy) — pod gets dedicated cores, no scheduling noise
# kubelet config cpuManagerPolicy: static# pod resources: requests: cpu: 4 # pod gets 4 dedicated cores -
Pod QoS = Guaranteed — no throttling, no eviction
-
Topology spread — keep traffic in-zone
-
Local node for caches — use
nodeAffinityto pin cache pods to specific nodes -
Disable swap — swap adds latency, and kubelet doesn’t support it well
Throughput tuning
For high-throughput services, parallelism matters more than per-request latency.
Levers:
- HPA scale-out — more replicas, more concurrent requests
- Bigger nodes — fewer cross-node communications
- Connection pool sizing — per-pod, sized for expected concurrency
- Async processing — push work to queues, process in batches
Profiling at the k8s layer
eBPF tools
Modern observability tools use eBPF for deep visibility:
- Pixie (
px.dev) — New Relic’s eBPF-based k8s observability - Parca — continuous profiling
- Cilium’s Hubble — network observability
- Inspektor Gadget — k8s debugging toolkit
# install Pixie
bash -c "$(curl -fsSL https://withpixie.ai/get-pixie.sh)"
# list running processes and their resources
px run px/host
# profile a specific service
px run px/profile -n my-ns --service webperf for node-level
ssh node-1
$ perf top
# see what's consuming CPU on the nodeCommon gotchas
requests.cpu: 0(or no requests) means the scheduler can pack pods unlimited onto a node. Under load, the node thrashes.- Memory limit too high doesn’t hurt much. Memory limit too low causes OOM. Err on the side of higher.
- CPU limit too high doesn’t hurt much. CPU limit too low causes throttling. Err on the side of higher.
- JVMs default to 1/4 host memory for heap. On a 64Gi node, a 1Gi container limit, the JVM tries 16Gi heap. Set
-Xmxexplicitly. - Java 8 needs
-XX:+UseContainerSupportto respect cgroup limits. Java 11+ has it on by default. - Go programs don’t GC-limit well by default. Set
GOMEMLIMIT(Go 1.19+). - Don’t set requests too high “for safety.” Over-provisioning wastes resources.
- The scheduler only knows requests. If your app uses 1 CPU but requests 2, the scheduler thinks you need 2x what you do.
- CPU throttling is silent. You can be throttled and never see an error.
- Memory pressure on a node causes kubelet to evict pods, even BestEffort ones. Guaranteed pods are last to go.
cpuManagerPolicy: staticis a kubelet setting, not a pod setting. It affects all pods on the node.- The HPA and VPA both need requests to work. Set them.
topology.kubernetes.io/regionandtopology.kubernetes.io/zoneare the standard labels, but they’re populated by the cloud provider. Self-managed clusters may not have them.
A worked example
App: Java Spring Boot web service, processing payments.
Problem: P99 latency 1.5s, occasionally 5s. Spikes correlated with GC.
$ kubectl logs -l app=payments | grep "GC pause"
[GC (Allocation Failure) 524288K->262144K(524288K), 0.345 secs]
[GC (G1 Evacuation Pause) 1G->512M(1G), 5.123 secs] <-- 5 second pause5-second GC pause = the entire pod is unresponsive for 5 seconds.
Diagnosis:
# 1. check current resource config
$ kubectl get pod -o jsonpath='{.spec.containers[0].resources}' | jq .
{
"limits": {
"memory": "2Gi"
}
# no requests, no CPU limit
}
# 2. JVM options
$ kubectl exec payments-1 -- env | grep JAVA
JAVA_OPTS=-Xmx1500m <-- but cgroup is 2Gi, JVM tries 1.5GiIssues:
- No requests — scheduler can over-pack
- No CPU limit — fine, but no throttling means GC isn’t CPU-constrained
- JVM heap = 1.5Gi, but cgroup is 2Gi, no headroom for non-heap
Fix:
env:
- name: JAVA_OPTS
value: >-
-XX:+UseG1GC
-XX:MaxRAMPercentage=70.0
-XX:+ExitOnOutOfMemoryError
resources:
requests:
cpu: 1
memory: 1.5Gi # 1.5Gi: realistic steady-state
limits:
cpu: 2
memory: 2Gi # 2Gi: allows headroomResult: P99 latency drops to 200ms. GC pauses still happen but are <100ms.
See also
- auto-scaling — HPA on resources
- cost-optimization — right-sizing is cost
- crashloop-backoff — OOMKilled diagnostics