Auto-Scaling

Four layers of scaling work in concert: HPA scales pods, VPA rightsizes pod resources, Cluster Autoscaler / Karpenter scales nodes, KEDA scales based on event sources. Used together, they form an elastic system. Misused, they fight each other.

The four scalers

Scaler	Scales	Trigger	Status
HPA (HorizontalPodAutoscaler)	Pod replicas	CPU, memory, custom metrics	GA
VPA (VerticalPodAutoscaler)	Pod resources (requests/limits)	Historical usage	Beta
Cluster Autoscaler (CA)	Nodes	Pending pods (unschedulable)	GA
Karpenter	Nodes	Pending pods (direct provisioning)	GA
KEDA	Pod replicas (via HPA)	External event sources (Kafka, SQS, cron, etc.)	GA

                          ┌─────────────────┐
                          │   Application   │
                          │   traffic       │
                          └────────┬────────┘
                                   │
        ┌──────────────────────────┼──────────────────────────┐
        │                          │                          │
        ▼                          ▼                          ▼
   ┌─────────┐              ┌─────────────┐            ┌─────────────┐
   │   HPA   │              │  Cluster    │            │    KEDA    │
   │  more   │              │  Autoscaler │            │  scale on  │
   │  pods   │              │  / Karpenter│            │  events    │
   └────┬────┘              └──────┬──────┘            └──────┬──────┘
        │                         │                          │
        │      Pending pods      │      Schedulable         │
        │ ◄──────────────────────┤      capacity            │
        │                         │      changes             │
        ▼                         ▼                          ▼
   ┌─────────────────────────────────────────────────────────────┐
   │                          Cluster                            │
   │                                                             │
   │   ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐        │
   │   │  Pod    │  │  Pod    │  │  Pod    │  │  Pod    │        │
   │   └─────────┘  └─────────┘  └─────────┘  └─────────┘        │
   │                                                             │
   │   ┌────────────┐           ┌────────────┐                   │
   │   │   Node 1   │           │   Node 2   │                   │
   │   └────────────┘           └────────────┘                   │
   └─────────────────────────────────────────────────────────────┘

VPA is the orthogonal one: it doesn’t change pod count, it changes the resources requested/limited on each pod.

HPA — Horizontal Pod Autoscaler

HPA scales the number of pod replicas in a Deployment, StatefulSet, ReplicaSet, or ReplicationController. It uses the Metrics API to get resource usage, then computes the desired replica count.

The algorithm

desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)

With targetMetricValue: 70 (target 70% CPU), currentReplicas: 4, currentMetricValue: 90 (90% CPU):

desiredReplicas = ceil(4 * 90 / 70) = ceil(5.14) = 6

Cap at maxReplicas. Floor at minReplicas.

Basic HPA

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5min before scaling down
    scaleUp:
      stabilizationWindowSeconds: 0     # scale up immediately
      policies:
      - type: Percent
        value: 100                        # can double
        periodSeconds: 60

What HPA needs

metrics-server (for CPU/memory) — most clusters have this.
Pod resource requests — HPA computes utilization as usage / request. If requests are missing, the math doesn’t work.
Custom metrics adapter (for non-CPU/memory) — Prometheus Adapter, KEDA, etc.

# verify HPA can read metrics
kubectl get hpa web-hpa
# NAME      REFERENCE          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
# web-hpa   Deployment/web     45%/70%   2         20        4          1h
 
# if "TARGETS" is "<unknown>/70%", metrics-server isn't working

Common HPA mistakes

No resource requests set on pods. HPA needs requests.cpu to compute utilization.
CPU-only scaling for a memory-bound app. Profile your app; scale on the right metric.
Aggressive scale-up + slow scale-down = flapping. Set stabilizationWindowSeconds to dampen.
HPA + VPA on CPU/memory = conflict. Don’t run both on the same metric.
Scaling a StatefulSet with persistent volumes. Each replica might need its own PV. HPA works but PV allocation needs to scale with it.

HPA on custom / external metrics

The Metrics API is extensible. Common adapters:

Prometheus Adapter — query Prometheus, expose as Metrics API

- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "100"

KEDA — see below
Cloud-specific — AWS CloudWatch, GCP Stackdriver (via adapters)

Setup is non-trivial: deploy the adapter, configure the APIService, register the metrics. The win is scaling on business metrics (queue depth, RPS, error rate) instead of just CPU.

VPA — Vertical Pod Autoscaler

VPA rightsizes the resources (requests/limits) on each pod. It watches historical usage and recommends or applies better values.

Three modes

off — recommendations only, no action
initial — set requests on pod creation, never update
auto — evict pods and recreate with updated requests (DISRUPTIVE)

Basic VPA

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  updatePolicy:
    updateMode: "Auto"      # or "Off" for recommendations only
  resourcePolicy:
    containerPolicies:
    - containerName: web
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2
        memory: 4Gi

When to use VPA

Stateless workloads with predictable resource patterns
Right-sizing new deployments — start with Off, look at recommendations, set manually
Memory leaks — VPA won’t help with a leak, but it will tell you the leak’s growth rate
Cost reduction — many clusters have over-provisioned pods; VPA finds the right size

When NOT to use VPA

HPA is using CPU/memory — conflict. VPA and HPA on the same metric will fight.
Stateful workloads with PVCs — VPA evicts pods to apply new requests; this disrupts the workload.
Latency-sensitive services — eviction causes a brief capacity dip.
Workloads with sidecars — VPA can misjudge if it doesn’t account for sidecar resources.

VPA + HPA: the right combination

Use them on different metrics:

# HPA on CPU
metrics:
- type: Resource
  resource:
    name: cpu
    target: { type: Utilization, averageUtilization: 70 }
 
# VPA on memory
# (VPA is in "Auto" mode for memory only)

Or use VPA in Off mode (recommendations) and HPA on a custom metric (RPS, queue depth, etc.).

Cluster Autoscaler (CA)

CA watches for unschedulable pods and adds nodes. When nodes are underutilized, it removes them.

How it works

pending pod
    ↓
CA sees it
    ↓
CA finds the best node group (based on pod requirements)
    ↓
CA asks the cloud to add nodes (via ASG, MIG, etc.)
    ↓
new nodes come up
    ↓
kube-proxy, CNI, kubelet bootstrap
    ↓
pending pod schedules

When to use CA

Cloud-managed clusters (EKS, GKE, AKE) — CA is built-in or trivial to enable
Stable, predictable workloads — CA is conservative; takes minutes to scale up
Cost-conscious — CA removes underutilized nodes after 10+ minutes

CA config

# EKS — already integrated
# Just create node groups with min/max/desired
 
# Standalone CA deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  template:
    spec:
      containers:
      - name: cluster-autoscaler
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0
        command:
        - ./cluster-autoscaler
        - --v=4
        - --cloud-provider=aws
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<cluster-name>
        - --balance-similar-node-groups
        - --expander=least-waste
        - --scale-down-delay-after-add=10m
        - --scale-down-unneeded-time=10m

CA gotchas

Scale-up takes 3-5 minutes (cloud LB provisioning, image pull, kubelet bootstrap). Don’t expect instant.
Scale-down is conservative — 10 minutes of underutilization by default. Adjust for your workload.
Node groups must be tagged correctly for auto-discovery.
Bin-packing is greedy — CA adds nodes that fit the largest pending pod, may not be optimal.
Spot instances + CA — possible, but spot reclamation causes pod rescheduling.

Karpenter

Karpenter is the modern replacement for Cluster Autoscaler. Instead of working through node groups, Karpenter directly provisions nodes that match pending pod requirements.

How it works

pending pod
    ↓
Karpenter sees it (via direct apiserver watch)
    ↓
Karpenter picks the best instance type for the pod's requirements
    ↓
Karpenter calls the cloud API to provision a node
    ↓
node boots, kubelet joins, pod schedules

Karpenter advantages over CA

Faster — 30-60s to scale up, vs 3-5min for CA
Right-sized — Karpenter picks the instance type, not the node group
Spot-friendly — Karpenter has built-in spot, on-demand, and capacity-optimized strategies
Consolidation — Karpenter actively rebalances workloads to fewer, better-fit nodes
Simpler — no node groups, no ASG configs

Karpenter NodePool

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64"]
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values: ["c", "m", "r"]
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values: ["4"]
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "1000"
    memory: 4000Gi
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h

When to use Karpenter

New clusters — start with Karpenter, skip CA entirely
Spot-heavy workloads — Karpenter’s spot handling is much better than CA’s
Mixed instance types — Karpenter picks the best fit; no need to pre-define node groups
Cost optimization — Karpenter’s consolidation actively reduces nodes
Large scale (1000+ nodes) — Karpenter scales better than CA

Karpenter gotchas

Karpenter doesn’t replace all of CA — for some cloud features, you still need node groups.
Consolidation can be aggressive — pods get evicted to bin-pack. Use PodDisruptionBudgets.
Instance type requirements must be specified carefully. Too narrow = no nodes found. Too broad = expensive nodes.
Node expiration — expireAfter evicts pods after N hours. Plan for stateful workloads.

KEDA — Kubernetes Event-Driven Autoscaling

KEDA extends HPA to scale based on event sources beyond CPU/memory.

What KEDA can scale on

Message queues — Kafka, RabbitMQ, SQS, Azure Service Bus, GCP Pub/Sub
Databases — PostgreSQL, MongoDB, Redis (query lag, queue depth)
Cron — schedule-based scaling
Metrics — Prometheus, Datadog, CloudWatch
HTTP — request rate
Webhooks — custom event sources

Example: Kafka lag-based scaling

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaler
spec:
  scaleTargetRef:
    name: kafka-consumer
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092
      consumerGroup: my-consumer-group
      topic: orders
      lagThreshold: "100"     # scale up if any partition lag > 100

KEDA watches the Kafka consumer group lag, scales the Deployment to keep lag under 100.

When to use KEDA

Async workloads — workers consuming from queues
Batch processing — scale to N workers when N records are pending
Bursty traffic from external sources — webhooks, events
Replacing custom code — instead of writing your own scaler

KEDA gotchas

KEDA is a separate operator. Adds operational overhead.
Scaling to zero is supported, but cold-start time matters.
Stateful workloads — KEDA scale-to-zero kills the workload. Make sure state is external.
Some scalers need credentials — Kafka SASL, AWS IAM, etc.

Putting it together

The right combination for most production clusters:

┌────────────────────────────────────────────────────────────────┐
│                                                                │
│  KEDA (event-driven scaling for async workers)                 │
│    └─> scales kafka-consumer, cron workers, webhook handlers   │
│                                                                │
│  HPA (CPU/memory/custom metrics for stateless services)        │
│    └─> scales web, api based on CPU or RPS                     │
│                                                                │
│  VPA (recommendation mode for rightsizing)                     │
│    └─> outputs recommendations; humans set values              │
│                                                                │
│  Karpenter (cluster scale)                                      │
│    └─> adds/removes nodes based on pending pods                │
│                                                                │
│  PodDisruptionBudgets (safety net)                              │
│    └─> prevents scaling from killing too many pods at once     │
│                                                                │
└────────────────────────────────────────────────────────────────┘

Anti-patterns to avoid

HPA + VPA on the same metric — they’ll fight. Use one or the other, or VPA in Off mode.
CA + Karpenter — never both. Pick one.
No PodDisruptionBudgets — Karpenter’s consolidation can evict all your pods at once.
Aggressive scale-up + slow scale-down — flapping. Set stabilization windows.
Scaling on CPU only for memory-bound apps — scale on memory, or RPS, or queue depth.
Setting requests too low — HPA computes utilization as usage / request. If request is too low, you always look at 100% utilization.

The decision tree

Q: What's the workload pattern?
│
├── Steady load, scales predictably     → HPA (CPU/memory)
│
├── Bursty load, event-driven           → KEDA (queue, cron, etc.)
│
├── Latency-sensitive, can scale out    → HPA (RPS, custom metric)
│
├── Right-sizing new workload           → VPA (Off mode, then manual)
│
└── Need to add/remove nodes            → Karpenter (modern) or CA (legacy)

PodDisruptionBudgets — the safety net

PDB is required when you have any autoscaling or disruption happening. It tells Kubernetes: “during voluntary disruption, keep at least N pods running.”

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2          # always keep 2 pods
  # or
  maxUnavailable: 1        # never have more than 1 down
  selector:
    matchLabels:
      app: web

This is critical for Karpenter consolidation. Without PDB, Karpenter can evict all your pods simultaneously, causing a brief outage.

Metrics to monitor

kube_horizontalpodautoscaler_status_current_replicas — HPA’s current target
kube_horizontalpodautoscaler_status_desired_replicas — what HPA wants
cluster_autoscaler_nodes_total / Karpenter equivalent — node count
Pod scheduling latency — how long pending pods wait
Scale-up/down events — alert if HPA is flapping

Common gotchas

HPA needs metrics-server. Without it, HPA shows <unknown>/70% and doesn’t scale.
HPA can only scale on metrics it can read. CPU/memory is built-in. RPS, queue depth, etc. need adapters.
HPA won’t scale to zero by default. Use KEDA for that.
VPA + HPA conflict on the same metric. Don’t run both on CPU/memory.
Karpenter’s consolidation is aggressive. Always set PDBs.
Cluster Autoscaler takes minutes to scale up. Plan for it. Karpenter is faster.
Spot instances + stateful workloads is risky. Use StatefulSets carefully with spot.
Resource requests are the foundation. Without them, HPA is useless, VPA is guessing, and the scheduler doesn’t know what to do.
minReplicas: 0 requires a custom metrics adapter that supports scaling to zero (KEDA, KNative, etc.). Default HPA cannot scale to zero.
HPA and PDB don’t always play nicely. If HPA scales down to minReplicas and PDB says minAvailable: minReplicas, the system can deadlock during voluntary disruption. Set PDB conservatively.
Karpenter doesn’t manage existing node groups — only the nodes it provisions. Mixed clusters are fine but be aware.

A worked example

Cluster: 100 RPS average, 1000 RPS peak. Stateless web service.

# HPA: scale on CPU
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 5
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target: { type: Utilization, averageUtilization: 70 }
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 300

# Deployment: requests match HPA's math
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 5
  template:
    spec:
      containers:
      - name: web
        image: myorg/web:v1
        resources:
          requests:
            cpu: 500m    # each pod wants half a core
            memory: 512Mi
          limits:
            cpu: 1
            memory: 1Gi

# Karpenter NodePool: add nodes when pods don't fit
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values: ["c", "m"]
      nodeClassRef:
        name: default
  limits:
    cpu: "200"
    memory: 800Gi
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h

# PDB: keep at least 3 pods during disruption
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: web

Together: at low load, 5 pods on 2 nodes. At 1000 RPS, HPA scales to ~30 pods, Karpenter adds 4-6 nodes. Karpenter’s consolidation moves pods to fewer, larger nodes during low-load periods. PDB prevents Karpenter from killing all the pods at once.

cloudnative wiki

Explorer

Auto-Scaling

The four scalers

HPA — Horizontal Pod Autoscaler

The algorithm

Basic HPA

What HPA needs

Common HPA mistakes

HPA on custom / external metrics

VPA — Vertical Pod Autoscaler

Three modes

Basic VPA

When to use VPA

When NOT to use VPA

VPA + HPA: the right combination

Cluster Autoscaler (CA)

How it works

When to use CA

CA config

CA gotchas

Karpenter

How it works

Karpenter advantages over CA

Karpenter NodePool

When to use Karpenter

Karpenter gotchas

KEDA — Kubernetes Event-Driven Autoscaling

What KEDA can scale on

Example: Kafka lag-based scaling

When to use KEDA

KEDA gotchas

Putting it together

Anti-patterns to avoid

The decision tree

PodDisruptionBudgets — the safety net

Metrics to monitor

Common gotchas

A worked example

See also

Graph View

Table of Contents

Backlinks