Cluster Autoscaler (CA)
“https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler”
The Cluster Autoscaler (CA) adjusts the size of a cluster’s node groups based on the number of unschedulable Pods and the utilization of existing nodes. It’s the older, more conservative alternative to Karpenter: you predefine node groups, CA scales each one between its min and max, and it supports most major cloud providers.
Table of Contents
- What CA Solves
- CA vs Karpenter
- Architecture
- Node Groups and ASGs
- The Scale-Up Logic
- The Scale-Down Logic
- Configuration
- Cloud-Specific Integration
- Spot, GPU, and Heterogeneous Workloads
- Migration to Karpenter
- Operations and Debugging
- Gotchas and Common Mistakes
1. What CA Solves
When a Pod can’t be scheduled because no node has enough resources, you need more nodes. CA watches for this state and adds nodes automatically.
When a node is underutilized and its Pods can fit on other nodes, you can remove it to save money. CA does this too.
Cluster Autoscaler (in cluster)
│
│ Watches
▼
Pending Pods (unschedulable)
│
│ "I see 5 Pending Pods that don't fit on any current node"
│ "I see that I can launch an m5.large via the node group"
│
│ Calls
▼
Cloud Autoscaling API (ASG, MIG, VMSS)
│
│ "Add 1 m5.large to the node group"
│
│ New instance launches, joins cluster
│ Pods get scheduled
CA is mature, well-known, and works on every cloud. It’s not as fast or efficient as Karpenter, but it’s the safe default for many setups.
2. CA vs Karpenter
See Karpenter for the full comparison. Quick summary:
| | CA | Karpenter |
|---|---|---|
| Model | Node groups with min/max | Pod-driven, dynamic instance selection |
| Cold start | 2-3 min | 30-60s |
| Instance diversity | Per node group | Across the whole NodePool |
| Consolidation | Conservative | Aggressive |
| Cloud support | All | AWS first; GKE, Azure in progress |
| Recommendation | Stable, mature | New clusters, dynamic workloads |
Pick CA if:
- You have a stable workload pattern and want predictable node group sizes.
- You’re on a platform where Karpenter isn’t ready (e.g. on-prem, or a cloud without a mature Karpenter provider).
- You need mature, well-tested behavior.
Pick Karpenter if:
- You have heterogeneous workloads (different instance types needed).
- You want fast scale-up (roughly 30-60s vs 2-3 min).
- You’re on AWS.
3. Architecture
CA runs as a single Deployment in the cluster (with leader election for HA):
┌────────────────────────────────────────────────────────────┐
│ cluster-autoscaler Deployment │
│ │
│ - Leader-elected (1 active, 1+ standby) │
│ - Watches Pods, Nodes, node group sizes │
│ - Decides scale-up / scale-down │
│ - Calls cloud APIs (ASG, MIG, VMSS) │
└────────────────────────────────────────────────────────────┘
▲
│ Reads cloud config from flags
│
▼
Cloud provider's autoscaling API
- AWS: ASG (Auto Scaling Group)
- GCP: MIG (Managed Instance Group)
- Azure: VMSS (Virtual Machine Scale Set)
CA is a single binary (no operator, no CRDs, no webhooks). Config is via flags and a ConfigMap. State is in the apiserver and the cloud.
3.1 The CA image
CA is shipped as a container image. The cluster runs it as a Deployment with:
- ServiceAccount with RBAC permission to read Pods and Nodes, evict Pods, and update Nodes.
- IAM role (cloud) with permission to call the autoscaling API.
- ConfigMap (optional) for some tunings.
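On EKS, the ServiceAccount and the IAM role are commonly tied together with IAM Roles for Service Accounts (IRSA). A minimal sketch, assuming an EKS cluster with an OIDC provider already set up; the account ID and role name are placeholders, not values from this guide:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  annotations:
    # Placeholder ARN: point this at the role that carries the autoscaling permissions
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/cluster-autoscaler
```

The Deployment then references it with serviceAccountName: cluster-autoscaler. The RBAC objects (ClusterRole and binding) are left out of this sketch.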
4. Node Groups and ASGs
CA scales node groups (a cloud concept) up and down. Each node group has a min size, a max size, and an instance type.
4.1 AWS: Auto Scaling Group (ASG)
resource "aws_autoscaling_group" "workers" {
name = "k8s-workers"
min_size = 2
max_size = 20
desired_capacity = 2
vpc_zone_identifier = [var.subnet_a, var.subnet_b]
launch_template {
id = aws_launch_template.workers.id
version = "$Latest"
}
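tag {
key = "k8s.io/cluster-autoscaler/enabled"
value = "true"
propagate_at_launch = false
}
tag {
key = "k8s.io/cluster-autoscaler/my-cluster"
value = "owned"
propagate_at_launch = false
}
# ...
}
The tags k8s.io/cluster-autoscaler/enabled and k8s.io/cluster-autoscaler/<cluster-name> (here my-cluster, with value owned) tell CA “this ASG is yours to scale.”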
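On the CA side, tag-based discovery is switched on with --node-group-auto-discovery. A sketch of the relevant container arguments, assuming the cluster is called my-cluster and the ASG carries both tag keys:

```yaml
# Fragment of the cluster-autoscaler container spec
command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  # CA manages every ASG that carries ALL of the listed tag keys
  - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
```

In this form only the tag keys are matched, so the tag values are free-form. The alternative is to list groups statically with --nodes=<min>:<max>:<asg-name>.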
4.2 GCP: Managed Instance Group (MIG)
resource "google_container_node_pool" "workers" {
name = "workers"
cluster = google_container_cluster.primary.name
initial_node_count = 1 # node_count should not be combined with autoscaling
autoscaling {
min_node_count = 1
max_node_count = 20
}
management {
auto_repair = true
auto_upgrade = true
}
}
GKE’s node pool is backed by MIGs (one per zone). CA scales it via the GKE API.
4.3 Azure: Virtual Machine Scale Set (VMSS)
resource "azurerm_kubernetes_cluster_node_pool" "workers" {
name = "workers"
kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
vm_size = "Standard_D2s_v5"
node_count = 1
auto_scaling_enabled = true
min_count = 1
max_count = 20
}
AKS’s node pool is the VMSS. CA scales it.
4.4 The min/max pattern
A node group has a min and a max, and CA scales between them by changing the group’s desired capacity. The desired_capacity you configure is only the starting size.
min_size = 2 # CA will never go below 2
max_size = 20 # CA will never go above 20
desired_capacity = 2 # initial size
Min is a floor: CA won’t scale down below it. Max is a ceiling: CA won’t scale up above it. Both are operational levers you set deliberately.
5. The Scale-Up Logic
CA scans for unschedulable Pods every --scan-interval (default 10s):
- Find unschedulable Pods. These are Pending Pods that the scheduler has marked unschedulable. By default CA considers them on the next scan; --new-pod-scale-up-delay (default 0s) can make it ignore Pods younger than a given age.
- Simulate adding nodes from each node group. For each group, assume one node is added; see how many Pods would be scheduled.
- Pick the best node group based on the configured expander:
  - least-waste: minimizes wasted resources (default)
  - priority: uses a priority list (from a ConfigMap)
  - random: picks randomly
  - most-pods: maximizes Pods scheduled
  - price: cheapest first (implemented only for some providers, notably GCE/GKE; not AWS)
- Call the cloud API to add the node.
- Wait for the node to join. The kubelet registers, the CNI sets up networking. If the node has not registered within --max-node-provision-time (default 15 min), CA treats the scale-up as failed and may try another group.
- Pods get scheduled.
The total time from “Pending Pod detected” to “Pod scheduled on a new node” is typically 2-3 minutes (ASG launch + boot + kubelet registration + CNI setup).
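5.1 The “Pending for 15 min” gotcha
CA does not wait before reacting to Pending Pods; it acts on the next scan. The 15 minutes come from --max-node-provision-time (default 15 min), which is how long CA waits for a requested node to register before it treats the scale-up as failed and considers other node groups. If a group cannot deliver nodes (no capacity, broken launch template), Pods can sit Pending for that whole window.
For latency-sensitive scaling, lower this, but keep it above your normal node boot time:
# cluster-autoscaler flags
- --max-node-provision-time=5m
The opposite knob, delaying scale-up so that brief load spikes don’t add nodes, is --new-pod-scale-up-delay (default 0s). It can also be set on a per-Pod basis with the cluster-autoscaler.kubernetes.io/pod-scale-up-delay annotation.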
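As a sketch of the per-Pod annotation mentioned above: the Pod name, image, and the 300s value are illustrative assumptions; the annotation key is the one CA reads, and its value is a duration string.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker                  # hypothetical name
  annotations:
    # Ask CA not to scale up for this Pod until it is at least 5 minutes old
    cluster-autoscaler.kubernetes.io/pod-scale-up-delay: "300s"
spec:
  containers:
    - name: worker
      image: registry.example.com/batch-worker:1.0   # placeholder image
      resources:
        requests:
          cpu: "500m"
          memory: 512Mi
```

For Pods created by a Deployment or Job, the annotation belongs on the Pod template (spec.template.metadata.annotations), not on the controller object itself.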
5.2 The “bin-packing is poor” issue
CA simulates adding one node of a known type. If your workloads are heterogeneous, you’ll end up with:
- A p3.8xlarge node for the GPU workload (wasting 30 of 32 vCPUs).
- An m5.large node for the small service.
- A separate r5.2xlarge node for the memory-hungry service.
Karpenter is better at this — it can pick any instance type for any Pod.
6. The Scale-Down Logic
CA scans for underutilized nodes every --scan-interval:
- Find nodes that are candidates for removal. A node is a candidate if:
  - No scale-up has happened in the last --scale-down-delay-after-add (default 10 min). This is a cluster-wide cool-down, not a per-node age check.
  - It has been underutilized for at least --scale-down-unneeded-time (default 10 min).
  - All its Pods can be rescheduled on other nodes.
- Simulate removing the node. Re-schedule each Pod on remaining nodes (using the scheduler’s actual logic).
- If all Pods can be rescheduled, drain and remove the node.
A node is “underutilized” if the sum of its Pods’ requests (not their actual usage) is less than 50% of the node’s allocatable capacity. For example, a node with 4 allocatable vCPUs whose Pods request 1.5 vCPUs in total sits at 37.5% on CPU. This threshold is configurable with --scale-down-utilization-threshold.
6.1 The PDB interaction
CA respects PodDisruptionBudgets. If a node’s Pods can’t be rescheduled without violating a PDB, the node isn’t a scale-down candidate.
# PDB
spec:
  minAvailable: 2
# Deployment
spec:
  replicas: 2
With minAvailable: 2 and replicas: 2, the PDB allows zero disruptions, so CA can’t evict either Pod. Any node running one of them is never scaled down.
This is a common CA deadlock: a tight PDB + low replicas = stuck nodes.
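One way out of the deadlock is to leave room for at least one disruption. A sketch with a hypothetical app: web label and placeholder image; the point is only the relationship between the budget and the replica count:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb                       # hypothetical name
spec:
  maxUnavailable: 1                   # permits one eviction at a time while all replicas are healthy
  selector:
    matchLabels:
      app: web
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3                         # more replicas than the budget needs to keep available
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.0   # placeholder image
```

kubectl get pdb shows an ALLOWED DISRUPTIONS column; a value of 0 there means CA cannot drain the nodes those Pods run on.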
6.2 The drain process
When CA decides to remove a node:
- Cordon the node (no new Pods).
- Evict the node’s Pods through the Eviction API (respects PDBs).
- Wait for the Pods to terminate (respects terminationGracePeriodSeconds).
- Call the cloud API to remove the instance.
The wait is bounded by --max-graceful-termination-sec (default 600s = 10 min). If a Pod has not terminated within that time, CA removes the node anyway, so a Pod that asks for a longer grace period gets less than it requested.
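The grace period CA honors comes from the Pod spec. A sketch of a Pod template fragment, with a hypothetical container name and placeholder image, that gives the container 60 seconds to shut down, well inside the 600s cap:

```yaml
# Fragment of a Deployment's spec.template
spec:
  terminationGracePeriodSeconds: 60   # keep this below --max-graceful-termination-sec
  containers:
    - name: api                       # hypothetical container
      image: registry.example.com/api:1.0   # placeholder image
      lifecycle:
        preStop:
          exec:
            # Pause briefly so load balancers stop sending traffic before SIGTERM arrives
            command: ["sh", "-c", "sleep 10"]
```

The preStop hook runs inside the grace period, so the 10 seconds here count against the 60. The exec form assumes the image ships a shell.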
7. Configuration
CA’s configuration is mostly flags to the Deployment:
spec:
  containers:
    - name: cluster-autoscaler
      image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0
      command:
        - ./cluster-autoscaler
        - --v=4
        - --cloud-provider=aws
        - --cluster-name=my-cluster
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
        - --expander=least-waste
        - --balance-similar-node-groups=true
        - --max-node-provision-time=5m
        - --scale-down-delay-after-add=5m
        - --scale-down-unneeded-time=5m
        - --scale-down-utilization-threshold=0.5
        - --skip-nodes-with-local-storage=false
        - --skip-nodes-with-system-pods=true
      env:
        # The AWS region is read from the environment; CA has no --region flag
        - name: AWS_REGION
          value: us-east-1
7.1 Key flags
| Flag | Default | What it does |
|---|---|---|
| --cloud-provider | (required) | aws, gce, azure, digitalocean, etc. |
| --cluster-name | (required) | Name of the autoscaled cluster; some providers use it to find the right node groups |
| --expander | least-waste | How to choose which node group to scale up |
| --max-node-provision-time | 15m | How long CA waits for a requested node to register before giving up on it |
| --scale-down-delay-after-add | 10m | After any scale-up, wait this long before evaluating scale-down again |
| --scale-down-unneeded-time | 10m | Don’t scale down until node has been unneeded for this long |
| --scale-down-utilization-threshold | 0.5 | Below this fraction of allocatable capacity (by requests), the node is “unneeded” |
| --balance-similar-node-groups | false | Try to keep similar node groups balanced |
| --skip-nodes-with-local-storage | true | Never remove nodes running Pods that use emptyDir / hostPath volumes |
| --skip-nodes-with-system-pods | true | Never remove nodes running kube-system Pods (DaemonSet and mirror Pods excepted) |
7.2 The expander
The expander chooses which node group to scale up when there are multiple choices:
- least-waste (default): minimizes wasted CPU/memory. Best for cost.
- priority: uses a ConfigMap with priorities.
- random: picks randomly.
- most-pods: picks the group that schedules the most Pods.
- price: picks the cheapest group. Implemented only for some providers (notably GCE/GKE), not AWS; on AWS, express a cost preference with priority instead.
The priority expander needs a ConfigMap:
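apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    10:
      - .*t2\.large.*
    50:
      - .*m5\..*
    90:
      - .*p3\..* # highest value, so tried first
Higher priority = scaled up first. In this example the p3 (GPU) groups win whenever they can fit the Pods; give them the lowest number if they should be a last resort.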
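The ConfigMap only takes effect when the expander flag selects priority. A sketch of the arguments; chaining a second expander as a tie-breaker is described upstream for CA 1.23 and later, so treat that part as an assumption to verify against the version you run:

```yaml
# Fragment of the cluster-autoscaler container spec
command:
  - ./cluster-autoscaler
  # priority decides first; least-waste breaks ties between groups of equal priority
  - --expander=priority,least-waste
```

With a single value (--expander=priority), remaining ties are broken randomly.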
8. Cloud-Specific Integration
8.1 AWS
# IAM policy
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"autoscaling:DescribeAutoScalingGroups",
"autoscaling:DescribeAutoScalingInstances",
"autoscaling:DescribeLaunchConfigurations",
"autoscaling:DescribeScalingActivities",
"autoscaling:SetDesiredCapacity",
"autoscaling:TerminateInstanceInAutoScalingGroup",
"ec2:DescribeImages",
"ec2:DescribeInstanceTypes",
"ec2:DescribeLaunchTemplateVersions",
"ec2:DescribeSubnets",
"ec2:DescribeSecurityGroups"
],
"Resource": "*"
}
]
}
8.2 GCP
GKE has built-in CA support. The node pool is the MIG.
8.3 Azure
AKS has built-in CA support. The node pool is the VMSS.
9. Spot, GPU, and Heterogeneous Workloads
9.1 Spot
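CA supports Spot via a mixed instances policy in the ASG. The ASG lists multiple instance types, and its allocation strategy (not CA) picks which one to launch; CA only changes the desired capacity. Because CA simulates a single node shape per group, the types in one ASG should have similar CPU and memory.
For interruption handling, you need a separate component such as the AWS Node Termination Handler (Karpenter handles interruptions itself).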
9.2 GPU
GPU workloads need a separate node group with the right instance type (e.g. p3.2xlarge). The Pod has a nodeSelector on node.kubernetes.io/instance-type: p3.2xlarge (or an equivalent label), and CA scales that group.
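A sketch of a Pod that targets such a group, using the well-known node.kubernetes.io/instance-type label. It assumes the NVIDIA device plugin is installed (it is what exposes nvidia.com/gpu) and that the GPU nodes may carry a taint with that key; the Pod name and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer                       # hypothetical name
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: p3.2xlarge
  tolerations:
    # Only needed if the GPU nodes are tainted (assumed taint key)
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: trainer
      image: registry.example.com/trainer:1.0   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1           # extended resource from the device plugin
```

While this Pod is unschedulable, the GPU group is the only one whose simulated node satisfies both the selector and the GPU request, so that is the group CA grows.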
resource "aws_autoscaling_group" "gpu" {
name = "k8s-gpu"
# ...
mixed_instances_policy {
instances_distribution {
on_demand_base_capacity = 0
on_demand_percentage_above_base_capacity = 0
spot_allocation_strategy = "lowest-price"
}
}
}
9.3 Heterogeneous workloads
The CA approach:
- One node group per workload type (small / medium / large / GPU / ARM).
- Each with its own min / max.
- Pods use nodeSelector or nodeAffinity to land on the right group.
This works but creates node group sprawl. Karpenter handles this with a single NodePool + requirements.
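A sketch of the Pod side of this pattern, assuming each node group labels its nodes with a hypothetical workload-class label (the key and values are illustrative, not something CA defines):

```yaml
# Fragment of a Pod spec
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: workload-class   # hypothetical label set per node group
                operator: In
                values: ["medium", "large"]
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
```

On AWS, a group that can scale to zero has no live node for CA to read labels from, so the ASG also needs a k8s.io/cluster-autoscaler/node-template/label/workload-class tag carrying the matching value.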
10. Migration to Karpenter
The standard migration:
- Install Karpenter alongside CA. Don’t disable CA yet.
- Create a NodePool that mirrors one of the existing ASGs (instance types, AZs, capacity type).
- Cordon the existing ASG’s nodes so new Pods don’t go there.
- Test — let Karpenter handle new Pods, scale the old ASG down.
- Repeat for other ASGs.
- Delete CA once all ASGs are migrated.
- Let Karpenter consolidate — old nodes get replaced with Karpenter-launched ones.
Don’t run both CA and Karpenter at full scale. They race for the same Pods.
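For the “mirror one of the existing ASGs” step, a NodePool sketch that roughly matches the k8s-workers ASG from section 4.1. It assumes the karpenter.sh/v1 API on AWS, an existing EC2NodeClass named default, m5.large as the instance type, and two us-east-1 zones; all of these are assumptions to adapt, and field names differ in older Karpenter API versions:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: workers                       # hypothetical, mirrors the k8s-workers ASG
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                 # assumes this EC2NodeClass already exists
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m5.large"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a", "us-east-1b"]
  limits:
    cpu: "40"                         # rough ceiling: 20 x m5.large at 2 vCPUs each
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```

The limits block plays the role of the ASG’s max_size. There is no direct equivalent of min_size, which is one of the behavioral differences to test during the migration.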
11. Operations and Debugging
11.1 Common commands
# check CA
kubectl -n kube-system get pods -l app=cluster-autoscaler
kubectl -n kube-system logs -l app=cluster-autoscaler --tail=100
# check node group status
kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml
# shows "NodeGroupHealth" sections
# describe a node
kubectl describe node <name>
# look at "Annotations" for cluster-autoscaler.kubernetes.io/* fields
11.2 The “CA didn’t scale up” checklist
# 1. Are there unschedulable Pods?
kubectl get pods -A | grep Pending
# 2. Is CA deliberately ignoring young Pods?
# only if --new-pod-scale-up-delay (default 0s) or the pod-scale-up-delay annotation is set
kubectl get pod <pod> -o jsonpath='{.metadata.creationTimestamp}'
# 3. Can the Pod fit on any existing node?
kubectl describe pod <pod>
# look at events for "FailedScheduling"
# 4. Are the node groups' max sizes hit?
# check the cloud's autoscaling console
# 5. Is CA running and leader-elected?
kubectl -n kube-system logs -l app=cluster-autoscaler | grep -i leader
# 6. Are there enough IP addresses in the subnets?
# (for VPC CNI, the most common cause of stuck scale-up)
11.3 The “CA didn’t scale down” checklist
# 1. Was there a recent scale-up?
# scale-down is paused for --scale-down-delay-after-add (default 10m) after any scale-up
# 2. Are the Pods on the node reschedulable?
kubectl describe node <node>
# look at the Pods list
# 3. Are PDBs blocking?
kubectl get pdb -A
# a tight PDB can prevent scale-down
# 4. Is the node running system Pods?
# CA skips nodes with kube-system Pods (--skip-nodes-with-system-pods=true)
12. Gotchas and Common Mistakes
12.1 The 20+ common mistakes
- CA and Karpenter at the same time. They race. Pick one.
- --max-node-provision-time is 15 min by default. It is how long CA waits for a requested node to register, not a delay before it reacts to Pending Pods. Lower it if failed provisioning leaves Pods Pending for too long.
- Tight PDBs + low replicas = CA stuck. A PDB with minAvailable: 2 and a Deployment with replicas: 2 blocks scale-down forever.
- CA doesn’t drain nodes whose Pods use local storage (emptyDir, hostPath) by default. Set --skip-nodes-with-local-storage=false if you want to drain them.
- CA doesn’t drain nodes with kube-system Pods by default. Set --skip-nodes-with-system-pods=false if you want to.
- CA scans every 10s but waits 10 min before scaling down. A brief dip in utilization doesn’t trigger scale-down.
- CA’s scale-down is “conservative”. It only removes nodes that are clearly underutilized and whose Pods fit on the remaining nodes.
- Mixed instance policy in the ASG vs single instance type. Mixed gives more Spot diversification but CA’s bin-packing is less predictable.
- CA doesn’t handle the case where one ASG is in a different region. All ASGs must be in the same region.
- The leader election means only one CA is active. If the leader dies, a standby takes over after ~30s. No scale events during the transition.
- CA’s metrics are limited. It exports Prometheus metrics on :8085. Scrape them, but they’re not as detailed as Karpenter’s.
- CA’s IAM role needs autoscaling:SetDesiredCapacity and autoscaling:TerminateInstanceInAutoScalingGroup. Without these, CA logs “access denied” and does nothing.
- The tag keys passed to --node-group-auto-discovery must match the ASG’s tags, including the k8s.io/cluster-autoscaler/<cluster-name> tag. Mismatch = CA ignores the ASG.
- CA doesn’t manage system node groups (control plane, etcd). It scales worker node groups only.
- Scaling an ASG up from 0 nodes needs extra setup. With no live node to inspect, CA builds a node template from the launch template and from k8s.io/cluster-autoscaler/node-template/... tags on the ASG. If Pods select on custom labels or taints and those tags are missing, CA won’t scale the group up; tag the ASG or set min to 1 or higher.
- CA’s --max-graceful-termination-sec is the upper bound for drain time. If a Pod has terminationGracePeriodSeconds: 600, CA waits up to 600s for the drain.
- CA’s expander is set once per cluster. You can’t have different expanders for different node groups.
- CA is per-cluster, not multi-cluster. Multi-cluster is out of scope.
- CA’s logs are verbose at --v=4. Use --v=2 in production for less noise, and bump to 4-6 for debugging.
- CA doesn’t have a CRD API. All config is flags. Changing a setting requires a Deployment rollout.
- CA’s expander: priority ConfigMap must be named cluster-autoscaler-priority-expander in kube-system. Different name = ignored.
- New ASGs are only picked up automatically with tag-based auto-discovery. In that mode you don’t need to restart CA when you add an ASG, but the tags must be set. ASGs listed statically with --nodes need a Deployment rollout.
- The AWS EKS managed node groups have built-in CA integration. The node group’s min/max is what CA uses.
- CA’s --balance-similar-node-groups distributes scale-up across similar groups. Without it, one group gets all the scale-up until it hits max.
- CA’s log format is a bit cryptic. The lines are usually scaleUp: group X, node Y, reason Z, .... Parse them carefully.
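For the metrics gotcha above, a sketch of Pod template annotations that many Prometheus setups use for discovery. The prometheus.io/* keys are a scrape-config convention rather than a Kubernetes or CA API, so they only work if your Prometheus is configured to honor them; with the Prometheus Operator you would use a PodMonitor or ServiceMonitor instead:

```yaml
# Fragment of the cluster-autoscaler Deployment
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8085"    # CA's default metrics port
        prometheus.io/path: /metrics
```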
See also
- Karpenter — the modern alternative
- Scaling — L06 overview
- Cluster Autoscaler on EKS — EKS-specific install
- Karpenter on EKS — EKS-specific install