IPVS (kube-proxy mode)
“https://kubernetes.io/docs/reference/networking/virtual-ips/”
IPVS (IP Virtual Server) is one of the modes kube-proxy can use to implement Service virtual IPs. It’s a Linux kernel feature that does L4 load balancing more efficiently than iptables for large clusters.
Table of Contents
- Why iptables breaks at scale
- How IPVS works differently
- The dummy interface trick
- Packet walkthrough: Service ClusterIP
- Enabling IPVS
- IPVS schedulers
- IPVS and conntrack
- strictARP explained
- Comparison: iptables vs IPVS
- Performance numbers at scale
- Debugging with ipvsadm
- kube-proxy ConfigMap reference
- Prerequisites and kernel modules
- CNI compatibility
- Rollback from IPVS to iptables
- When to use each mode
- Gotchas
1. Why iptables breaks at scale
The default kube-proxy mode is iptables. For each Service, kube-proxy installs iptables rules:
# kube-proxy iptables rules (simplified)
-A KUBE-SERVICES -d 10.96.0.42/32 -p tcp --dport 80 -j KUBE-SVC-XXX
-A KUBE-SVC-XXX -m statistic --mode random --probability 0.333333 -j KUBE-SEP-AAA
-A KUBE-SVC-XXX -m statistic --mode random --probability 0.500000 -j KUBE-SEP-BBB
-A KUBE-SVC-XXX -j KUBE-SEP-CCC
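The probabilities in these rules are not arbitrary. Rules are tried in order, so rule i of n needs probability 1/(n-i) for every endpoint to end up equally likely. A minimal Python sketch of that arithmetic follows; the function names are illustrative and not part of kube-proxy.

```python
from fractions import Fraction


def rule_probabilities(n):
    """Per-rule match probability for n endpoints, in rule order."""
    # Rule i (0-based) matches with probability 1/(n-i); the last rule always matches.
    return [Fraction(1, n - i) for i in range(n)]


def effective_distribution(probs):
    """Overall chance of landing on each endpoint when rules are tried in order."""
    remaining = Fraction(1)  # probability that no earlier rule matched
    result = []
    for p in probs:
        result.append(remaining * p)
        remaining *= 1 - p
    return result


if __name__ == "__main__":
    probs = rule_probabilities(3)
    print("per-rule:", [str(p) for p in probs])
    print("effective:", [str(p) for p in effective_distribution(probs)])
```

For three endpoints this reproduces the 0.333333 / 0.500000 / unconditional pattern shown in the rules, and each endpoint receives one third of new connections on average.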
Problems at scale:
| Problem | Symptom |
|---|---|
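| Linear chain traversal | The first packet of every new connection is checked against the KUBE-SERVICES rules one by one until its Service matches |
| O(n) updates | Adding or removing an endpoint makes kube-proxy reload rules through iptables-restore, so sync time grows with the number of Services and endpoints |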
| CPU overhead | At 10,000+ Services, CPU spent in netfilter/iptables is measurable |
| No consistent hashing | random --probability is statistical, not deterministic |
| Memory pressure | Large iptables rulesets use significant kernel memory |
The breaking point is roughly 1,000–5,000 Services. Below that, iptables is fine.
2. How IPVS works differently
IPVS uses kernel hash tables instead of linear chains. Service lookups are O(1):
┌─────────────────────────────────┐
│ IPVS hash table │
│ │
Service ClusterIP ─►│ key: 10.96.0.42:80 │
│ entries: [backend1, backend2] │
│ scheduler: rr │
└─────────────────────────────────┘
│
▼
┌─────────────────────────────────┐
│ Selected backend: │
│ 10.244.1.5:80 (DNAT) │
└─────────────────────────────────┘
IPVS is a Linux kernel L4 load balancer. It’s part of the ip_vs kernel module and runs entirely in kernel space — no userspace round-trips per packet.
kube-proxy (userspace)
│
│ configures
▼
IPVS (kernel space)
│
│ DNATs packets
▼
Backend Pod
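To make the O(n) versus O(1) contrast concrete, here is a small Python sketch that compares scanning an ordered rule list with a single dictionary probe. It is a toy model only: the Service addresses are generated for the demo, and a Python dict merely stands in for the kernel's IPVS hash table.

```python
def linear_lookup(rules, key):
    """Scan rules in order, the way a chain of per-Service rules is evaluated.

    Returns (backends, rules_examined).
    """
    for examined, (rule_key, backends) in enumerate(rules, start=1):
        if rule_key == key:
            return backends, examined
    return None, len(rules)


def hash_lookup(table, key):
    """Single dictionary probe, standing in for the IPVS hash table."""
    return table.get(key)


def build_services(count):
    """Generate `count` made-up (ip, port) -> backends entries for the demo."""
    services = []
    for i in range(count):
        key = (f"10.96.{i // 256}.{i % 256}", 80)
        services.append((key, [f"10.244.1.{i % 250 + 1}:80"]))
    return services


if __name__ == "__main__":
    rules = build_services(5000)
    table = dict(rules)
    last_key = rules[-1][0]
    backends, examined = linear_lookup(rules, last_key)
    print(f"linear scan examined {examined} rules")
    print(f"hash lookup agrees: {hash_lookup(table, last_key) == backends}")
```

The worst case for the list grows with the number of Services, while the dictionary probe stays constant on average, which is the property the diagram above describes.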
3. The dummy interface trick
kube-proxy in IPVS mode creates a dummy interface called kube-ipvs0:
ip addr show kube-ipvs0
# 5: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop
# inet 10.96.0.1/32 scope global kube-ipvs0
# valid_lft forever preferred_lft forever
# All ClusterIPs are assigned to this dummy interface
ip addr show kube-ipvs0
# inet 10.96.0.42/32 scope global kube-ipvs0
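# inet 10.96.0.55/32 scope global kube-ipvs0
Why a dummy interface? IPVS only handles packets that the kernel treats as locally destined. Assigning each ClusterIP to kube-ipvs0 makes the address local to the node, so matching packets are delivered to the local input path, where IPVS picks them up and redirects them to a backend. The interface never carries traffic itself, and it is flagged NOARP, so its purpose is not to answer ARP for ClusterIPs.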
4. Packet walkthrough: Service ClusterIP
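1. App in Pod A sends a TCP packet to 10.96.0.42:80 (the Service ClusterIP)
2. Pod A's namespace: the default route sends it out through the Pod's veth pair
3. The node's root namespace receives the packet on the host end of the veth
4. The destination is local (held by kube-ipvs0), so IPVS intercepts: dest = 10.96.0.42:80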
5. IPVS applies scheduler (e.g., round-robin)
6. IPVS DNATs: 10.96.0.42:80 → 10.244.2.7:80
7. Packet forwarded to Pod B's veth pair
8. Pod B receives packet (src = Pod A IP, dst = 10.244.2.7:80)
For return traffic, conntrack reverses the DNAT transparently.
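The walkthrough can be modelled in a few lines of Python. This is a toy sketch, not how the kernel implements it: the `ToyProxy` class, the source port 40000, and the round-robin choice are assumptions made for illustration. It shows the two halves of the flow: the destination is rewritten on the way in, and the recorded flow lets the reply be rewritten back.

```python
from itertools import cycle
from typing import NamedTuple


class Packet(NamedTuple):
    src: str
    sport: int
    dst: str
    dport: int


class ToyProxy:
    """Toy model of DNAT on the way in and reverse NAT on the way back."""

    def __init__(self, services):
        # services: {(cluster_ip, port): [(pod_ip, port), ...]}
        self._pick = {vip: cycle(backends) for vip, backends in services.items()}
        self._forward_flows = {}  # original packet tuple -> chosen backend
        self._reply_flows = {}    # expected reply tuple -> virtual address

    def forward(self, pkt):
        vip = (pkt.dst, pkt.dport)
        if vip not in self._pick:
            return pkt  # not a Service address, leave untouched
        backend = self._forward_flows.get(pkt)
        if backend is None:
            # New connection: schedule once, then remember the decision.
            backend = next(self._pick[vip])
            self._forward_flows[pkt] = backend
            self._reply_flows[(backend[0], backend[1], pkt.src, pkt.sport)] = vip
        return pkt._replace(dst=backend[0], dport=backend[1])

    def reply(self, pkt):
        vip = self._reply_flows.get((pkt.src, pkt.sport, pkt.dst, pkt.dport))
        if vip is None:
            return pkt
        # Restore the address the client originally connected to.
        return pkt._replace(src=vip[0], sport=vip[1])


if __name__ == "__main__":
    proxy = ToyProxy({("10.96.0.42", 80): [("10.244.2.7", 80)]})
    print(proxy.forward(Packet("10.244.1.5", 40000, "10.96.0.42", 80)))
    print(proxy.reply(Packet("10.244.2.7", 80, "10.244.1.5", 40000)))
```

From the client's point of view the reply comes from 10.96.0.42:80, which is why the application never sees the backend address.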
5. Enabling IPVS
Via kubeadm config (at cluster init)
# kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
networking:
  serviceSubnet: 10.96.0.0/12
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs
ipvs:
  scheduler: rr
  strictARP: false
  minSyncPeriod: 1s
  syncPeriod: 15s

kubeadm init --config=kubeadm-config.yaml
On an existing cluster
# 1. Edit the kube-proxy ConfigMap
kubectl edit configmap kube-proxy -n kube-system
# 2. Add or change the mode
data:
  config.conf: |
    apiVersion: kubeproxy.config.k8s.io/v1alpha1
    kind: KubeProxyConfiguration
    mode: ipvs
    ipvs:
      scheduler: rr
      strictARP: false
# 3. Restart kube-proxy to pick up the change
kubectl rollout restart daemonset kube-proxy -n kube-system
# 4. Verify the mode changed (run on a node; kube-proxy serves this on its metrics port, 10249 by default)
curl -s http://localhost:10249/proxyMode
# ipvs
# 5. Verify ipvs entries exist
ipvsadm -L -n
6. IPVS schedulers
| Scheduler | Name | Description | Best for |
|---|---|---|---|
| Round-robin | rr | Distributes evenly in turn | Default, most cases |
| Weighted round-robin | wrr | Distributes by weight | Heterogeneous backends |
| Least connection | lc | Picks least busy backend | Long-lived connections (HTTP keepalive) |
| Weighted least connection | wlc | Least connections, scaled by backend weight | Heterogeneous backends with long-lived connections |
| Destination hashing | dh | Hashes destination IP | Same destination always maps to the same backend |
| Source hashing | sh | Hashes source IP | Client stickiness |
| Shortest expected delay | sed | Minimizes (active_conns+1)/weight | |
| Never queue | nq | Never queue — assign idle backend first | Latency-sensitive |
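# Set scheduler in kube-proxy ConfigMap
ipvs:
  scheduler: sh  # source hashing for client stickiness
With source hashing (sh), connections from the same client IP map to the same backend for as long as the backend set is unchanged, which is useful when the backend maintains local state. dh hashes on the destination address instead, so it does not distinguish between clients of the same ClusterIP.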
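The selection rules in the table are easy to compare side by side. The Python sketch below models rr, lc, sed, and sh over an in-memory backend list. It is a simplification under stated assumptions: the `Backend` class and function names are hypothetical, and the real sh scheduler uses its own kernel hash table rather than CRC32 modulo the backend count.

```python
import zlib
from dataclasses import dataclass
from itertools import count


@dataclass
class Backend:
    addr: str
    weight: int = 1
    active_conns: int = 0


class RoundRobin:
    """rr: hand out backends in turn."""

    def __init__(self):
        self._next = count()

    def pick(self, backends):
        return backends[next(self._next) % len(backends)]


def least_connection(backends):
    """lc: backend with the fewest active connections."""
    return min(backends, key=lambda b: b.active_conns)


def shortest_expected_delay(backends):
    """sed: minimize (active_conns + 1) / weight."""
    return min(backends, key=lambda b: (b.active_conns + 1) / b.weight)


def source_hash(backends, client_ip):
    """sh (simplified): a stable hash of the client IP selects the backend."""
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    return backends[zlib.crc32(client_ip.encode()) % len(backends)]


if __name__ == "__main__":
    pool = [Backend("10.244.1.5:80"), Backend("10.244.2.7:80"), Backend("10.244.3.9:80")]
    rr = RoundRobin()
    print("rr:", [rr.pick(pool).addr for _ in range(4)])
    pool[0].active_conns, pool[1].active_conns, pool[2].active_conns = 5, 2, 3
    print("lc:", least_connection(pool).addr)
    print("sh:", source_hash(pool, "10.244.1.99").addr)
```

Note that the simplified source hash remaps most clients whenever the backend list changes length, so stickiness holds only while the set of backends is stable.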
7. IPVS and conntrack
IPVS and conntrack work together:
Client → IPVS (DNAT) → Backend → conntrack (reverse DNAT) → Client
IPVS handles the load-balancing decision. conntrack tracks the connection state so that return traffic is reverse-NATed correctly without going through IPVS scheduling again.
┌──────────────────────────────────────────────────────────┐
│ Connection tracked by conntrack │
│ │
│ Original: 10.244.1.5:<port> → 10.96.0.42:80 (ClusterIP) │
│ DNAT: dst 10.96.0.42:80 → 10.244.2.7:8080 │
│ Reply: src 10.244.2.7:8080 → dst 10.244.1.5:<port> │
│ conntrack rewrites the reply source back to 10.96.0.42:80 │
└──────────────────────────────────────────────────────────┘
For hairpin mode (Pod reaching its own Service ClusterIP), conntrack is essential:
# Pod A → Service ClusterIP → IPVS → Backend (could be Pod A itself)
# Hairpin: return traffic must go back through IPVS
# conntrack tracks the flow so the reply reaches the right place
Tuning conntrack for high-traffic clusters:
# Check current conntrack table size
sysctl net.netfilter.nf_conntrack_max
# Check current usage
cat /proc/sys/net/netfilter/nf_conntrack_count
# Increase if needed
sysctl -w net.netfilter.nf_conntrack_max=1048576
# For faster conntrack lookup, resize the hash table (a module parameter, not a sysctl)
echo 262144 > /sys/module/nf_conntrack/parameters/hashsize
Add the nf_conntrack_max setting to /etc/sysctl.d/99-kubernetes.conf to persist it across reboots.
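Before raising nf_conntrack_max it helps to know how full the table actually is. The following Python sketch reads the two counters mentioned above and reports utilization. The 80% warning threshold and the function names are assumptions chosen for illustration, not recommended values.

```python
from pathlib import Path

PROC = Path("/proc/sys/net/netfilter")


def conntrack_usage(count, maximum):
    """Fraction of the conntrack table currently in use."""
    if maximum <= 0:
        raise ValueError("nf_conntrack_max must be positive")
    return count / maximum


def describe(count, maximum, threshold=0.8):
    """One-line status; WARN once usage reaches the (assumed) threshold."""
    usage = conntrack_usage(count, maximum)
    status = "WARN" if usage >= threshold else "ok"
    return f"{status}: {count}/{maximum} conntrack entries ({usage:.0%})"


def read_int(name, base=PROC):
    """Read one integer counter from the netfilter proc directory."""
    return int((base / name).read_text().strip())


if __name__ == "__main__":
    try:
        print(describe(read_int("nf_conntrack_count"), read_int("nf_conntrack_max")))
    except (FileNotFoundError, PermissionError):
        # The files are absent when the nf_conntrack module is not loaded.
        print("conntrack counters not available on this host")
```

Run on each node (or from a DaemonSet with host access), a check like this shows whether a larger table is needed before changing the limit.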
8. strictARP explained
strictARP: true makes kube-proxy set the arp_ignore sysctl to 1 and arp_announce to 2 on the node. With these settings, an interface answers ARP requests only for addresses configured on that same interface, rather than for any address the node holds.
Why this matters: kube-ipvs0 carries every ClusterIP and, for Services of type LoadBalancer, their external IPs as well. With the kernel defaults, a node answers ARP on its physical interface for any local address, including those bound to kube-ipvs0. That conflicts with load-balancer implementations that announce Service IPs at layer 2 (MetalLB in L2 mode requires strictARP: true for this reason), because every node would claim the same IP.
# Without strictARP (arp_ignore=0):
# ARP request for a Service IP held on kube-ipvs0 arrives on eth0
# Every node answers (wrong!) → traffic may go to an unintended node
# With strictARP (arp_ignore=1, arp_announce=2):
# A node answers ARP only for addresses on the interface that received the request
# Addresses bound only to kube-ipvs0 are not answered on eth0
For clusters that do not announce Service IPs over ARP, strictARP: false is fine.
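The effect of strict ARP can be sketched as a decision function. The assumption here, based on the kernel's documented arp_ignore semantics, is that strictARP: true corresponds to arp_ignore=1 (reply only when the target address is configured on the receiving interface) and the default to arp_ignore=0 (reply for any local address). The interface layout and the 192.168.1.240 LoadBalancer IP below are made up for the demo.

```python
def should_reply(target_ip, incoming_iface, iface_addrs, arp_ignore):
    """Decide whether the host answers an ARP request.

    iface_addrs maps interface name -> set of IPv4 addresses configured on it.
    Only arp_ignore levels 0 and 1 are modelled.
    """
    if arp_ignore == 0:
        # Answer for any local address, whichever interface holds it.
        return any(target_ip in addrs for addrs in iface_addrs.values())
    if arp_ignore == 1:
        # Answer only if the address lives on the interface the request arrived on.
        return target_ip in iface_addrs.get(incoming_iface, set())
    raise ValueError("only arp_ignore 0 and 1 are modelled here")


if __name__ == "__main__":
    node = {
        "eth0": {"192.168.1.10"},
        "kube-ipvs0": {"10.96.0.42", "192.168.1.240"},  # hypothetical Service IPs
    }
    for level in (0, 1):
        answered = should_reply("192.168.1.240", "eth0", node, arp_ignore=level)
        print(f"arp_ignore={level}: node answers for 192.168.1.240 on eth0 -> {answered}")
```

With arp_ignore=0 every node holding the address on kube-ipvs0 would answer, which is the conflict strict ARP is meant to avoid.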
9. Comparison: iptables vs IPVS
| | iptables | IPVS | Notes |
|---|---|---|---|
| Lookup | O(n) chain | O(1) hash | |
| Setup | Always works | Needs kernel modules | |
| Algorithms | Random/probability | rr, wrr, lc, dh, sh, sed, nq | |
| Update cost | High (rewrite chains) | Low (hash update) | |
| CPU at scale | High | Low | |
| Conntrack | Required | Required | Both use it |
| Session affinity | Limited (probability) | Deterministic (sh/dh) | |
| L7 proxy | No | No | For L7, use a service mesh |
| Debugging | iptables -L -n -v | ipvsadm -L -n | |
IPVS is the right choice for large clusters (1000+ Services) or when you need deterministic session affinity.
10. Performance numbers at scale
Rough CPU impact of kube-proxy at scale (measured on a 3-node cluster with 50/50 split between data and control plane):
| Services | Endpoints | iptables CPU (extra) | IPVS CPU (extra) |
|---|---|---|---|
| 100 | 1,000 | ~0.5% per node | ~0.5% |
| 1,000 | 10,000 | ~3-5% per node | ~0.5% |
| 5,000 | 50,000 | ~15-20% per node | ~1% |
| 10,000 | 100,000 | ~30%+ per node | ~2% |
| 50,000 | 500,000 | Kernel OOM possible | ~5-10% |
The crossover point where IPVS clearly wins is 1,000–5,000 Services.
11. Debugging with ipvsadm
# List all virtual services
ipvsadm -L -n
# TCP 10.96.0.42:80 rr
# -> 10.244.1.5:80 Masq 1 0 0
# -> 10.244.2.7:80 Masq 1 0 0
# -> 10.244.3.9:80 Masq 1 0 0
# List with connection info
ipvsadm -L -n --connection
# List a specific virtual service
ipvsadm -L -n -t 10.96.0.42:80
# Show statistics
ipvsadm -L -n --stats
# TCP connections: how many flows went through
# Show rates
ipvsadm -L -n --rate
# InPkt/s, OutPkt/s, InBytes/s, OutBytes/s
# Check the dummy interface
ip addr show kube-ipvs0
# All ClusterIPs should be here
ip addr show kube-ipvs0 | grep inet
# Clear all IPVS rules (dangerous — will break Service routing!)
ipvsadm -C
# Add a rule manually (for testing)
ipvsadm -A -t 10.96.0.99:80 -s rr
ipvsadm -a -t 10.96.0.99:80 -r 10.244.1.5:80 -m
# Check if IPVS kernel module is loaded
lsmod | grep ip_vs
# ip_vs_rr 16384 1
# ip_vs 147456 6 ip_vs_rr,ip_vs_wrr,ip_vs_sh,ip_vs_lc,ip_vs_dh,ip_vs_sed
12. kube-proxy ConfigMap reference
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-proxy
  namespace: kube-system
data:
  config.conf: |
    apiVersion: kubeproxy.config.k8s.io/v1alpha1
    kind: KubeProxyConfiguration
    # Mode: iptables, ipvs, or nftables (alpha in v1.29); userspace was removed in v1.26
    mode: "ipvs"
    # ipvs-specific config
    ipvs:
      scheduler: "rr"      # load balancing algorithm
      strictARP: false     # whether to use strict ARP
      excludeCIDRs: []     # CIDRs kube-proxy leaves alone when cleaning up IPVS rules
      minSyncPeriod: 1s    # minimum time between syncs
      syncPeriod: 15s      # full sync interval
      tcpTimeout: 900s     # idle TCP connection timeout
      tcpFinTimeout: 15s   # TCP FIN timeout
      udpTimeout: 300s     # UDP packet timeout
    # iptables config (still used for some things even in IPVS mode)
    iptables:
      masqueradeAll: false
      masqueradeBit: 14
      syncPeriod: 15s
      minSyncPeriod: 1s
    # Node addresses that accept NodePort traffic (null means all)
    nodePortAddresses: null  # or ["10.0.0.0/8"]
    # Logging
    logging:
      format: text         # or json
      verbosity: 2         # log verbosity level
13. Prerequisites and kernel modules
# Check if IPVS modules are loaded
lsmod | grep ip_vs
# Should show: ip_vs, ip_vs_rr, ip_vs_wrr, ip_vs_sh, ip_vs_lc, ip_vs_dh, ip_vs_sed, ip_vs_nq
# Load manually
modprobe ip_vs
modprobe ip_vs_rr
modprobe ip_vs_wrr
modprobe ip_vs_sh
modprobe ip_vs_lc
modprobe ip_vs_dh
modprobe ip_vs_sed
modprobe ip_vs_nq
# Most production distros load these automatically when kube-proxy starts
# On some systems (especially container-optimized), you may need to load explicitly
# Check module is available
modinfo ip_vs
14. CNI compatibility
| CNI | Works with IPVS? | Notes |
|---|---|---|
| Calico (BGP) | ✅ Yes | Calico detects kube-proxy's IPVS mode automatically |
| Calico (eBPF) | N/A | Calico’s eBPF dataplane replaces kube-proxy entirely |
| Cilium | ✅ Yes | Runs alongside kube-proxy, or replaces it entirely in kube-proxy replacement mode |
| Flannel | ✅ Yes | |
| Weave | ✅ Yes | |
| AWS VPC CNI | ✅ Yes | EKS uses this by default |
| GKE VPC CNI | ✅ Yes | |
CNIs that don’t need kube-proxy at all:
- Cilium (replaces kube-proxy entirely with eBPF)
- Calico with eBPF mode
For these, kube-proxy can be disabled entirely, or run in iptables mode alongside the CNI’s dataplane.
15. Rollback from IPVS to iptables
# 1. Edit the ConfigMap
kubectl edit configmap kube-proxy -n kube-system
# In config.conf, change the mode line to:
#   mode: iptables
# 2. Restart kube-proxy
kubectl rollout restart daemonset kube-proxy -n kube-system
# 3. Verify
kubectl get pods -n kube-system -l k8s-app=kube-proxy
# Wait for rollout to complete
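# 4. Check for leftover IPVS state (run on a node)
ipvsadm -L -n
# May still list entries until kube-proxy cleans them up or the node reboots
# 5. Verify iptables rules are back (Service rules live in the nat table)
iptables -t nat -L KUBE-SERVICES -n -v | head -20
IPVS rules can persist until the next kube-proxy sync or a node reboot. They do not conflict with the restored iptables rules, so the rollback should not disrupt traffic.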
16. When to use each mode
| Use case | Mode |
|---|---|
| < 1,000 Services, simple cluster | iptables (default) |
| > 1,000 Services | IPVS |
| Need deterministic session affinity | IPVS (sh or dh scheduler) |
| Long-lived connections (gRPC, websockets) | IPVS (lc scheduler) |
| Using Cilium or Calico eBPF | eBPF replaces kube-proxy entirely |
| Embedded/home-lab cluster | iptables |
| Multi-tenant with many NodePort services | IPVS |
17. Gotchas
- The kube-ipvs0 dummy interface is the key. If it’s missing, ClusterIPs aren’t routable. Check with ip addr show kube-ipvs0.
- The iptables rules are mostly empty in IPVS mode. kube-proxy doesn’t install per-Service rules in iptables; they live in IPVS. iptables -L KUBE-SERVICES will be sparse.
- IPVS uses conntrack. Don’t disable conntrack; it’s required for hairpin mode and return traffic handling.
- The kernel modules must be loaded. Some container-optimized OS images don’t load them by default. Add them to /etc/modules-load.d/ to persist.
- strictARP: true is needed when Service IPs are announced over ARP (for example, MetalLB in L2 mode). Without it, every node may answer ARP for addresses bound to kube-ipvs0.
- Rolling back to iptables is safe, but old IPVS rules linger until kube-proxy restarts or syncs. They won’t cause conflicts.
- IPVS doesn’t do health checking of backends; that’s still kube-proxy’s job. If a Pod becomes unready, kube-proxy removes it from the IPVS real server list.
- ipvsadm -C clears all rules and will break cluster networking. Never run it in production without a rollback plan.
- IPVS and IPv6 dual-stack works. On kernels older than 4.19, ensure nf_conntrack_ipv6 is loaded alongside the IPv4 modules; newer kernels fold it into nf_conntrack.
- scheduler: rr is the default and rarely wrong. Changing schedulers is an optimization; measure before changing.
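The health-checking gotcha amounts to a reconciliation step: compare the real servers currently programmed for a virtual service with the endpoints that are ready, then add and remove the difference. A minimal Python sketch of that idea follows. The function names and data shapes are hypothetical and do not mirror kube-proxy's actual code.

```python
def ready_endpoints(endpoints):
    """Keep only endpoints whose Pod is marked ready.

    endpoints maps "ip:port" -> bool (ready or not).
    """
    return {addr for addr, ready in endpoints.items() if ready}


def plan_sync(current, desired):
    """Return (to_add, to_remove) so that `current` converges on `desired`.

    Both arguments are sets of "ip:port" strings for one virtual service.
    """
    return sorted(desired - current), sorted(current - desired)


if __name__ == "__main__":
    current = {"10.244.1.5:80", "10.244.2.7:80", "10.244.3.9:80"}
    endpoints = {
        "10.244.1.5:80": True,
        "10.244.2.7:80": False,  # Pod became unready
        "10.244.4.2:80": True,   # new Pod
    }
    to_add, to_remove = plan_sync(current, ready_endpoints(endpoints))
    print("add:", to_add)
    print("remove:", to_remove)
```

Each entry in the two lists would correspond to one real-server add or delete for that virtual service, in the spirit of the manual ipvsadm -a example in the debugging section.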
See also
- Services — what IPVS implements
- CNI — the network layer below
- Scheduling — how Pods land on nodes