Extended Resources and Device Plugins
“https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/”
Extended resources are opaque resources beyond CPU and memory — GPUs, FPGAs, InfiniBand HCAs, SR-IOV NICs, and custom hardware. They’re reported by device plugins running on the nodes, scheduled by the standard scheduler (via the NodeResourcesFit plugin), and consumed by Pods via the resources.limits field.
Table of Contents
- What Extended Resources Solve
- The Device Plugin Model
- Built-in and Common Extended Resources
- Reporting Resources: The Node’s View
- Consuming Resources: The Pod’s View
- GPU Scheduling Patterns
- Time-Slicing and MIG (GPU Sharing)
- Custom Device Plugins
- ResourceClaim and ResourceClaimTemplate (k8s 1.28+)
- Extended Resources and Quotas
- Operations and Debugging
- Gotchas and Common Mistakes
1. What Extended Resources Solve
Standard k8s resources (CPU, memory, ephemeral-storage) are not enough for many workloads:
- ML / AI — needs GPUs.
- HPC — needs FPGAs, InfiniBand, custom interconnects.
- Telco / NFV — needs SR-IOV NICs, DPDK, hardware accelerators.
- Storage — needs high-performance local NVMe.
Extended resources let nodes advertise what they have (e.g. “I have 4 NVIDIA A100 GPUs”), and Pods request what they need (e.g. “give me 1 GPU”). The scheduler places the Pod on a node that has the resource.
Node:                  Pod:                   Scheduler:
- cpu: 64              - cpu: 4               "Pod wants 1 GPU,
- memory: 256Gi        - memory: 32Gi          only node-2 has 1 free,
- nvidia.com/gpu: 4    - nvidia.com/gpu: 1     schedule there"
2. The Device Plugin Model
A device plugin is a gRPC daemon that runs on the node (as a DaemonSet, usually) and registers the resources with the kubelet.
Node
├── kubelet
│     │
│     └── gRPC (ListAndWatch)
│               ▲
│               │
│      ┌────────┴────────┐
│      │  device plugin  │
│      │   (DaemonSet)   │
│      │                 │
│      │   NVIDIA GPU    │
│      │    Operator     │
│      └─────────────────┘
The flow:
- The device plugin starts on the node.
- It registers itself with the kubelet via the Registration gRPC API.
- The kubelet calls ListAndWatch to get the available devices.
- The kubelet exposes the resources to the apiserver via the Node’s status.allocatable and status.capacity.
- The scheduler sees the resources and can place Pods.
- When a Pod uses a resource, the device plugin’s Allocate gRPC method is called, which prepares the device for the container (e.g. sets up CUDA libraries, mounts devices).
- The kubelet mounts the device into the container.
The device plugin API is a stable gRPC interface defined in k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1/. Vendors write plugins in any language that supports gRPC (Go is most common).
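The flow above can be sketched as a small in-memory model. This is a toy under stated assumptions, not the real gRPC API: `ToyPlugin`, `ToyKubelet`, and the `TOY_VISIBLE_DEVICES` variable are hypothetical names, and `list_and_watch` returns a single snapshot where the real call streams updates.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPlugin:
    """Hypothetical stand-in for a device plugin (no gRPC involved)."""
    resource_name: str
    device_ids: list

    def list_and_watch(self):
        # The real call is a stream; this toy returns one snapshot.
        return [{"id": d, "health": "Healthy"} for d in self.device_ids]

    def allocate(self, ids):
        # The real response carries envs, mounts and device specs.
        return {"envs": {"TOY_VISIBLE_DEVICES": ",".join(ids)}}


@dataclass
class ToyKubelet:
    """Hypothetical stand-in for the kubelet's device manager."""
    plugins: dict = field(default_factory=dict)
    allocatable: dict = field(default_factory=dict)
    in_use: dict = field(default_factory=dict)

    def register(self, plugin):
        # Registration, device discovery, and node status update.
        self.plugins[plugin.resource_name] = plugin
        devices = plugin.list_and_watch()
        healthy = [d["id"] for d in devices if d["health"] == "Healthy"]
        self.allocatable[plugin.resource_name] = healthy
        self.in_use[plugin.resource_name] = set()

    def start_container(self, resource_name, count):
        # Pick free device IDs, then ask the plugin to prepare them.
        used = self.in_use[resource_name]
        free = [d for d in self.allocatable[resource_name] if d not in used]
        if len(free) < count:
            raise RuntimeError("not enough free devices")
        chosen = free[:count]
        used.update(chosen)
        return self.plugins[resource_name].allocate(chosen)


kubelet = ToyKubelet()
kubelet.register(ToyPlugin("example.com/foo", ["foo-0", "foo-1"]))
response = kubelet.start_container("example.com/foo", 1)
print(response["envs"])
```

The point of the model is the ordering: nothing is schedulable until registration and discovery have populated the node's allocatable list, and Allocate only runs once a container is about to start.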
2.1 The registration handshake
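// Go sketch of the registration call
import (
    "context"
    "path/filepath"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// devicePluginPath is typically pluginapi.DevicePluginPath
// (/var/lib/kubelet/device-plugins/).
func (m *MyDevicePlugin) Register(ctx context.Context) error {
    // Dial the kubelet's registration socket (a Unix domain socket).
    conn, err := grpc.Dial(
        "unix://"+filepath.Join(devicePluginPath, "kubelet.sock"),
        grpc.WithTransportCredentials(insecure.NewCredentials()),
    )
    if err != nil {
        return err
    }
    defer conn.Close()
    client := pluginapi.NewRegistrationClient(conn)
    req := &pluginapi.RegisterRequest{
        Version:      pluginapi.Version,
        Endpoint:     "my-device-plugin.sock", // the plugin's own socket, relative to devicePluginPath
        ResourceName: "example.com/foo",
        Options:      &pluginapi.DevicePluginOptions{},
    }
    _, err = client.Register(ctx, req)
    return err
}
The plugin first serves its own Unix socket, then calls Register on the kubelet’s socket. The kubelet connects back to the plugin’s endpoint and calls ListAndWatch for the device list.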
3. Built-in and Common Extended Resources
3.1 CPU and memory
Standard, not extended. The kubelet reports them automatically.
3.2 NVIDIA GPUs
The NVIDIA device plugin for Kubernetes is a DaemonSet that detects NVIDIA GPUs and registers them as nvidia.com/gpu.
# node status
status:
  allocatable:
    nvidia.com/gpu: 4
  capacity:
    nvidia.com/gpu: 4
The plugin also:
- Sets up the NVIDIA container runtime.
- Mounts CUDA libraries.
- Configures the GPU for the container.
3.3 Other accelerators
| Resource | Vendor / Project | Common use |
|---|---|---|
| nvidia.com/gpu | NVIDIA | ML, AI, CUDA workloads |
| amd.com/gpu | AMD | ML on AMD GPUs (ROCm) |
| gpu.intel.com/i915 | Intel | Integrated GPU, Intel Xe |
| nvidia.com/mig-1g.5gb | NVIDIA MIG | Multi-Instance GPU partitioning |
| nvidia.com/gpu.shared | Time-slicing | Multiple Pods sharing one GPU |
| hugepages-1Gi | (built-in) | Huge page allocation |
| example.com/infiniband | Custom | InfiniBand HCA |
| example.com/fpga | Intel / Xilinx | FPGA workloads |
3.4 Huge pages
Huge pages are a built-in resource, but they work differently from plugin-reported resources. They’re pre-allocated on the node at the kernel level (boot parameters or sysfs), discovered by the kubelet, and reported as hugepages-2Mi or hugepages-1Gi. There is no kubelet flag that creates them.
# kernel boot parameters on the node: pre-allocate four 1 GiB pages
default_hugepagesz=1G hugepagesz=1G hugepages=4
# Pod
resources:
  requests:
    hugepages-1Gi: 2Gi
  limits:
    hugepages-1Gi: 2Gi
The Pod’s container is allocated two 1 GiB huge pages (2 GiB in total).
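As a quick check of the arithmetic, the sketch below converts a hugepages request into a page count. The helper names and the small unit table are illustrative assumptions, not a Kubernetes API, and the sketch treats a request that is not a whole number of pages as an error.

```python
# Binary suffixes used by Kubernetes quantities (a subset, for illustration).
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def to_bytes(quantity: str) -> int:
    """Parse a quantity such as '2Gi' or '2Mi' into bytes."""
    number, suffix = quantity[:-2], quantity[-2:]
    return int(number) * UNITS[suffix]


def pages_for(resource_name: str, request: str) -> int:
    """Return how many huge pages a request for e.g. hugepages-1Gi consumes."""
    page_size = to_bytes(resource_name.split("-", 1)[1])
    requested = to_bytes(request)
    if requested % page_size != 0:
        raise ValueError("request must be a whole number of pages")
    return requested // page_size


print(pages_for("hugepages-1Gi", "2Gi"))  # 2
print(pages_for("hugepages-2Mi", "1Gi"))  # 512
```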
4. Reporting Resources: The Node’s View
The kubelet reports the node’s resources in status.allocatable and status.capacity:
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
status:
  capacity:
    cpu: "64"
    memory: 256Gi
    nvidia.com/gpu: 4
    hugepages-1Gi: 4Gi
  allocatable:
    cpu: 63500m          # 63.5 cores (500m reserved)
    memory: 250Gi        # 250 GiB (6 GiB reserved)
    nvidia.com/gpu: 4
    hugepages-1Gi: 4Gi
allocatable is what the scheduler can assign to Pods. capacity is the node’s total.
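The relationship between the two fields in the example can be written out as a subtraction. This is a simplified sketch: the reservation figures are the ones assumed in the example, the dictionary keys are made up for readability, and the real kubelet derives allocatable from its kube-reserved, system-reserved, and eviction-threshold settings.

```python
# Values mirror the example Node; the key names are illustrative only.
capacity = {"cpu_millicores": 64_000, "memory_gib": 256, "nvidia.com/gpu": 4}
reserved = {"cpu_millicores": 500, "memory_gib": 6}  # assumed reservation

# Resources without a reservation (here the GPUs) pass through unchanged.
allocatable = {
    name: total - reserved.get(name, 0) for name, total in capacity.items()
}
print(allocatable)
```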
4.1 Manually advertising resources
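For some resources (mostly for testing), you can advertise them manually by patching the Node’s status through the API server:
# with `kubectl proxy` running; "~1" encodes "/" in a JSON Patch path
curl --header "Content-Type: application/json-patch+json" \
  --request PATCH \
  --data '[{"op": "add", "path": "/status/capacity/example.com~1foo", "value": "2"}]' \
  http://localhost:8001/api/v1/nodes/<name>/status
Or with kubectl (1.24 or newer). There is no kubelet flag for this, and the status subresource has to be patched explicitly:
kubectl patch node <name> --subresource=status --type=json \
  -p '[{"op": "add", "path": "/status/capacity/example.com~1foo", "value": "2"}]'
The scheduler counts manually advertised resources and Pods can request them, but nothing is attached to the container: the kubelet doesn’t know how to actually give them to containers. Use a device plugin for real resources.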
5. Consuming Resources: The Pod’s View
A Pod requests an extended resource via resources.requests and resources.limits:
apiVersion: v1
kind: Pod
metadata: { name: ml-trainer }
spec:
  containers:
  - name: trainer
    image: tensorflow/tensorflow:latest-gpu
    resources:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1
For extended resources, requests and limits must be equal when both are set. Setting only limits is enough, because requests then default to the same value. Extended resources can’t be overcommitted and are counted in whole units: you can’t ask for “0.5 GPUs”.
5.1 The scheduler’s view
The scheduler’s NodeResourcesFit plugin filters out nodes that don’t have the resource. For a Pod asking for nvidia.com/gpu: 1:
- A node with nvidia.com/gpu: 0 is dropped.
- A node with nvidia.com/gpu: 1 is considered (and 1 is reserved).
- A node with nvidia.com/gpu: 4 is considered (and 1 is reserved; 3 remain).
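The filtering rule can be sketched in a few lines. This is a simplified model of the fit check, not the scheduler's code: the `fits` helper and the node data are hypothetical, and "free" is taken to be allocatable minus what already-placed Pods have requested.

```python
def fits(pod_requests, node_allocatable, node_requested):
    """Return True if every requested resource still has enough free units."""
    for name, amount in pod_requests.items():
        free = node_allocatable.get(name, 0) - node_requested.get(name, 0)
        if amount > free:
            return False
    return True


# Hypothetical cluster state: allocatable units and units already requested.
nodes = {
    "node-1": {"allocatable": {"nvidia.com/gpu": 0}, "requested": {}},
    "node-2": {"allocatable": {"nvidia.com/gpu": 1}, "requested": {}},
    "node-3": {"allocatable": {"nvidia.com/gpu": 4},
               "requested": {"nvidia.com/gpu": 4}},
    "node-4": {"allocatable": {"nvidia.com/gpu": 4}, "requested": {}},
}

pod = {"nvidia.com/gpu": 1}
feasible = [
    name for name, state in nodes.items()
    if fits(pod, state["allocatable"], state["requested"])
]
print(feasible)  # ['node-2', 'node-4']
```

node-3 shows why the count on the Node object alone is not enough: it has four GPUs, but all four are already requested by other Pods.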
5.2 The kubelet’s allocation
When the Pod is scheduled to the node, the kubelet:
- Picks free device IDs and calls the device plugin’s Allocate gRPC method with them.
- Receives the plugin’s response: environment variables, mount paths, device nodes, etc.
- Sets up the container with the allocated devices.
For NVIDIA GPUs, this is:
- Mount the GPU device files (/dev/nvidia0).
- Set environment variables (NVIDIA_VISIBLE_DEVICES=0).
- Mount the NVIDIA libraries.
- Configure the container runtime for GPU access.
6. GPU Scheduling Patterns
6.1 One GPU per Pod (most common)
spec:
  containers:
  - name: trainer
    resources:
      limits:
        nvidia.com/gpu: 1
One Pod, one GPU. The Pod owns the GPU exclusively.
6.2 Multiple GPUs per Pod
spec:
  containers:
  - name: trainer
    resources:
      limits:
        nvidia.com/gpu: 4
For distributed training (e.g. 4 GPUs for one model). The Pod owns 4 GPUs.
6.3 GPU type selection
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
  containers:
  - name: trainer
    resources:
      limits:
        nvidia.com/gpu: 1
The Pod is scheduled only on nodes with A100 80GB GPUs. The GPU model is a node label, typically set by NVIDIA’s GPU Feature Discovery or by a custom controller.
6.4 GPU taints for dedicated nodes
# taint GPU nodes so only GPU Pods land there
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
# GPU Pod tolerates the taint
spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  containers:
  - name: trainer
    resources:
      limits:
        nvidia.com/gpu: 1
This is a common pattern for mixed clusters (CPU + GPU nodes).
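The matching rule behind this pattern can be sketched as follows. It is a simplified model that only looks at NoSchedule taints, and the helper names are hypothetical; an empty toleration key or effect is treated as a wildcard, and the default operator is Equal.

```python
def tolerates(toleration, taint):
    """Return True if one toleration matches one taint."""
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True  # Exists ignores the taint's value
    return toleration.get("value", "") == taint.get("value", "")


def can_schedule(pod_tolerations, node_taints):
    """Every NoSchedule taint on the node must be tolerated by the Pod."""
    return all(
        any(tolerates(t, taint) for t in pod_tolerations)
        for taint in node_taints
        if taint["effect"] == "NoSchedule"
    )


gpu_node_taints = [
    {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}
]
gpu_pod = [{"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}]
cpu_pod = []  # no tolerations

print(can_schedule(gpu_pod, gpu_node_taints))  # True
print(can_schedule(cpu_pod, gpu_node_taints))  # False
```

The taint only keeps other Pods off the node. It does not attract GPU Pods, which still find the node through their nvidia.com/gpu request.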
7. Time-Slicing and MIG (GPU Sharing)
GPUs are expensive. Time-slicing and MIG let multiple Pods share a single GPU.
7.1 Time-slicing
The NVIDIA device plugin’s time-slicing config lets multiple Pods use the same GPU in time slices (rapid context-switching). Each Pod sees the GPU, but they share the compute.
# ConfigMap for the NVIDIA device plugin
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4   # 4 Pods can share 1 GPU
The plugin advertises nvidia.com/gpu: 4 per physical GPU. A Pod asking for nvidia.com/gpu: 1 gets a time-sliced share of one GPU.
Time-slicing is not isolation. Two Pods on the same GPU still share the GPU’s memory and compute. They’re rapidly swapped. For workloads that don’t fully use the GPU (inference, light training), this works. For heavy compute (large model training), it doesn’t.
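Two small calculations make the trade-off concrete: how many schedulable units the replicas setting produces, and whether the Pods sharing one GPU would exceed its memory. Both helpers and the memory figures are illustrative assumptions; nothing in Kubernetes performs the second check for you.

```python
def advertised_units(physical_gpus: int, replicas: int) -> int:
    """Schedulable nvidia.com/gpu units on a node with time-slicing enabled."""
    return physical_gpus * replicas


def memory_oversubscribed(gpu_memory_gib: float, per_pod_peak_gib: list) -> bool:
    """True if the Pods sharing one GPU could exceed its memory at peak."""
    return sum(per_pod_peak_gib) > gpu_memory_gib


print(advertised_units(4, 4))                            # 16
print(memory_oversubscribed(80, [10, 10, 10, 10]))       # False
print(memory_oversubscribed(80, [30, 30, 30]))           # True
```

The third call is the failure mode described above: the scheduler happily places three Pods on the same GPU, and the out-of-memory error only appears at runtime.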
7.2 MIG (Multi-Instance GPU)
A100 and H100 GPUs support hardware partitioning into multiple isolated instances. Each MIG instance has its own memory, compute, and decoders. True hardware isolation.
# nvidia device plugin config for MIG
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin
data:
  config.yaml: |
    version: v1
    flags:
      migStrategy: mixed   # advertise MIG instances as separate resources
The plugin advertises nvidia.com/mig-1g.5gb (one compute slice with 5 GB of memory) and similar resources. A Pod asks for a specific MIG profile.
spec:
  containers:
  - name: inference
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1
MIG is true isolation: each instance has its own memory and compute. Two Pods on MIG instances don’t interfere.
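The resource name encodes the MIG profile, which is handy when building dashboards or admission checks. The parser below is a hypothetical helper that handles only the basic `<N>g.<M>gb` form; profile names with extra qualifiers are out of scope for this sketch.

```python
import re

# Matches names such as nvidia.com/mig-1g.5gb or nvidia.com/mig-3g.20gb.
MIG_RESOURCE = re.compile(r"^nvidia\.com/mig-(\d+)g\.(\d+)gb$")


def parse_mig_resource(name: str) -> dict:
    """Split a MIG resource name into compute slices and memory size."""
    match = MIG_RESOURCE.match(name)
    if match is None:
        raise ValueError(f"not a basic MIG resource name: {name!r}")
    return {
        "compute_slices": int(match.group(1)),
        "memory_gb": int(match.group(2)),
    }


print(parse_mig_resource("nvidia.com/mig-1g.5gb"))
print(parse_mig_resource("nvidia.com/mig-3g.20gb"))
```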
7.3 CUDA MPS
CUDA Multi-Process Service allows multiple processes to share a GPU more efficiently. Less common in k8s.
8. Custom Device Plugins
For custom hardware, you write a device plugin. The minimum is:
- A gRPC server implementing the DevicePlugin service, including ListAndWatch.
- A way to allocate the device (the Allocate API).
- A call to the kubelet’s Registration API at startup.
- Deployment as a DaemonSet (one Pod per node).
The API definitions are in k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1, and the Kubernetes source tree contains a sample device plugin.
8.1 The allocation response
// what Allocate returns
type ContainerAllocateResponse struct {
    Envs        map[string]string // environment variables
    Mounts      []*Mount          // files to mount
    Devices     []*DeviceSpec     // device files
    Annotations map[string]string
    CDIDevices  []*CDIDevice      // CDI devices
}
The kubelet applies these to the container. The plugin can set env vars, mount files, and expose devices.
8.2 Resource granularity
The plugin reports a resourceName (e.g. example.com/foo) and a count. The granularity is up to the plugin:
- A plugin that reports example.com/foo: 1 means “1 unit of foo”.
- A plugin that reports example.com/foo: 8 means “8 units of foo”.
The Pod’s request must be a positive integer. The kubelet doesn’t know about partial units.
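A minimal sketch of the rule as stated here: the quantity must be a whole, positive number, so fractional or milli-unit values that are valid for CPU are rejected. The function name is hypothetical and this is not the API server's validation code.

```python
def parse_extended_quantity(value: str) -> int:
    """Accept '1', '8', ...; reject '0.5', '500m', '-1' and '0'."""
    if not (value.isascii() and value.isdigit()):
        raise ValueError(f"{value!r} is not a whole number of units")
    count = int(value)
    if count == 0:
        raise ValueError("request must be positive")
    return count


for candidate in ["1", "8", "0.5", "500m"]:
    try:
        print(candidate, "->", parse_extended_quantity(candidate))
    except ValueError as err:
        print(candidate, "-> rejected:", err)
```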
9. ResourceClaim and ResourceClaimTemplate (k8s 1.28+)
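The Dynamic Resource Allocation (DRA) feature extends extended resources with first-class allocation objects. It first appeared as alpha in 1.26, was reworked around structured parameters in later releases, reached beta in 1.32, and became GA in 1.34. The manifests in this section use the early alpha API shape; field names changed in later API versions, so check which resource.k8s.io version your cluster serves.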
9.1 ResourceClaim
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaim
metadata: { name: ml-claim }
spec:
  resourceClassName: gpu-a100
  allocationMode: WaitForFirstConsumer
A ResourceClaim is a request for a resource. The scheduler matches it to an available device. The Pod references the claim by name.
9.2 The Pod view
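spec:
  resourceClaims:
  - name: gpu
    source:
      resourceClaimName: ml-claim
  containers:
  - name: trainer
    resources:
      claims:
      - name: gpu
The Pod lists the claim under spec.resourceClaims, and the container references that entry by name. The scheduler allocates the claim to a device on a node, and the kubelet exposes the device to the container.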
9.3 ResourceClaimTemplate (for StatefulSets)
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaimTemplate
metadata: { name: gpu-claim }
spec:
  metadata: { name: gpu-claim }
  spec:
    resourceClassName: gpu-a100
A StatefulSet uses ResourceClaimTemplate to create one claim per replica. Each Pod gets its own GPU.
9.4 Why this exists
DRA is the next generation of extended resources:
- Structured claims (not just opaque integers).
- First-class scheduler support (the scheduler has a plugin for DRA).
- Class-based selection (multiple classes of GPU, with priorities).
- Init-time allocation (allocate when the Pod starts, not before).
DRA is still in alpha/beta as of 1.30. Adoption is early. The device plugin model is still the standard.
10. Extended Resources and Quotas
ResourceQuota supports extended resources:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml
spec:
  hard:
    requests.nvidia.com/gpu: "8"
The namespace can have at most 8 GPUs requested across all Pods. The sum of Pods’ requests.nvidia.com/gpu must not exceed 8.
For extended resources, a quota accepts only the requests. prefix. Overcommit is not allowed, so a Pod’s request and limit are always equal and a separate limits. quota would be redundant. A Pod that requests 4 GPUs counts as 4 against the quota.
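Quota enforcement for an extended resource amounts to a running sum checked at admission. The sketch below models only the requests side with hypothetical names; it is not the ResourceQuota controller's code.

```python
class QuotaExceeded(Exception):
    """Raised when admitting a Pod would push usage past the hard limit."""


def admit(used: int, hard: int, pod_request: int) -> int:
    """Return the new usage, or raise if the Pod does not fit in the quota."""
    new_total = used + pod_request
    if new_total > hard:
        raise QuotaExceeded(f"requested {pod_request}, used {used}, limited {hard}")
    return new_total


HARD_GPUS = 8  # mirrors requests.nvidia.com/gpu: "8"
used = 0
for request in [4, 2, 2]:
    used = admit(used, HARD_GPUS, request)
print(used)  # 8

try:
    admit(used, HARD_GPUS, 1)
except QuotaExceeded as err:
    print("rejected:", err)
```

The rejection happens when the Pod is created, before scheduling, so a namespace at quota never gets as far as a Pending Pod.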
11. Operations and Debugging
11.1 Common commands
# check a node's resources
kubectl describe node <name>
# look at "Allocated resources" and "Capacity"
# check Pod's resource requests
kubectl get pod <pod> -o jsonpath='{.spec.containers[*].resources}'
# check the device plugin (the label depends on how the plugin was installed)
kubectl -n kube-system get pods -l app=nvidia-device-plugin
kubectl -n kube-system logs -l app=nvidia-device-plugin --tail=100
# check the kubelet's view
# (on the node)
ls /var/lib/kubelet/device-plugins/     # kubelet.sock and device plugin sockets
ls /var/lib/kubelet/plugins_registry/   # registration sockets for CSI and DRA plugins
11.2 The “GPU not allocated” checklist
# 1. Does the node have the resource?
kubectl describe node <gpu-node> | grep nvidia
# 2. Is the device plugin running?
kubectl -n kube-system get pods -l app=nvidia-device-plugin
# the Pod should be Running on the GPU node
# 3. Is the Pod requesting the right resource?
kubectl get pod <pod> -o yaml
# look for resources.limits."nvidia.com/gpu"
# 4. Is the device plugin registered with the kubelet?
# (on the node)
ls /var/lib/kubelet/device-plugins/
# look for nvidia-gpu.sock
# 5. Check the kubelet logs
journalctl -u kubelet --since "5 minutes ago" | grep -i gpu
11.3 The “GPU allocated but not visible to container” case
The Pod has nvidia.com/gpu: 1, the node has 4 GPUs, the Pod is running, but nvidia-smi inside the container shows no GPU.
# 1. Is the device plugin's Allocate response correct?
# check the device plugin logs
kubectl -n kube-system logs -l app=nvidia-device-plugin
# 2. Is the NVIDIA runtime configured?
# check the container runtime config
# (Docker: /etc/docker/daemon.json; containerd: /etc/containerd/config.toml)
# 3. Is the NVIDIA driver loaded on the node?
nvidia-smi # on the node
# should show the GPUs
# 4. Is the container image CUDA-enabled?
# some images don't have CUDA libraries
12. Gotchas and Common Mistakes
12.1 The 20+ common mistakes
- Extended resources are integer-only. You can’t ask for “0.5 GPUs”. Requests are whole units.
- A device plugin must run on every node that has the resource. If a node has 4 GPUs but no device plugin, the kubelet reports no nvidia.com/gpu at all (or 0 allocatable if the plugin registered once and then went away). The scheduler can’t place GPU Pods there.
- The device plugin must register with the kubelet before the kubelet reports the resource. A buggy plugin = a node with no resources.
- requests and limits for extended resources must be equal when both are set. Unlike CPU/memory, you can’t “burst” past the request.
- Time-slicing is not isolation. Two Pods on the same time-sliced GPU still share the GPU’s memory. For ML training, this can cause OOM.
- MIG requires MIG-enabled GPUs (A100, H100). Older GPUs don’t support MIG.
- The device plugin’s Allocate is a hot path. If the plugin is slow, Pod startup is slow.
- Custom resources (example.com/foo) are opaque to the scheduler. The scheduler can’t tell what they are. It just counts them.
- A ResourceQuota with extended resources blocks Pods at admission. If the namespace is at quota, the Pod is rejected with “exceeded quota”.
- Device plugin updates restart containers. A plugin version bump can cause Pod evictions.
- The DynamicResourceAllocation feature gate must be enabled (for example, --feature-gates=DynamicResourceAllocation=true on the kubelet) in releases where DRA is not on by default. Older kubelets don’t support it.
- DRA’s ResourceClaim API changed between its alpha and beta versions. Don’t depend on it for production until the Kubernetes version you run serves it as GA.
- GPU Pods need the NVIDIA runtime. Without it, the container can’t access the GPU even if nvidia.com/gpu is allocated.
- GPU taints need to be tolerated. A taint on a GPU node keeps non-GPU Pods off the node, which is the intent, but it also keeps off GPU Pods that lack the toleration. This is easy to forget.
- A Pod with nvidia.com/gpu: 1 doesn’t reserve the GPU’s memory. Time-sliced sharing can cause memory exhaustion. Use MIG or check memory usage.
- The kubelet reports allocatable differently from capacity. System reserved (kubelet, kernel) is subtracted for CPU and memory. For devices, allocatable drops below capacity when the plugin reports a device as unhealthy, so a 4-GPU node might have allocatable nvidia.com/gpu: 3.
- A node with a misconfigured device plugin can still receive GPU Pods. The node advertises the resource, the Pod is scheduled, the kubelet tries to allocate and fails, and the Pod never starts.
- Huge pages are a separate resource. They’re not allocated by a device plugin; they’re pre-allocated on the node and discovered by the kubelet. A node with 4 huge pages can have 4 Pods each asking for 1, or 2 Pods each asking for 2.
- The kubelet’s --max-pods flag limits Pod count, not resource count. A node with 4 GPUs can still have many Pods (each holding one time-sliced replica, in time-sliced mode).
- GPU sharing with cgroup v1 is broken. Use cgroup v2 for time-slicing and MIG.
See also
- Scheduling — the broader scheduling context
- Scheduler Internals — the NodeResourcesFit plugin
- ResourceQuota — namespace-level extended resource quotas