Kubernetes 1.35 "Timbernetes" marks a strategic turning point in the cloud-native ecosystem: Dynamic Resource Allocation (DRA) for GPUs goes beta, cementing Kubernetes' position as the reference platform for large-scale artificial intelligence workloads. With the explosion of "AI factories" and distributed machine learning pipelines, fine-grained management of hardware accelerators is becoming critical.
As traditional device plugins show their limits in terms of flexibility and granularity, DRA offers a declarative, extensible approach for orchestrating not only NVIDIA GPUs but also TPUs, FPGAs and other specialized accelerators. This technical evolution answers a major operational need: how do you efficiently share expensive hardware resources across dozens of training and inference pods while guaranteeing isolation and performance?
Kubernetes and the AI era: why DRA changes the game
Since Kubernetes 1.8, device plugins have made it possible to manage GPUs as countable resources. The problem: this approach stays rigid. A pod requests "1 GPU" or "2 GPUs", with no way to specify complex topological constraints such as "two GPUs sharing the same NVLink" or "a GPU with at least 40 GB of VRAM on a node with InfiniBand".
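To make the contrast concrete, here is a minimal Python sketch of the kind of filtering a count-only request cannot express. The `Gpu` record and its fields (`vram_gb`, `node_has_infiniband`) are hypothetical names chosen for illustration; they are not part of any Kubernetes API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    """Hypothetical inventory record, not a Kubernetes type."""
    name: str
    vram_gb: int
    node_has_infiniband: bool


def count_only(gpus, count):
    """Device-plugin style: any `count` GPUs will do."""
    return gpus[:count]


def with_constraints(gpus, count, min_vram_gb, need_infiniband):
    """Attribute-based style: keep only GPUs that satisfy the request."""
    matching = [
        g for g in gpus
        if g.vram_gb >= min_vram_gb
        and (g.node_has_infiniband or not need_infiniband)
    ]
    # All-or-nothing: a partial match does not satisfy the request.
    return matching[:count] if len(matching) >= count else []


inventory = [
    Gpu("gpu-a", vram_gb=16, node_has_infiniband=False),
    Gpu("gpu-b", vram_gb=80, node_has_infiniband=True),
    Gpu("gpu-c", vram_gb=40, node_has_infiniband=False),
]
```

With this inventory, the count-only request returns whichever GPU comes first, while the constrained request only accepts the 80 GB device on the InfiniBand node.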
Dynamic Resource Allocation solves this by moving resource management logic out of the Kubernetes core and into external drivers. Instead of a simple quantity, DRA uses structured parameters that allow sophisticated requests to be expressed. Kubernetes 1.34 moved DRA to GA (General Availability) for the core APIs, and version 1.35 adds critical features:
- BindingConditions (beta): handling of GPUs that require setup time (fabric connection, firmware initialization)
- Prioritized alternatives (beta): define several acceptable GPU configurations in order of preference
- Device Taints and Tolerations (alpha): mark certain GPUs as reserved or degraded
- Partitionable Devices (alpha): split a GPU into slices (MIG on NVIDIA A100/H100)
- Consumable Capacity (alpha): tracking of consumable resources such as PCIe bandwidth
These features turn Kubernetes into an orchestrator capable of managing AI clusters with thousands of GPUs at a granularity worthy of a classic HPC scheduler, while keeping the cloud-native philosophy intact. To understand the impact on your pipelines, see our Kubernetes in production guide.
What exactly is Dynamic Resource Allocation?
DRA introduces three new native Kubernetes resources:
- DeviceClass: defines a type of hardware resource (e.g. "NVIDIA H100 80GB GPU") with its parameters and associated driver
- ResourceClaim: a request to allocate a resource, similar to a PersistentVolumeClaim for storage
- ResourceClaimTemplate: a template to automatically create per-pod claims in a Deployment or StatefulSet
Unlike device plugins where the kubelet handles everything locally, DRA works through a driver-based model:
┌─────────────────────┐
│   Pod with claim    │
│   resourceClaims:   │
│   - name: my-gpu    │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│    ResourceClaim    │
│  deviceClassName:   │
│   gpu.nvidia.com    │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│   DRA Controller    │
│ (nvidia-dra-driver) │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│   Kubelet Plugin    │
│   (GPU setup on     │
│    the node)        │
└─────────────────────┘
The Kubernetes scheduler treats claims as placement constraints. If a pod requires a claim that needs an H100 GPU, the scheduler will only place that pod on nodes equipped with this hardware. Once the pod is scheduled, the DRA controller communicates with the kubelet to configure access to the GPU (device mapping, cgroups, permissions).
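The placement rule can be illustrated with a small Python sketch. It is a simplified model, not the scheduler's actual code: the `devices_by_node` dictionary and its `model` and `allocated` keys are assumptions made for the example.

```python
def feasible_nodes(devices_by_node, model, count):
    """Return the nodes that hold at least `count` free devices of `model`."""
    result = []
    for node, devices in sorted(devices_by_node.items()):
        free = [
            d for d in devices
            if d["model"] == model and not d["allocated"]
        ]
        if len(free) >= count:
            result.append(node)
    return result


# Hypothetical cluster inventory, keyed by node name.
cluster = {
    "node-1": [{"model": "A100", "allocated": False}],
    "node-2": [
        {"model": "H100", "allocated": True},
        {"model": "H100", "allocated": False},
    ],
    "node-3": [
        {"model": "H100", "allocated": False},
        {"model": "H100", "allocated": False},
    ],
}
```

A claim for one H100 can land on `node-2` or `node-3`; a claim for two H100s leaves only `node-3`, because one of the GPUs on `node-2` is already allocated.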
What's new in Kubernetes 1.35: beta features for GPUs
BindingConditions: handling hardware asynchronicity
Some enterprise GPUs require an initialization phase after allocation: establishing NVLink connections between GPUs, configuring GPU fabric managers, or firmware warm-up. Without this feature, the pod was bound to its node and started immediately after allocation, which caused crashes if the GPU wasn't ready.
BindingConditions let the DRA driver signal "allocation succeeded, but GPU not ready yet". The scheduler then holds off binding the pod to its node until the driver reports the conditions as met, so the kubelet only launches the containers once the device is usable. This dramatically improves reliability on AI workloads with high-end GPUs (NVIDIA H100, AMD MI300) that may require 10-30 seconds of setup.
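The decision logic behind this can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the condition names (`FabricReady`, `FabricFailed`) are hypothetical, and the three outcomes are labels for the example rather than values from the Kubernetes API.

```python
def binding_decision(reported, required, failure):
    """Decide what to do with a pod whose device needs setup time.

    reported: dict mapping condition name -> bool, as reported by the driver
    required: condition names that must all be true before the pod starts
    failure:  condition names that, if true, mean device setup failed
    """
    if any(reported.get(name, False) for name in failure):
        return "reallocate"  # give up on this device and try another
    if all(reported.get(name, False) for name in required):
        return "bind"        # device is ready, the pod can start
    return "wait"            # allocation succeeded, setup still running


required = ["FabricReady"]   # hypothetical condition name
failure = ["FabricFailed"]   # hypothetical condition name
```

Checking failure conditions first means a device that reports both "ready" and "failed" is treated as failed, which is the conservative choice for long training runs.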
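Prioritized alternatives: automatic fallback
Imagine a training pod that prefers 8x H100 with NVLink, but can run on 8x A100 or even 16x V100. With DRA in beta, you can express this preference as an ordered list of sub-requests:
# Fragment of a ResourceClaim spec (spec.devices)
# The "model" attribute is illustrative: attribute names depend on the driver
requests:
- name: training-gpus
  firstAvailable: # Tried in list order, most preferred first
  - name: h100
    deviceClassName: gpu.nvidia.com
    count: 8 # allocationMode defaults to ExactCount
    selectors:
    - cel:
        expression: device.attributes["gpu.nvidia.com"].model == "H100"
  - name: a100
    deviceClassName: gpu.nvidia.com
    count: 8
    selectors:
    - cel:
        expression: device.attributes["gpu.nvidia.com"].model == "A100"
  - name: v100
    deviceClassName: gpu.nvidia.com
    count: 16 # Fallback with more GPUs
    selectors:
    - cel:
        expression: device.attributes["gpu.nvidia.com"].model == "V100"
The scheduler first tries the first entry in the list. If no node satisfies that sub-request, it tries the second, and so on. This optimizes cluster utilization by preventing pods from sitting in Pending while alternative GPUs are available.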
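The fallback behaviour amounts to a first-match search over the alternatives in preference order. A minimal Python sketch, assuming a hypothetical `free_by_model` dictionary that counts free GPUs per model across the cluster (real scheduling works per node and per device):

```python
def pick_alternative(alternatives, free_by_model):
    """Return the first alternative, in preference order, that can be satisfied."""
    for alt in alternatives:
        if free_by_model.get(alt["model"], 0) >= alt["count"]:
            return alt
    return None  # nothing fits: the pod stays Pending


# Most preferred option first, mirroring the training example in the text.
alternatives = [
    {"model": "H100", "count": 8},
    {"model": "A100", "count": 8},
    {"model": "V100", "count": 16},
]
```

With only four H100s free but eight A100s available, the search skips the first option and settles on the A100 alternative instead of leaving the pod Pending.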
Device Taints and Tolerations
Borrowed from the node taints/tolerations mechanism, this feature makes it possible to mark individual GPUs. Use cases:
- GPU degradation detected by monitoring (high temperature, ECC errors) → taint "degraded=true:NoSchedule"
- Reserving GPUs for a specific team → taint "team=ml-research:NoSchedule"
- GPU with specific firmware required for certain workloads
Pods must then include an explicit toleration to use these tainted GPUs, offering fine-grained control at the device level, not just at the node level.
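The matching rule can be sketched in Python. The sketch assumes device taints follow the same semantics as node taints (`Equal` and `Exists` operators, optional effect), which is how the feature is described above; the dictionary layout is an illustration, not the API schema.

```python
def tolerates(toleration, taint):
    """Return True if a single toleration covers a single taint."""
    # An omitted effect on the toleration matches every effect.
    if toleration.get("effect") not in (None, taint["effect"]):
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # Exists ignores the value; an omitted key matches every key.
        return toleration.get("key") in (None, taint["key"])
    return (
        toleration.get("key") == taint["key"]
        and toleration.get("value") == taint.get("value")
    )


def device_usable(taints, tolerations):
    """A device is usable only if every one of its taints is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in taints
    )


degraded = {"key": "degraded", "value": "true", "effect": "NoSchedule"}
```

A pod with no tolerations is kept off the degraded GPU, while a pod that tolerates `degraded=true` (or uses `Exists` on the `degraded` key) can still use it.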
Partitionable Devices: MIG and virtual GPUs
NVIDIA Multi-Instance GPU (MIG) makes it possible to split an A100 or H100 into up to 7 isolated instances. The DRA alpha in Kubernetes 1.35 exposes this capability natively. A DeviceClass can define "partitions":
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: nvidia-h100-mig-3g.40gb
spec:
  selectors:
  - cel:
      expression: device.driver == "gpu.nvidia.com"
  config:
  - opaque:
      driver: gpu.nvidia.com
      parameters:
        apiVersion: gpu.nvidia.com/v1alpha1
        kind: MIGDeviceClaimParameters
        profile: 3g.40gb # MIG profile: 3 GPU slices, 40 GB VRAM
        sharing:
          strategy: TimeSlicing # Or MPS for fine-grained sharing
The smallest MIG profile (1g.10gb on an H100 80GB) makes it possible to run 7 inference pods on a single physical GPU, each with its own isolated MIG slice; the larger 3g.40gb profile shown above fits two instances per GPU. Ideal for maximizing GPU utilization on lightweight inference workloads. For deployment context, see our article on Docker in production.
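How many isolated instances fit on one GPU depends on the MIG profile. The Python sketch below is a rough capacity model, assuming a GPU that exposes 7 compute slices and 8 memory slices (the A100/H100 layout). It ignores NVIDIA's placement rules for mixed profiles, so treat it as an estimate for homogeneous layouts only.

```python
import re

COMPUTE_SLICES = 7  # compute slices exposed by MIG on A100/H100
MEMORY_SLICES = 8   # memory slices on the same GPUs


def max_instances(profile, total_memory_gb=80):
    """Estimate how many instances of one MIG profile fit on a single GPU."""
    match = re.fullmatch(r"(\d+)g\.(\d+)gb", profile)
    if match is None:
        raise ValueError(f"unrecognized MIG profile: {profile!r}")
    compute = int(match.group(1))
    memory_gb = int(match.group(2))
    slice_gb = total_memory_gb // MEMORY_SLICES  # 10 GB on an 80 GB card
    memory = -(-memory_gb // slice_gb)           # ceiling division
    # The scarcer of the two resources limits the instance count.
    return min(COMPUTE_SLICES // compute, MEMORY_SLICES // memory)
```

For example, `max_instances("1g.10gb")` yields 7 on an 80 GB card, while `max_instances("3g.40gb")` yields 2, because each instance takes three of the seven compute slices and half of the memory.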
DRA configuration for GPUs: a practical example
Let's deploy a Kubernetes 1.35 cluster with the NVIDIA DRA driver to manage H100 GPUs in full DRA mode.
1. Infrastructure prerequisites
# Nodes with NVIDIA GPUs + drivers installed
# Compatible container runtime (containerd 1.7+, cri-o 1.28+)
# Kubernetes 1.35+ (DynamicResourceAllocation is enabled by default since the core APIs went GA in 1.34)
# Check DRA support
kubectl api-resources | grep "resource.k8s.io"
# Should display: deviceclasses, resourceclaims, resourceclaimtemplates, resourceslices
2. Installing the NVIDIA DRA driver
# Install the NVIDIA GPU operator
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-operator/v24.9.0/deployments/gpu-operator.yaml
# Deploy the DRA driver (replaces nvidia-device-plugin)
helm repo add nvidia https://nvidia.github.io/k8s-dra-driver
helm repo update
helm install nvidia-dra-driver nvidia/k8s-dra-driver \
  --namespace nvidia-dra-system \
  --create-namespace \
  --set enableDRA=true \
  --set enableDevicePlugin=false # Disables the legacy mode
3. Create a DeviceClass for H100
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: nvidia-h100-80gb
spec:
  selectors:
  - cel:
      expression: 'device.driver == "gpu.nvidia.com" && device.attributes["gpu.nvidia.com"].model == "H100"'
  config:
  - opaque:
      driver: gpu.nvidia.com
      parameters:
        apiVersion: gpu.nvidia.com/v1alpha1
        kind: GpuClaimParameters
        sharing:
          strategy: None # Exclusive GPU, no sharing
        memory: "80Gi" # Minimum VRAM
        topology:
          nvlink: "required" # Requires NVLink between GPUs if count > 1
4. ResourceClaimTemplate for an ML Deployment
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: training-gpu-claim
  namespace: ml-team
spec:
  spec:
    devices:
      requests:
      - name: gpus
        exactly:
          deviceClassName: nvidia-h100-80gb
          allocationMode: ExactCount
          count: 2 # 2 GPUs per pod
      constraints:
      # The 2 GPUs must be in the same NVLink domain
      - requests: ["gpus"]
        matchAttribute: "gpu.nvidia.com/nvlink-domain"
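The effect of the same-NVLink-domain constraint can be illustrated with a short Python sketch: among the candidate devices, find a group of the requested size whose members all carry the same value for one attribute. The device dictionaries are hypothetical, and a real allocator is more elaborate than this brute-force search.

```python
from itertools import combinations


def pick_matching(devices, count, attribute):
    """Return the first `count` devices that share one value for `attribute`."""
    for group in combinations(devices, count):
        values = {d["attributes"].get(attribute) for d in group}
        # Exactly one distinct value, and the attribute must be present.
        if len(values) == 1 and None not in values:
            return list(group)
    return None


# Hypothetical devices on one node; attribute name taken from the text.
node_devices = [
    {"name": "gpu-0", "attributes": {"gpu.nvidia.com/nvlink-domain": "a"}},
    {"name": "gpu-1", "attributes": {"gpu.nvidia.com/nvlink-domain": "b"}},
    {"name": "gpu-2", "attributes": {"gpu.nvidia.com/nvlink-domain": "a"}},
]
```

Here a request for two GPUs skips the `gpu-0`/`gpu-1` pair, which spans two domains, and returns `gpu-0` with `gpu-2`.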
5. Use the claim in a PyTorch Pod
apiVersion: v1
kind: Pod
metadata:
  name: pytorch-distributed-training
  namespace: ml-team
spec:
  resourceClaims:
  - name: gpus
    resourceClaimTemplateName: training-gpu-claim
  containers:
  - name: trainer
    image: pytorch/pytorch:2.5.0-cuda12.4-cudnn9-runtime
    command: ["python", "train.py"]
    # No NVIDIA_VISIBLE_DEVICES needed: the DRA driver injects the allocated GPUs
    resources:
      claims:
      - name: gpus # Reference to the resourceClaim
      limits:
        memory: "64Gi"
        cpu: "16"
  restartPolicy: OnFailure
Once deployed, check the allocation:
# Claim status
kubectl get resourceclaim -n ml-team
# Claims generated from a template are named <pod>-<claim>-<suffix>
kubectl describe resourceclaim -n ml-team pytorch-distributed-training-gpus-xxxxx
# DRA controller logs
kubectl logs -n nvidia-dra-system deploy/nvidia-dra-controller
# Allocated devices visible inside the pod
kubectl exec -it -n ml-team pytorch-distributed-training -- nvidia-smi
The scheduler automatically placed the pod on a node with 2x H100 connected over NVLink, in line with the claim's constraints. To manage the deployment in production, refer to our Terraform for Kubernetes guide.
Impact on ML pipelines: from development to production
Stronger isolation and multi-tenancy
With DRA, each ML team can have its own DeviceClasses with quotas and constraints. An "ml-research" namespace can allow claims of up to 8 GPUs, while "ml-production" can go up to 64 GPUs for LLM training. Kubernetes ResourceQuotas apply to claims:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ml-research-quota
  namespace: ml-research
spec:
  hard:
    count/resourceclaims.resource.k8s.io: "10" # Max 10 active claims
    nvidia-h100-80gb.deviceclass.resource.k8s.io/devices: "16" # Max 16 H100 GPUs total
Smart scheduling for heterogeneous workloads
A typical AI cluster mixes several types of workloads:
- Model training: 8-64 GPUs, high NVLink bandwidth, 12-72h duration
- Fine-tuning: 1-4 GPUs, less demanding inter-GPU, 2-8h duration
- Batch inference: 1-2 GPUs, can tolerate MIG or shared GPUs
- Real-time inference: 1 MIG GPU slice, latency-critical
DRA makes it possible to define DeviceClasses tailored to each use case. The scheduler can bin-pack efficiently: MIG inference pods on partially used nodes, and reserve entire nodes for distributed training. This significantly improves the GPU utilization rate compared to classic device plugins, according to NVIDIA benchmarks.
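One way to picture the bin-packing is a best-fit heuristic: place the largest requests first, and put each one on the GPU with the least remaining room that still fits it. The Python sketch below is an illustration of that heuristic, not the scheduler's actual algorithm; sizes are expressed in MIG compute slices, with 7 slices per GPU assumed.

```python
def best_fit(requests, gpus, capacity=7):
    """Map each request name to a GPU name, or None if nothing fits.

    requests: dict of request name -> size in compute slices
    gpus:     list of GPU names, each starting with `capacity` free slices
    """
    free = {gpu: capacity for gpu in gpus}
    placement = {}
    # Largest first, so whole-GPU jobs claim empty devices early.
    for name, size in sorted(requests.items(), key=lambda item: -item[1]):
        candidates = [g for g in gpus if free[g] >= size]
        if not candidates:
            placement[name] = None
            continue
        # Best fit: the tightest GPU that can still hold the request.
        target = min(candidates, key=lambda g: free[g])
        free[target] -= size
        placement[name] = target
    return placement
```

With two GPUs, a 7-slice training job and three inference jobs of 3, 2 and 1 slices, the training job takes one GPU to itself and the inference jobs are packed together on the other.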
Lifecycle management: draining and maintenance
Draining a GPU node in production becomes cleaner. With device plugins, a kubectl drain abruptly evicted pods. With DRA, the controller can:
- Mark the node's GPUs with a taint "maintenance=true:NoSchedule"
- Wait for in-progress training pods to checkpoint (via PreStop hooks)
- Progressively release the claims
- Signal to the node that it can be taken offline
This approach reduces lost work on multi-day training runs. For update strategies, see our Nginx production optimizations, which share similar patterns.
GPU monitoring with Prometheus: DRA-aware metrics
DRA exposes new metrics via the OpenTelemetry and Prometheus standards. NVIDIA DCGM (Data Center GPU Manager) integrates natively with the DRA driver.
Deploying the monitoring stack
# Install DCGM exporter with DRA support
# (separate repo name: "nvidia" already points at the DRA driver chart repository)
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring \
  --set serviceMonitor.enabled=true \
  --set 'arguments={--dra-enabled,--collect-interval=5000}'
# Prometheus automatically discovers the exporters via ServiceMonitor
# Check the targets
kubectl port-forward -n monitoring svc/prometheus 9090:9090
# http://localhost:9090/targets → dcgm-exporter endpoints
Key DRA metrics
The exposed metrics include:
# GPU utilization per claim
DCGM_FI_DEV_GPU_UTIL{resourceclaim="training-gpu-claim-xxxxx", namespace="ml-team", pod="pytorch-training-0", gpu="0"}
# VRAM used vs allocated per claim
DCGM_FI_DEV_FB_USED{resourceclaim="...", gpu="0"}
DCGM_FI_DEV_FB_RESERVED{resourceclaim="...", gpu="0"}
# NVLink throughput between GPUs of the same claim
DCGM_FI_PROF_NVLINK_TX_BYTES{resourceclaim="...", gpu_pair="0-1"}
# DRA constraint violations (e.g. GPU temperature > threshold)
dra_constraint_violations_total{deviceclass="nvidia-h100-80gb", constraint="temperature"}
# DRA allocation time (BindingConditions)
dra_allocation_duration_seconds{deviceclass="nvidia-h100-80gb", phase="binding"}
Grafana dashboard for DRA
Create a dashboard centered on claims rather than nodes:
# Panel: GPU utilization per ResourceClaim
avg by (resourceclaim, namespace) (DCGM_FI_DEV_GPU_UTIL)
# Panel: Allocation efficiency (GPUs used / GPUs allocated in the cluster)
count(DCGM_FI_DEV_GPU_UTIL > 10) / sum(kube_node_status_allocatable{resource="nvidia_com_gpu"}) * 100
# Panel: 95th-percentile DRA allocation duration
histogram_quantile(0.95, sum(rate(dra_allocation_duration_seconds_bucket[5m])) by (le, deviceclass))
# Alert: Claim pending too long (rule file entry, Prometheus 2.x syntax)
- alert: DRAClaimPendingTooLong
  expr: kube_resourceclaim_status_allocation_state{state="pending"} == 1 # Assumes a 0/1 state gauge
  for: 5m
  annotations:
    summary: "ResourceClaim {{ $labels.resourceclaim }} pending > 5min"
    description: "Possible shortage of {{ $labels.deviceclass }} GPUs in the cluster"
For a complete Kubernetes monitoring setup, see our Prometheus and Grafana in production guide and the article on Linux metrics in production.
AI workload observability
Coupled with tools like MLflow or Weights & Biases, DRA monitoring makes it possible to correlate ML performance and GPU utilization:
# Example: trace GPU utilization against training loss
# In the PyTorch code, log DRA metrics
import os

import prometheus_client as prom
import wandb

gpu_util_gauge = prom.Gauge('training_gpu_util', 'GPU utilization during training', ['epoch', 'claim'])

for epoch in range(num_epochs):
    # ... training loop, which produces `loss` ...

    # Retrieve the claim's GPU metrics
    gpu_util = get_claim_gpu_metrics()  # Helper to write yourself: query the Prometheus API or DCGM
    gpu_util_gauge.labels(epoch=str(epoch), claim=os.getenv('CLAIM_NAME', 'unknown')).set(gpu_util)

    # Log training loss alongside GPU utilization
    wandb.log({'loss': loss, 'gpu_util': gpu_util, 'epoch': epoch})
This reveals valuable insights: if GPU util drops to 30% during certain training steps, it points to an I/O or CPU bottleneck that wastes expensive GPU time.
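Spotting those drops can be automated with a few lines of Python. The sketch below scans a series of utilization samples and reports sustained low-utilization windows. The 30% threshold comes from the example above; the `min_run` parameter (how many consecutive samples count as "sustained") is an assumption to tune for your sampling interval.

```python
def low_utilization_runs(samples, threshold=30.0, min_run=3):
    """Return (start, end) index pairs where utilization stays at or below
    `threshold` for at least `min_run` consecutive samples."""
    runs = []
    start = None
    for i, value in enumerate(samples):
        if value <= threshold:
            if start is None:
                start = i  # a low-utilization window begins
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    # Close a window that is still open at the end of the series.
    if start is not None and len(samples) - start >= min_run:
        runs.append((start, len(samples) - 1))
    return runs


# Hypothetical per-step GPU utilization, in percent.
utilization = [95, 92, 25, 20, 28, 90, 10, 94, 15, 12, 18]
```

On this series, the isolated dip at index 6 is ignored, while the two three-sample windows are reported and can be lined up against data-loading or checkpointing steps.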
Migrating from device plugins: a progressive strategy
Migrating a production GPU cluster to DRA requires a progressive approach to avoid disruptions.
Phase 1: Dual-stack (weeks 1-2)
# Keep nvidia-device-plugin active
kubectl get daemonset -n kube-system nvidia-device-plugin-daemonset
# Install the DRA driver in parallel
helm install nvidia-dra-driver nvidia/k8s-dra-driver \
  --set coexistWithDevicePlugin=true
# Existing pods keep using resources.limits.nvidia.com/gpu
# New workloads test DRA on a dedicated DeviceClass
kubectl apply -f deviceclass-test.yaml
Phase 2: Migration per namespace (weeks 3-6)
Migrate namespace by namespace, starting with dev/staging:
# Namespace ml-dev: forbid the legacy mode
kubectl label namespace ml-dev gpu-mode=dra-only
# An AdmissionWebhook rejects pods with resources.limits.nvidia.com/gpu in this namespace
# Forces the use of resourceClaims
# Update the Deployments/StatefulSets
# Before (container level):
resources:
  limits:
    nvidia.com/gpu: 2
# After (pod spec level):
resourceClaims:
- name: gpus
  resourceClaimTemplateName: ml-dev-gpu-claim
# After (container level):
resources:
  claims:
  - name: gpus
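For clusters with many manifests, the before/after rewrite can be scripted. The Python sketch below converts a pod spec held as a plain dictionary (for instance loaded from YAML). It is a minimal illustration: the function name `to_dra` is hypothetical, the GPU count is assumed to live in the referenced ResourceClaimTemplate, and all GPU containers in a pod end up sharing one claim.

```python
import copy

LEGACY_RESOURCE = "nvidia.com/gpu"


def to_dra(pod_spec, template_name, claim_name="gpus"):
    """Return a copy of `pod_spec` that uses a ResourceClaimTemplate
    instead of device-plugin GPU limits."""
    spec = copy.deepcopy(pod_spec)  # never mutate the caller's manifest
    converted = False
    for container in spec.get("containers", []):
        resources = container.get("resources", {})
        limits = resources.get("limits", {})
        if LEGACY_RESOURCE not in limits:
            continue
        del limits[LEGACY_RESOURCE]
        if not limits:
            del resources["limits"]
        resources.get("requests", {}).pop(LEGACY_RESOURCE, None)
        resources.setdefault("claims", []).append({"name": claim_name})
        converted = True
    if converted:
        spec.setdefault("resourceClaims", []).append(
            {"name": claim_name, "resourceClaimTemplateName": template_name}
        )
    return spec


legacy = {
    "containers": [
        {
            "name": "trainer",
            "resources": {"limits": {"nvidia.com/gpu": 2, "memory": "64Gi"}},
        }
    ]
}
migrated = to_dra(legacy, "ml-dev-gpu-claim")
```

Pods without GPU limits pass through unchanged, so the function can be applied to every workload in a namespace during the migration.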
Phase 3: Disabling the device plugin (week 7+)
Once all namespaces are migrated:
# Verify that no pod uses the legacy mode
kubectl get pods --all-namespaces -o json | \
  jq -r '.items[] | select(any(.spec.containers[]; .resources.limits["nvidia.com/gpu"] != null)) | "\(.metadata.namespace)/\(.metadata.name)"'
# If empty, disable the device plugin
kubectl delete daemonset -n kube-system nvidia-device-plugin-daemonset
# No feature gate to remove: DevicePlugins graduated to GA in Kubernetes 1.26 and the gate
# no longer exists, so the kubelet would reject --feature-gates=DevicePlugins=false
Rollback plan
Keep a plan B in case of critical regression:
- etcd snapshot before each phase
- Keep the device plugin manifests in Git
- Automatic rollback script if pod error rate > threshold
- Canary deployments: 10% of pods on DRA, monitor 48h before scaling up
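The canary split and the rollback trigger from the list above can be sketched in Python. The 10% fraction comes from the list; the 5% error-rate threshold is an assumption chosen for the example, since the right value depends on your workloads.

```python
def canary_split(pods, fraction=0.10):
    """Split pods into (canary, rest), with at least one canary pod
    whenever the list is not empty."""
    size = max(1, round(len(pods) * fraction)) if pods else 0
    return pods[:size], pods[size:]


def should_rollback(failed, total, threshold=0.05):
    """Trigger a rollback when the pod error rate exceeds `threshold`."""
    return total > 0 and failed / total > threshold


all_pods = [f"pod-{i}" for i in range(20)]
canary, rest = canary_split(all_pods)
```

With 20 pods, two go to DRA first. If three of the 20 then fail, the 15% error rate exceeds the threshold and the rollback script should restore the device plugin manifests kept in Git.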
DRA limitations and roadmap
While Kubernetes 1.35 brings major advances, some limitations remain:
- No live migration: a pod cannot migrate its claim to another node without a restart
- No GPU snapshots: it's impossible to checkpoint a GPU's VRAM state for migration
- Limited dynamic quotas: quotas remain static, with no auto-scaling of GPU quotas
- No multi-cluster DRA: no federation of claims across Kubernetes clusters
The Kubernetes 1.36-1.37 roadmap plans:
- DRA for other resources: networking (SmartNIC, RDMA), storage (NVMe-oF), memory (CXL)
- Integration with the Cluster Autoscaler: automatically provision GPU nodes based on pending claims
- Cost-aware scheduling: pick the cheapest GPU that satisfies the constraints (multi-provider cloud)
- Native GPU time-slicing: fine-grained temporal sharing without MIG, for consumer GPUs
To follow the development, see KEP 4381 and SIG Node.
Conclusion: Kubernetes, the infrastructure of AI factories
Kubernetes 1.35 confirms the transformation of K8s from a general-purpose container orchestrator into the reference platform for distributed AI. Moving GPU DRA to beta removes a major friction point for ML teams: manually managing GPU allocation via custom scripts or legacy HPC systems like SLURM.
DRA's declarative approach aligns perfectly with the GitOps philosophy: all GPU configuration (DeviceClasses, quotas, constraints) is versioned in YAML, auditable, and reproducible. Coupled with tools like ArgoCD or Flux, you can deploy complete ML environments in minutes.
For organizations running GPU clusters at scale (100+ GPUs), the gains are tangible:
- Significant improvement in GPU utilization thanks to smart scheduling and MIG
- Reduced allocation debugging time with DRA observability
- Lower cloud costs through optimized GPU bin-packing
- Faster time-to-production for new ML projects
If you're starting a new GPU cluster, adopt DRA right away. If you have an existing cluster with device plugins, plan the migration for Q2 2026 to take advantage of the stabilized beta features. Kubernetes 1.36 (expected on April 22, 2026) should see several of these features go stable.
The era of "AI factories" is only just beginning, and Kubernetes is establishing itself as the operating system of this revolution. To dig deeper into containerization, explore our resources on Docker and our production lessons learned.
FAQ: Dynamic Resource Allocation and GPUs in Kubernetes
What's the difference between DRA and device plugins for GPUs?
Device plugins only allow you to request a quantity of GPUs (1, 2, 4...) with no additional constraints. DRA (Dynamic Resource Allocation) introduces structured parameters that make it possible to specify complex constraints: NVLink topology, minimum VRAM, MIG profiles, affinity with other resources. DRA also moves the logic out of the Kubernetes core and into external drivers, offering greater flexibility and extensibility.
Should I migrate my GPU workloads to DRA immediately?
No, a progressive migration is recommended. Device plugins keep working in Kubernetes 1.35. Start by installing the DRA driver in dual-stack mode, test on non-critical namespaces, then migrate progressively. New GPU clusters should adopt DRA right away, while existing clusters can plan the migration over 6-8 weeks.
Does DRA work with GPUs other than NVIDIA?
Yes, DRA is vendor-agnostic. AMD offers a DRA driver for its MI300 GPUs, Intel for its Data Center Max GPUs. The DRA framework can also manage other accelerators: Google TPUs, Habana Gaudi, Xilinx FPGAs. Each vendor provides its own driver that implements Kubernetes' standard DRA API.
How do I monitor GPUs with DRA in Prometheus?
Use the NVIDIA DCGM Exporter with the --dra-enabled flag. The exposed metrics include GPU utilization per ResourceClaim, VRAM allocated vs used, NVLink throughput, and DRA allocation durations. Create Grafana dashboards centered on claims rather than nodes for per-ML-workload visibility. Prometheus ServiceMonitors automatically discover the DCGM exporters.
Can a GPU be shared across several pods with DRA?
Yes, through several mechanisms. MIG (Multi-Instance GPU) on NVIDIA A100/H100 splits a GPU into isolated slices. Time-slicing enables temporal sharing for GPUs without MIG. ResourceClaims can also be shared across pods (shared mode instead of exclusive) for lightweight inference workloads. The sharing strategy is configured in the DeviceClass.
What are the current limitations of DRA in Kubernetes 1.35?
DRA does not yet support live migration of GPU pods, VRAM state snapshots, or dynamic auto-scaling of GPU quotas. Multi-cluster federation of ResourceClaims doesn't exist. Some features such as Partitionable Devices, Device Taints and Consumable Capacity remain in alpha in Kubernetes 1.35. Kubernetes 1.36-1.37 should stabilize these features and add integration with the Cluster Autoscaler to automatically provision GPU nodes.