DevOps 13/02/2026 15 min read

Kubernetes 1.35: Dynamic Resource Allocation goes beta for GPUs

Kubernetes 1.35 moves DRA to beta for GPUs, marking K8s' pivot toward AI factories. Configuration, migration and Prometheus monitoring.

Kubernetes 1.35 "Timbernetes" marks a strategic turning point in the cloud-native ecosystem: Dynamic Resource Allocation (DRA) for GPUs goes beta, cementing Kubernetes' position as the reference platform for large-scale artificial intelligence workloads. With the explosion of "AI factories" and distributed machine learning pipelines, fine-grained management of hardware accelerators is becoming critical.

As traditional device plugins show their limits in terms of flexibility and granularity, DRA offers a declarative, extensible approach for orchestrating not only NVIDIA GPUs but also TPUs, FPGAs and other specialized accelerators. This technical evolution answers a major operational need: how do you efficiently share expensive hardware resources across dozens of training and inference pods while guaranteeing isolation and performance?

Kubernetes and the AI era: why DRA changes the game

Since Kubernetes 1.8, device plugins have made it possible to manage GPUs as countable resources. The problem: this approach stays rigid. A pod requests "1 GPU" or "2 GPUs", with no way to specify complex topological constraints such as "two GPUs sharing the same NVLink" or "a GPU with at least 40 GB of VRAM on a node with InfiniBand".

Dynamic Resource Allocation solves this by moving resource management logic out of the Kubernetes core and into external drivers. Instead of a simple quantity, DRA uses structured parameters that allow sophisticated requests to be expressed. Kubernetes 1.34 moved DRA to GA (General Availability) for the core APIs, and version 1.35 adds critical features:

BindingConditions (beta): handling of GPUs that require setup time (fabric connection, firmware initialization)
Prioritised alternatives (beta): define several acceptable GPU configurations in order of preference
Device Taints and Tolerations (alpha): mark certain GPUs as reserved or degraded
Partitionable Devices (alpha): split a GPU into slices (MIG on NVIDIA A100/H100)
Consumable Capacity (alpha): tracking of consumable resources such as PCIe bandwidth

These features turn Kubernetes into an orchestrator capable of managing AI clusters with thousands of GPUs at a granularity worthy of a classic HPC scheduler, while keeping the cloud-native philosophy intact. To understand the impact on your pipelines, see our Kubernetes in production guide.

What exactly is Dynamic Resource Allocation?

DRA introduces four native Kubernetes resources:

DeviceClass: defines a type of hardware resource (e.g. "NVIDIA H100 80GB GPU") with its parameters and associated driver
ResourceClaim: a request to allocate a resource, similar to a PersistentVolumeClaim for storage
ResourceClaimTemplate: a template to automatically create per-pod claims in a Deployment or StatefulSet
ResourceSlice: published by each driver to advertise the devices available on a node, with their attributes and capacity

Unlike device plugins where the kubelet handles everything locally, DRA works through a driver-based model:

┌─────────────────────┐
│   Pod with claim    │
│  resourceClaims:    │
│   - name: my-gpu    │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  ResourceClaim      │
│  deviceClassName:   │
│   gpu.nvidia.com    │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│   kube-scheduler    │
│ (allocates devices) │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Kubelet Plugin     │
│  (GPU setup on      │
│   the node)         │
└─────────────────────┘

The Kubernetes scheduler treats claims as placement constraints. If a pod requires a claim that needs an H100 GPU, the scheduler will only place that pod on nodes equipped with this hardware. Each DRA driver publishes the devices it manages as ResourceSlices; the scheduler picks matching devices from them and records the allocation in the claim's status. Once the pod is scheduled, the kubelet calls the driver's kubelet plugin on the node to configure access to the GPU (device mapping, cgroups, permissions).
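The placement rule described above can be sketched as a simple filter. This is a toy model rather than the scheduler's actual code: `Device`, `Node` and `feasible_nodes` are hypothetical names, and real devices carry many more attributes than a model string.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Device:
    """A device as a driver might advertise it (hypothetical, simplified)."""
    name: str
    model: str
    allocated: bool = False


@dataclass
class Node:
    name: str
    devices: list = field(default_factory=list)


def feasible_nodes(nodes, model, count):
    """Return names of nodes with at least `count` free devices of `model`."""
    result = []
    for node in nodes:
        free = [d for d in node.devices if d.model == model and not d.allocated]
        if len(free) >= count:
            result.append(node.name)
    return result


nodes = [
    Node("node-a", [Device("gpu0", "H100"), Device("gpu1", "H100")]),
    Node("node-b", [Device("gpu0", "A100"), Device("gpu1", "A100")]),
    Node("node-c", [Device("gpu0", "H100", allocated=True), Device("gpu1", "H100")]),
]
print(feasible_nodes(nodes, "H100", 2))  # ['node-a']
```

Only node-a qualifies here: node-c has an H100 free, but its second one is already allocated to another claim.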

What's new in Kubernetes 1.35: beta features for GPUs

BindingConditions: handling hardware asynchronicity

Some enterprise GPUs require an initialization phase after allocation: establishing NVLink connections between GPUs, configuring GPU fabric managers, or firmware warm-up. Without BindingConditions, a pod could be bound to its node and started as soon as the allocation succeeded, which caused crashes if the GPU wasn't ready.

BindingConditions let the DRA driver signal "allocation succeeded, but GPU not ready yet". The scheduler then holds off binding the pod to the node until the driver reports the conditions as met, so containers never start against a half-initialized device. This improves reliability on AI workloads with high-end GPUs (NVIDIA H100, AMD MI300) that may require 10-30 seconds of setup.
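A minimal sketch of the decision this mechanism implies, assuming each device lists the conditions that must be true before use and the conditions that signal a failed setup. The function and condition names (`binding_decision`, `FabricAttached`, `AttachFailed`) are illustrative and not taken from the Kubernetes API.

```python
def binding_decision(binding_conditions, failure_conditions, device_conditions):
    """Return 'bind', 'wait' or 'reallocate' for one allocated device.

    binding_conditions: condition names that must all be "True" before binding.
    failure_conditions: condition names that mean setup failed if any is "True".
    device_conditions: dict of condition name -> status reported by the driver.
    """
    if any(device_conditions.get(c) == "True" for c in failure_conditions):
        return "reallocate"
    if all(device_conditions.get(c) == "True" for c in binding_conditions):
        return "bind"
    return "wait"


required = ["FabricAttached"]
failures = ["AttachFailed"]
print(binding_decision(required, failures, {}))                          # wait
print(binding_decision(required, failures, {"FabricAttached": "True"}))  # bind
print(binding_decision(required, failures, {"AttachFailed": "True"}))    # reallocate
```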

Prioritised alternatives: automatic fallback

Imagine a training pod that prefers 8x H100 with NVLink, but can run on 8x A100 or even 16x V100. With DRA in beta, you can express this preference:

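# Fragment of a ResourceClaim spec; the "model" attribute name is illustrative
devices:
  requests:
    - name: training-gpus
      firstAvailable:  # Sub-requests are tried in list order
        - name: h100
          deviceClassName: gpu.nvidia.com
          count: 8
          selectors:
            - cel:
                expression: 'device.attributes["gpu.nvidia.com"].model == "H100"'
        - name: a100
          deviceClassName: gpu.nvidia.com
          count: 8
          selectors:
            - cel:
                expression: 'device.attributes["gpu.nvidia.com"].model == "A100"'
        - name: v100  # Fallback with more GPUs
          deviceClassName: gpu.nvidia.com
          count: 16
          selectors:
            - cel:
                expression: 'device.attributes["gpu.nvidia.com"].model == "V100"'

The scheduler tries the options in order of preference: 8x H100 first, then 8x A100 if no node can satisfy the first option, and finally 16x V100. This optimizes cluster utilization by preventing pods from sitting in Pending while alternative GPUs are available.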
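The fallback behaviour amounts to walking an ordered list and keeping the first option that fits. A minimal sketch, assuming a per-node count of free GPUs by model (`pick_first_available` and its inputs are hypothetical):

```python
def pick_first_available(alternatives, free_by_model):
    """Return the first (model, count) that can be satisfied, else None.

    alternatives: list of (model, count) in order of preference.
    free_by_model: dict mapping a GPU model to the number of free GPUs on a node.
    """
    for model, count in alternatives:
        if free_by_model.get(model, 0) >= count:
            return model, count
    return None


preferences = [("H100", 8), ("A100", 8), ("V100", 16)]
print(pick_first_available(preferences, {"H100": 4, "A100": 8}))  # ('A100', 8)
print(pick_first_available(preferences, {"V100": 8}))             # None
```

In the second call the pod would stay Pending on that node, since none of the three options fits.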

Device Taints and Tolerations

Borrowed from the node taints/tolerations mechanism, this feature makes it possible to mark individual GPUs. Use cases:

GPU degradation detected by monitoring (high temperature, ECC errors) → taint "degraded=true:NoSchedule"
Reserving GPUs for a specific team → taint "team=ml-research:NoSchedule"
GPU with specific firmware required for certain workloads

Pods must then include an explicit toleration to use these tainted GPUs, offering fine-grained control at the device level, not just at the node level.
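The matching rule mirrors node taints: a device is usable only if every taint on it is tolerated. A short sketch under that assumption, where the `Taint` class and the dictionary layout of tolerations are simplifications chosen for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str = "NoSchedule"


def tolerates(tolerations, taint):
    """True if one toleration matches the taint's key and effect, and either
    uses operator 'Exists' or carries the same value."""
    for tol in tolerations:
        if tol["key"] != taint.key:
            continue
        if tol.get("effect", taint.effect) != taint.effect:
            continue
        if tol.get("operator", "Equal") == "Exists" or tol.get("value") == taint.value:
            return True
    return False


def usable_devices(devices, tolerations):
    """Keep the devices whose taints are all tolerated."""
    return [name for name, taints in devices.items()
            if all(tolerates(tolerations, t) for t in taints)]


devices = {
    "gpu0": [],
    "gpu1": [Taint("team", "ml-research")],
    "gpu2": [Taint("degraded", "true")],
}
research = [{"key": "team", "operator": "Equal", "value": "ml-research"}]
print(usable_devices(devices, []))        # ['gpu0']
print(usable_devices(devices, research))  # ['gpu0', 'gpu1']
```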

Partitionable Devices: MIG and virtual GPUs

NVIDIA Multi-Instance GPU (MIG) makes it possible to split an A100 or H100 into up to 7 isolated instances. The DRA alpha in Kubernetes 1.35 exposes this capability natively. A DeviceClass can define "partitions":

apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: nvidia-h100-mig-3g.40gb
spec:
  selectors:
    - cel:
        expression: device.driver == "gpu.nvidia.com"
  config:
    - opaque:  # config is a list; the parameters schema is driver-defined and illustrative here
        driver: gpu.nvidia.com
        parameters:
          apiVersion: gpu.nvidia.com/v1alpha1
          kind: MIGDeviceClaimParameters
          profile: 3g.40gb  # MIG profile: 3 GPU slices, 40 GB VRAM
          sharing:
            strategy: TimeSlicing  # Or MPS for fine-grained sharing

With the smallest profile (1g.10gb), this makes it possible to run up to 7 inference pods on a single physical H100, each with its own isolated MIG slice; the larger 3g.40gb profile used in this example fits two instances per GPU. Ideal for maximizing GPU utilization on lightweight inference workloads. For deployment context, see our article on Docker in production.
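To see how many instances of a given profile fit on one GPU, a rough upper bound can be computed from the profile name. This sketch assumes the usual A100/H100 layout of 7 compute slices and 8 memory slices, and it ignores MIG placement rules, so treat the result as a ceiling rather than a guarantee. `max_instances` is a hypothetical helper.

```python
import re

COMPUTE_SLICES = 7  # compute slices per GPU (assumed A100/H100 layout)
MEMORY_SLICES = 8   # memory slices per GPU (10 GB each on an 80 GB card)


def max_instances(profile, gpu_memory_gb=80):
    """Upper bound on identical MIG instances of `profile` on one GPU."""
    match = re.fullmatch(r"(\d+)g\.(\d+)gb", profile)
    if match is None:
        raise ValueError(f"unrecognized MIG profile: {profile}")
    compute, memory_gb = int(match.group(1)), int(match.group(2))
    slice_gb = gpu_memory_gb // MEMORY_SLICES
    memory_slices = -(-memory_gb // slice_gb)  # ceiling division
    return min(COMPUTE_SLICES // compute, MEMORY_SLICES // memory_slices)


for profile in ("1g.10gb", "2g.20gb", "3g.40gb", "7g.80gb"):
    print(profile, max_instances(profile))
# 1g.10gb 7
# 2g.20gb 3
# 3g.40gb 2
# 7g.80gb 1
```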

DRA configuration for GPUs: a practical example

Let's deploy a Kubernetes 1.35 cluster with the NVIDIA DRA driver to manage H100 GPUs in full DRA mode.

1. Infrastructure prerequisites

# Nodes with NVIDIA GPUs + drivers installed
# Compatible container runtime with CDI support (containerd 1.7+, cri-o 1.28+)
# Kubernetes 1.35+ (the DynamicResourceAllocation feature gate is on by default since the core APIs went GA in 1.34)

# Check DRA support
kubectl api-resources | grep "resource.k8s.io"
# Should display: deviceclasses, resourceclaims, resourceclaimtemplates, resourceslices

2. Installing the NVIDIA DRA driver

# Install the NVIDIA GPU Operator with Helm, leaving the legacy device plugin off
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set devicePlugin.enabled=false  # Disables the legacy mode

# Deploy the DRA driver (replaces nvidia-device-plugin)
# Chart values differ between driver releases: check `helm show values` before installing
helm install nvidia-dra-driver nvidia/nvidia-dra-driver-gpu \
  --namespace nvidia-dra-system \
  --create-namespace \
  --set resources.gpus.enabled=true

3. Create a DeviceClass for H100

apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: nvidia-h100-80gb
spec:
  selectors:
    - cel:
        # Attributes are addressed as device.attributes["<domain>"].<name>
        expression: 'device.driver == "gpu.nvidia.com" && device.attributes["gpu.nvidia.com"].model == "H100"'
  config:
    - opaque:  # Passed through to the driver; the schema below is illustrative
        driver: gpu.nvidia.com
        parameters:
          apiVersion: gpu.nvidia.com/v1alpha1
          kind: GpuClaimParameters
          sharing:
            strategy: None  # Exclusive GPU, no sharing
          memory: "80Gi"  # Minimum VRAM
          topology:
            nvlink: "required"  # Requires NVLink between GPUs if count > 1

4. ResourceClaimTemplate for an ML Deployment

apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: training-gpu-claim
  namespace: ml-team
spec:
  spec:
    devices:
      requests:
        - name: gpus
          exactly:
            deviceClassName: nvidia-h100-80gb
            allocationMode: ExactCount
            count: 2  # 2 GPUs per pod
      constraints:
        - requests: ["gpus"]
          # The 2 GPUs must be in the same NVLink domain
          matchAttribute: "gpu.nvidia.com/nvlink-domain"
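The same-domain constraint can be pictured as grouping devices by one attribute and picking a group that is large enough. A minimal sketch, with a hypothetical `pick_same_domain` helper and a plain dictionary standing in for the attributes a driver would publish:

```python
from collections import defaultdict


def pick_same_domain(devices, count, attribute="nvlink-domain"):
    """Return `count` device names sharing one value of `attribute`, or None."""
    groups = defaultdict(list)
    for name, attrs in devices.items():
        if attribute in attrs:
            groups[attrs[attribute]].append(name)
    for domain in sorted(groups):  # deterministic order for the example
        if len(groups[domain]) >= count:
            return groups[domain][:count]
    return None


devices = {
    "gpu0": {"nvlink-domain": "a"},
    "gpu1": {"nvlink-domain": "b"},
    "gpu2": {"nvlink-domain": "b"},
    "gpu3": {},  # no NVLink attribute published
}
print(pick_same_domain(devices, 2))  # ['gpu1', 'gpu2']
print(pick_same_domain(devices, 3))  # None
```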

5. Use the claim in a PyTorch Pod

apiVersion: v1
kind: Pod
metadata:
  name: pytorch-distributed-training
  namespace: ml-team
spec:
  resourceClaims:
    - name: gpus
      resourceClaimTemplateName: training-gpu-claim

  containers:
    - name: trainer
      image: pytorch/pytorch:2.5.0-cuda12.4-cudnn9-runtime
      command: ["python", "train.py"]
      # No NVIDIA_VISIBLE_DEVICES variable needed: the DRA driver injects the allocated GPUs through CDI

      resources:
        claims:
          - name: gpus  # Reference to the resourceClaim
        limits:
          memory: "64Gi"
          cpu: "16"

  restartPolicy: OnFailure

Once deployed, check the allocation:

# Claim status (generated claims are named <pod>-<claim>-<suffix>)
kubectl get resourceclaim -n ml-team
kubectl describe resourceclaim -n ml-team pytorch-distributed-training-gpus-xxxxx

# DRA driver logs (pod names depend on the Helm release)
kubectl get pods -n nvidia-dra-system
kubectl logs -n nvidia-dra-system <driver-pod-name>

# Allocated devices visible inside the pod
kubectl exec -it -n ml-team pytorch-distributed-training -- nvidia-smi

The scheduler automatically placed the pod on a node with 2x H100 connected over NVLink, in line with the claim's constraints. To manage the deployment in production, refer to our Terraform for Kubernetes guide.

Impact on ML pipelines: from development to production

Stronger isolation and multi-tenancy

With DRA, each ML team can have its own DeviceClasses with quotas and constraints. A "ml-research" namespace can allow claims of up to 8 GPUs, while "ml-production" can go up to 64 GPUs for LLM training. Kubernetes ResourceQuotas apply to claims:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: ml-research-quota
  namespace: ml-research
spec:
  hard:
    count/resourceclaims.resource.k8s.io: "10"  # Max 10 active claims
    nvidia-h100-80gb.deviceclass.resource.k8s.io/devices: "16"  # Max 16 H100 GPUs total

Smart scheduling for heterogeneous workloads

A typical AI cluster mixes several types of workloads:

Model training: 8-64 GPUs, high NVLink bandwidth, 12-72h duration
Fine-tuning: 1-4 GPUs, less demanding inter-GPU, 2-8h duration
Batch inference: 1-2 GPUs, can tolerate MIG or shared GPUs
Real-time inference: 1 MIG GPU slice, latency-critical

DRA makes it possible to define DeviceClasses tailored to each use case. The scheduler can then bin-pack efficiently, placing MIG inference pods on partially used nodes while keeping entire nodes free for distributed training. According to NVIDIA benchmarks, this significantly improves the GPU utilization rate compared to classic device plugins.
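The bin-packing idea can be illustrated with a best-fit heuristic: each request goes to the GPU with the least free capacity that still fits, which keeps emptier GPUs available for large jobs. This is a sketch of the general technique, not the scheduler's scoring logic, and `best_fit` is a hypothetical name.

```python
def best_fit(requests, free_slices):
    """Place each (pod, slices_needed) on the fullest GPU that still fits.

    Returns (placement, remaining), where placement maps pod -> GPU name
    (or None if nothing fits) and remaining maps GPU name -> free slices.
    """
    placement = {}
    remaining = dict(free_slices)
    for pod, need in requests:
        candidates = [gpu for gpu, free in remaining.items() if free >= need]
        if not candidates:
            placement[pod] = None
            continue
        gpu = min(candidates, key=lambda g: (remaining[g], g))
        remaining[gpu] -= need
        placement[pod] = gpu
    return placement, remaining


requests = [("infer-a", 1), ("infer-b", 2), ("train", 7)]
placement, remaining = best_fit(requests, {"gpu0": 3, "gpu1": 7})
print(placement)  # {'infer-a': 'gpu0', 'infer-b': 'gpu0', 'train': 'gpu1'}
print(remaining)  # {'gpu0': 0, 'gpu1': 0}
```

Both inference pods land on the partially used GPU, so the untouched one remains whole for the training job.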

Lifecycle management: draining and maintenance

Draining a GPU node in production becomes cleaner. With device plugins, a kubectl drain abruptly evicted pods. With DRA, the controller can:

Mark the node's GPUs with a taint "maintenance=true:NoSchedule"
Wait for in-progress training pods to checkpoint (via PreStop hooks)
Progressively release the claims
Signal to the node that it can be taken offline

This approach reduces lost work on multi-day training runs. For update strategies, see our Nginx production optimizations, which share similar patterns.

GPU monitoring with Prometheus: DRA-aware metrics

DRA exposes new metrics via the OpenTelemetry and Prometheus standards. NVIDIA DCGM (Data Center GPU Manager) integrates natively with the DRA driver.

Deploying the monitoring stack

# Install DCGM exporter with DRA support
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring \
  --set serviceMonitor.enabled=true \
  --set "arguments={--dra-enabled,--collect-interval=5000}"  # Helm list syntax, quoted for the shell

# Prometheus automatically discovers the exporters via ServiceMonitor
# Check the targets
kubectl port-forward -n monitoring svc/prometheus 9090:9090
# http://localhost:9090/targets → dcgm-exporter endpoints

Key DRA metrics

The exposed metrics include:

# GPU utilization per claim
DCGM_FI_DEV_GPU_UTIL{resourceclaim="training-gpu-claim-xxxxx", namespace="ml-team", pod="pytorch-training-0", gpu="0"}

# VRAM used vs allocated per claim
DCGM_FI_DEV_FB_USED{resourceclaim="...", gpu="0"}
DCGM_FI_DEV_FB_RESERVED{resourceclaim="...", gpu="0"}

# NVLink throughput between GPUs of the same claim
DCGM_FI_PROF_NVLINK_TX_BYTES{resourceclaim="...", gpu_pair="0-1"}

# DRA constraint violations (e.g. GPU temperature > threshold)
dra_constraint_violations_total{deviceclass="nvidia-h100-80gb", constraint="temperature"}

# DRA allocation time (BindingConditions)
dra_allocation_duration_seconds{deviceclass="nvidia-h100-80gb", phase="binding"}

Grafana dashboard for DRA

Create a dashboard centered on claims rather than nodes:

# Panel: GPU utilization per ResourceClaim
avg by (resourceclaim, namespace) (DCGM_FI_DEV_GPU_UTIL)

# Panel: Allocation efficiency (busy GPUs / GPUs reporting in the cluster)
count(DCGM_FI_DEV_GPU_UTIL > 10) / count(DCGM_FI_DEV_GPU_UTIL) * 100

# Panel: p95 DRA allocation duration
histogram_quantile(0.95, sum(rate(dra_allocation_duration_seconds_bucket[5m])) by (le, deviceclass))

# Alert: Claim pending too long (Prometheus 2.x rule file format)
groups:
  - name: dra
    rules:
      - alert: DRAClaimPendingTooLong
        expr: kube_resourceclaim_status_allocation_state{state="pending"} == 1
        for: 5m
        annotations:
          summary: "ResourceClaim {{ $labels.resourceclaim }} pending > 5min"
          description: "Possible shortage of {{ $labels.deviceclass }} GPUs in the cluster"
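The allocation-efficiency panel boils down to the share of GPUs whose utilization exceeds a threshold. A small sketch of that arithmetic, assuming utilization samples keyed by GPU name (`allocation_efficiency` and the 10% threshold are illustrative choices):

```python
def allocation_efficiency(utilization_by_gpu, busy_threshold=10.0):
    """Percentage of GPUs whose utilization is above `busy_threshold`."""
    if not utilization_by_gpu:
        return 0.0
    busy = sum(1 for util in utilization_by_gpu.values() if util > busy_threshold)
    return 100.0 * busy / len(utilization_by_gpu)


samples = {"gpu0": 95, "gpu1": 3, "gpu2": 60, "gpu3": 0}
print(allocation_efficiency(samples))  # 50.0
```

Counting GPUs above the threshold, rather than summing their utilization values, is what keeps the ratio between 0 and 100.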

For a complete Kubernetes monitoring setup, see our Prometheus and Grafana in production guide and the article on Linux metrics in production.

AI workload observability

Coupled with tools like MLflow or Weights & Biases, DRA monitoring makes it possible to correlate ML performance and GPU utilization:

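# Example: trace GPU util vs training loss
# In the PyTorch code, log DRA metrics
import os

import prometheus_client as prom
import wandb

# Labelled by claim only: an 'epoch' label would create one time series per epoch
gpu_util_gauge = prom.Gauge('training_gpu_util', 'GPU utilization during training', ['claim'])
claim_name = os.getenv('CLAIM_NAME', 'unknown')  # Variable you set in the pod spec


def get_claim_gpu_metrics():
    """Placeholder: query the Prometheus API or DCGM for the claim's GPU utilization."""
    raise NotImplementedError


num_epochs = 10  # Comes from your training configuration
prom.start_http_server(8000)  # Expose /metrics so Prometheus can scrape the gauge
wandb.init(project="dra-training")  # Hypothetical project name

for epoch in range(num_epochs):
    # ... training loop, which computes `loss` ...

    # Retrieve the claim's GPU metrics
    gpu_util = get_claim_gpu_metrics()  # Via Prometheus API or DCGM
    gpu_util_gauge.labels(claim=claim_name).set(gpu_util)

    # Log training loss
    wandb.log({'loss': loss, 'gpu_util': gpu_util, 'epoch': epoch})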

This reveals valuable insights: if GPU util drops to 30% during certain training steps, it points to an I/O or CPU bottleneck that wastes expensive GPU time.
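Spotting such drops can be automated by scanning the logged utilization for sustained low stretches. A sketch assuming one utilization sample per training step; the function name, the 30% threshold and the minimum run length are arbitrary choices for the example.

```python
def low_utilization_steps(samples, threshold=30.0, min_run=3):
    """Return (start, end) step ranges where utilization stays at or below
    `threshold` for at least `min_run` consecutive samples."""
    runs, start = [], None
    for step, util in enumerate(samples):
        if util <= threshold:
            if start is None:
                start = step
        elif start is not None:
            if step - start >= min_run:
                runs.append((start, step - 1))
            start = None
    if start is not None and len(samples) - start >= min_run:
        runs.append((start, len(samples) - 1))
    return runs


trace = [92, 90, 28, 25, 30, 88, 20, 91, 15, 12, 10]
print(low_utilization_steps(trace))  # [(2, 4), (8, 10)]
```

The single dip at step 6 is ignored because it does not last long enough; the two reported ranges are the ones worth correlating with data loading or CPU preprocessing.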

Migrating from device plugins: a progressive strategy

Migrating a production GPU cluster to DRA requires a progressive approach to avoid disruptions.

Phase 1: Dual-stack (weeks 1-2)

# Keep nvidia-device-plugin active
kubectl get daemonset -n kube-system nvidia-device-plugin-daemonset

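# Install the DRA driver in parallel
# Make sure each GPU is advertised by only one mechanism, for example by splitting nodes with labels
helm install nvidia-dra-driver nvidia/nvidia-dra-driver-gpu \
  --namespace nvidia-dra-system \
  --create-namespace \
  --set resources.gpus.enabled=true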

# Existing pods keep using resources.limits.nvidia.com/gpu
# New workloads test DRA on a dedicated DeviceClass
kubectl apply -f deviceclass-test.yaml

Phase 2: Migration per namespace (weeks 3-6)

Migrate namespace by namespace, starting with dev/staging:

# Namespace ml-dev: forbid the legacy mode
kubectl label namespace ml-dev gpu-mode=dra-only

# An AdmissionWebhook rejects pods with resources.limits.nvidia.com/gpu in this namespace
# Forces the use of resourceClaims

# Update the Deployments/StatefulSets
# Before:
resources:
  limits:
    nvidia.com/gpu: 2

# After:
# In the pod spec:
resourceClaims:
  - name: gpus
    resourceClaimTemplateName: ml-dev-gpu-claim
# In each container that needs the GPUs:
resources:
  claims:
    - name: gpus

Phase 3: Disabling the device plugin (week 7+)

Once all namespaces are migrated:

# Verify that no pod uses the legacy mode
kubectl get pods --all-namespaces -o json | \
  jq -r '.items[] | select(any(.spec.containers[]; .resources.limits["nvidia.com/gpu"] != null)) | "\(.metadata.namespace)/\(.metadata.name)"'

# If empty, disable the device plugin
kubectl delete daemonset -n kube-system nvidia-device-plugin-daemonset

# No feature gate to turn off: DevicePlugins went GA in 1.26 and the gate was later removed,
# so --feature-gates=DevicePlugins=false is no longer recognized by the kubelet
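The same check can be scripted for CI, which makes it easier to cover init containers and requests as well as limits. A sketch that parses the output of `kubectl get pods --all-namespaces -o json`; `legacy_gpu_pods` is a hypothetical helper and the sample document below is made up for the example.

```python
import json


def legacy_gpu_pods(pod_list_json, resource="nvidia.com/gpu"):
    """Return sorted 'namespace/name' of pods where any container (init
    containers included) asks for `resource` in limits or requests."""
    found = set()
    for pod in json.loads(pod_list_json).get("items", []):
        spec = pod.get("spec", {})
        containers = spec.get("containers", []) + spec.get("initContainers", [])
        for container in containers:
            resources = container.get("resources") or {}
            limits = resources.get("limits") or {}
            requests = resources.get("requests") or {}
            if resource in limits or resource in requests:
                meta = pod["metadata"]
                found.add(f'{meta.get("namespace", "default")}/{meta["name"]}')
                break
    return sorted(found)


sample = json.dumps({"items": [
    {"metadata": {"namespace": "ml-dev", "name": "old-trainer"},
     "spec": {"containers": [{"name": "c", "resources": {"limits": {"nvidia.com/gpu": "2"}}}]}},
    {"metadata": {"namespace": "ml-dev", "name": "new-trainer"},
     "spec": {"containers": [{"name": "c", "resources": {"claims": [{"name": "gpus"}]}}]}},
    {"metadata": {"namespace": "web", "name": "frontend"},
     "spec": {"containers": [{"name": "c"}]}},
]})
print(legacy_gpu_pods(sample))  # ['ml-dev/old-trainer']
```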

Rollback plan

Keep a plan B in case of critical regression:

etcd snapshot before each phase
Keep the device plugin manifests in Git
Automatic rollback script if pod error rate > threshold
Canary deployments: 10% of pods on DRA, monitor 48h before scaling up
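The automatic rollback trigger can be as simple as comparing the canary's pod error rate with the device-plugin baseline. A sketch under assumed thresholds: the 2x ratio, the 1% floor and the minimum sample size are placeholders to tune, and `should_rollback` is a hypothetical function.

```python
def should_rollback(dra_errors, dra_total, baseline_errors, baseline_total,
                    max_ratio=2.0, min_samples=20):
    """Roll back when the DRA canary's error rate exceeds `max_ratio` times
    the baseline rate (floored at 1%), once enough canary pods have run."""
    if dra_total < min_samples:
        return False  # not enough data to judge
    dra_rate = dra_errors / dra_total
    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    return dra_rate > max_ratio * max(baseline_rate, 0.01)


# 10% of pods on DRA: 50 canary pods against 450 on the device plugin
print(should_rollback(5, 50, 9, 450))  # True  (10% vs a 2% baseline)
print(should_rollback(1, 50, 9, 450))  # False (2% vs a 2% baseline)
print(should_rollback(3, 10, 0, 90))   # False (too few canary pods)
```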

DRA limitations and roadmap

While Kubernetes 1.35 brings major advances, some limitations remain:

No live migration: a pod cannot migrate its claim to another node without a restart
No GPU snapshots: it's impossible to checkpoint a GPU's VRAM state for migration
Limited dynamic quotas: quotas remain static, with no auto-scaling of GPU quotas
No multi-cluster DRA: no federation of claims across Kubernetes clusters

The Kubernetes 1.36-1.37 roadmap plans:

DRA for other resources: networking (SmartNIC, RDMA), storage (NVMe-oF), memory (CXL)
Integration with the Cluster Autoscaler: automatically provision GPU nodes based on pending claims
Cost-aware scheduling: pick the cheapest GPU that satisfies the constraints (multi-provider cloud)
Native GPU time-slicing: fine-grained temporal sharing without MIG, for consumer GPUs

To follow the development, see KEP 4381 and SIG Node.

Conclusion: Kubernetes, the infrastructure of AI factories

Kubernetes 1.35 confirms the transformation of K8s from a general-purpose container orchestrator into the reference platform for distributed AI. Moving GPU DRA to beta removes a major friction point for ML teams: manually managing GPU allocation via custom scripts or legacy HPC systems like SLURM.

DRA's declarative approach aligns perfectly with the GitOps philosophy: all GPU configuration (DeviceClasses, quotas, constraints) is versioned in YAML, auditable, and reproducible. Coupled with tools like ArgoCD or Flux, you can deploy complete ML environments in minutes.

For organizations running GPU clusters at scale (100+ GPUs), the gains are tangible:

Significant improvement in GPU utilization thanks to smart scheduling and MIG
Reduced allocation debugging time with DRA observability
Lower cloud costs through optimized GPU bin-packing
Faster time-to-production for new ML projects

If you're starting a new GPU cluster, adopt DRA right away. If you have an existing cluster with device plugins, plan the migration for Q2 2026 to take advantage of the stabilized beta features. Kubernetes 1.36 (expected on April 22, 2026) should see several of these features go stable.

The era of "AI factories" is only just beginning, and Kubernetes is establishing itself as the operating system of this revolution. To dig deeper into containerization, explore our resources on Docker and our production lessons learned.

FAQ: Dynamic Resource Allocation and GPUs in Kubernetes

What's the difference between DRA and device plugins for GPUs?

Device plugins only allow you to request a quantity of GPUs (1, 2, 4...) with no additional constraints. DRA (Dynamic Resource Allocation) introduces structured parameters that make it possible to specify complex constraints: NVLink topology, minimum VRAM, MIG profiles, affinity with other resources. DRA also moves the logic out of the Kubernetes core and into external drivers, offering greater flexibility and extensibility.

Should I migrate my GPU workloads to DRA immediately?

No, a progressive migration is recommended. Device plugins keep working in Kubernetes 1.35. Start by installing the DRA driver in dual-stack mode, test on non-critical namespaces, then migrate progressively. New GPU clusters should adopt DRA right away, while existing clusters can plan the migration over 6-8 weeks.

Does DRA work with GPUs other than NVIDIA?

Yes, DRA is vendor-agnostic. AMD offers a DRA driver for its MI300 GPUs, Intel for its Data Center Max GPUs. The DRA framework can also manage other accelerators: Google TPUs, Habana Gaudi, Xilinx FPGAs. Each vendor provides its own driver that implements Kubernetes' standard DRA API.

How do I monitor GPUs with DRA in Prometheus?

Use the NVIDIA DCGM Exporter with the --dra-enabled flag. The exposed metrics include GPU utilization per ResourceClaim, VRAM allocated vs used, NVLink throughput, and DRA allocation durations. Create Grafana dashboards centered on claims rather than nodes for per-ML-workload visibility. Prometheus ServiceMonitors automatically discover the DCGM exporters.

Can a GPU be shared across several pods with DRA?

Yes, through several mechanisms. MIG (Multi-Instance GPU) on NVIDIA A100/H100 splits a GPU into isolated slices. Time-slicing enables temporal sharing for GPUs without MIG. ResourceClaims can also be shared across pods (shared mode instead of exclusive) for lightweight inference workloads. The sharing strategy is configured in the DeviceClass.

What are the current limitations of DRA in Kubernetes 1.35?

DRA does not yet support live migration of GPU pods, VRAM state snapshots, or dynamic auto-scaling of GPU quotas. Multi-cluster federation of ResourceClaims doesn't exist. Some features such as Partitionable Devices, Device Taints and Consumable Capacity remain in alpha in Kubernetes 1.35. Kubernetes 1.36-1.37 should stabilize these features and add integration with the Cluster Autoscaler to automatically provision GPU nodes.

Morgann Riu

Cybersecurity and Linux administration expert. I help companies secure and optimize their critical infrastructures.
