
[BUG] Missing nvidia.com/gpu capacity in Spot nodes

Open vkatrychenko opened this issue 1 year ago • 2 comments

Describe the bug
Missing nvidia.com/gpu: 1 capacity on the Spot nodes.

[image]

VM size: beta.kubernetes.io/instance-type=Standard_NC6s_v3

To Reproduce
Steps to reproduce the behavior:

  1. Create a Spot GPU node pool:
resource "azurerm_kubernetes_cluster_node_pool" "gpu_spot_node_pool" {
  name                   = "gpuspot"
  kubernetes_cluster_id  = module.kubernetes.cluster_id
  kubelet_disk_type      = "OS"
  vm_size                = "Standard_NC6s_v3"
  priority               = "Spot"
  zones                  = ["1", "2", "3"]
  node_count             = 0
  min_count              = 0
  max_count              = 3
  enable_auto_scaling    = true
  enable_host_encryption = true
  vnet_subnet_id         = azurerm_subnet.kubeflow_aks_subnet.id
  eviction_policy        = "Delete"
  spot_max_price         = -1
  node_labels = {
    type                                    = "gpu-spot",
    "kubernetes.azure.com/scalesetpriority" = "spot"
  }
  node_taints = [
    "type=gpu-spot:NoSchedule",
    "kubernetes.azure.com/scalesetpriority=spot:NoSchedule",
  ]

  lifecycle {
    ignore_changes = [
      node_count
    ]
  }

  depends_on = [module.kubernetes]
}
  2. Then deploy a pod/job with proper affinity/toleration rules and set the Kubernetes resource requests and limits (a capacity check follows the manifest):
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: type
            operator: In
            values:
            - gpu-spot
  containers:
  - name: test-gpu-spot
    resources:
      limits:
        cpu: 1200m
        memory: 2576980377600m
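        # 2576980377600m = 2.4Gi in milli-byte notation (Kubernetes normalizes fractional quantities this way)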
        nvidia.com/gpu: "1"
      requests:
        cpu: "1"
        memory: 2Gi
        nvidia.com/gpu: "1"
  tolerations:
  - effect: NoSchedule
    key: type
    operator: Equal
    value: gpu-spot
  - effect: NoSchedule
    key: kubernetes.azure.com/scalesetpriority
    operator: Equal
    value: spot
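On a healthy GPU node the capacity should include nvidia.com/gpu: 1; the Spot nodes here do not advertise it. A quick way to check what the Spot nodes report (the type=gpu-spot label selector matches the node pool definition in step 1):

# List Spot GPU nodes with the GPU capacity they advertise
kubectl get nodes -l type=gpu-spot \
  "-o=custom-columns=NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu"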

Expected behavior
The pod should be up and running; instead it is stuck in the Pending state. Kubernetes events:

Warning  FailedScheduling  0s  default-scheduler  0/4 nodes are available: 1 Insufficient nvidia.com/gpu, 3 node(s) had untolerated taint {CriticalAddonsOnly: true}. preemption: 0/4 nodes are available: 1 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling..


Environment:

  • CLI Version 2.61.0
  • Kubernetes version 1.27.9
  • Browser: Safari, Firefox


vkatrychenko avatar Jul 18 '24 11:07 vkatrychenko

Any news on this?

domirohner avatar Jul 30 '24 12:07 domirohner

Hi @vkatrychenko, have you confirmed that the NVIDIA device plugin is installed and the GPUs in your node pool are schedulable? For reference: https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-ubuntu-gpu-node-pool#nvidia-device-plugin-installation
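One quick check is whether the plugin DaemonSet has a pod on every GPU node, including the Spot ones; a sketch, assuming the upstream default manifest (DaemonSet nvidia-device-plugin-daemonset in the kube-system namespace; adjust the names for your install):

# How many nodes does the device plugin DaemonSet cover?
kubectl get daemonset nvidia-device-plugin-daemonset -n kube-system
# Is a plugin pod actually running on each Spot GPU node?
kubectl get pods -n kube-system -o wide | grep nvidia-device-plugin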

sdesai345 avatar Jul 30 '24 16:07 sdesai345

Hi @sdesai345! Thanks for pointing me in the right direction. We use a DaemonSet to install the NVIDIA device plugin, and its affinity rules only matched the regular node pools, so the plugin was never scheduled onto the Spot instances. We will fix it on our end. The issue can be closed.
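For anyone who hits the same thing, the shape of the fix is to let the plugin DaemonSet land on the Spot nodes as well. A minimal sketch, assuming the taints from the node pool definition above: the DaemonSet pod template needs tolerations for them, and any nodeAffinity that matched only the regular pools has to be widened.

# Tolerations added to the device plugin DaemonSet pod template so it
# can schedule onto the tainted Spot GPU nodes:
tolerations:
- key: kubernetes.azure.com/scalesetpriority
  operator: Equal
  value: spot
  effect: NoSchedule
- key: type
  operator: Equal
  value: gpu-spot
  effect: NoSchedule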

vkatrychenko avatar Jul 31 '24 10:07 vkatrychenko

@vkatrychenko That's great to hear, and please follow up if the issue persists.

sdesai345 avatar Aug 01 '24 18:08 sdesai345

@vkatrychenko Can you explain how you fixed this issue? I'm experiencing the same thing right now.

maxiedaniels avatar Feb 06 '25 23:02 maxiedaniels

Spot instance type / node size: Standard_NC8as_T4_v3. Still facing this issue. Can you explain how it was fixed? @vkatrychenko @sdesai345 @allyford

siddarth-devakumar avatar Jul 10 '25 12:07 siddarth-devakumar

Hi @siddarth-devakumar, have you tried adjusting the affinity rules on your NVIDIA Kubernetes device plugin DaemonSet so that they target both the regular and Spot GPU node pools?
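For example, if the plugin's nodeAffinity keys off a pool label, a term along these lines would match both (the gpu value for the regular pool is hypothetical; gpu-spot is the label from the pool defined earlier in this thread):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: type
          operator: In
          values:
          - gpu       # regular GPU pool label (hypothetical)
          - gpu-spot  # Spot GPU pool label from this thread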

sdesai345 avatar Jul 14 '25 15:07 sdesai345