
[BUG] Missing nvidia.com/gpu capacity in Spot nodes

Open vkatrychenko opened this issue 1 year ago • 2 comments

Describe the bug
Missing nvidia.com/gpu: 1 capacity on the Spot nodes.

[image]

VM size: beta.kubernetes.io/instance-type=Standard_NC6s_v3

To Reproduce
Steps to reproduce the behavior:

  1. Create a Spot GPU node pool:
resource "azurerm_kubernetes_cluster_node_pool" "gpu_spot_node_pool" {
  name                   = "gpuspot"
  kubernetes_cluster_id  = module.kubernetes.cluster_id
  kubelet_disk_type      = "OS"
  vm_size                = "Standard_NC6s_v3"
  priority               = "Spot"
  zones                  = ["1", "2", "3"]
  node_count             = 0
  min_count              = 0
  max_count              = 3
  enable_auto_scaling    = true
  enable_host_encryption = true
  vnet_subnet_id         = azurerm_subnet.kubeflow_aks_subnet.id
  eviction_policy        = "Delete"
  spot_max_price         = -1
  node_labels = {
    type                                    = "gpu-spot",
    "kubernetes.azure.com/scalesetpriority" = "spot"
  }
  node_taints = [
    "type=gpu-spot:NoSchedule",
    "kubernetes.azure.com/scalesetpriority=spot:NoSchedule",
  ]

  lifecycle {
    ignore_changes = [
      node_count
    ]
  }

  depends_on = [module.kubernetes]
}
  2. Then deploy a pod/job with proper affinity/toleration rules and set the Kubernetes resource requests and limits (a capacity check follows the manifest):
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: type
            operator: In
            values:
            - gpu-spot
  containers:
  - name: test-gpu-spot
    resources:
      limits:
        cpu: 1200m
        memory: 2576980377600m
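        # 2576980377600m = 2.4Gi in milli-byte notation (Kubernetes normalizes fractional quantities this way)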
        nvidia.com/gpu: "1"
      requests:
        cpu: "1"
        memory: 2Gi
        nvidia.com/gpu: "1"
  tolerations:
  - effect: NoSchedule
    key: type
    operator: Equal
    value: gpu-spot
  - effect: NoSchedule
    key: kubernetes.azure.com/scalesetpriority
    operator: Equal
    value: spot
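On a healthy GPU node the capacity should include nvidia.com/gpu: 1; the Spot nodes here do not advertise it. A quick way to check what the Spot nodes report (the type=gpu-spot label selector matches the node pool definition in step 1):

# List Spot GPU nodes with the GPU capacity they advertise
kubectl get nodes -l type=gpu-spot \
  "-o=custom-columns=NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu"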

Expected behavior
The pod should be up and running; instead it is stuck in the Pending state. Kubernetes events:

Warning  FailedScheduling  0s  default-scheduler  0/4 nodes are available: 1 Insufficient nvidia.com/gpu, 3 node(s) had untolerated taint {CriticalAddonsOnly: true}. preemption: 0/4 nodes are available: 1 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling..


Environment:

  • CLI Version 2.61.0
  • Kubernetes version 1.27.9
  • Browser: Safari, Firefox


vkatrychenko avatar Jul 18 '24 11:07 vkatrychenko

Any news on this?

domirohner avatar Jul 30 '24 12:07 domirohner

Hi @vkatrychenko, have you confirmed that the NVIDIA device plugin is installed and the GPUs in your node pool are schedulable? For reference: https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-ubuntu-gpu-node-pool#nvidia-device-plugin-installation
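One quick check is whether the plugin DaemonSet has a pod on every GPU node, including the Spot ones; a sketch, assuming the upstream default manifest (DaemonSet nvidia-device-plugin-daemonset in the kube-system namespace; adjust the names for your install):

# How many nodes does the device plugin DaemonSet cover?
kubectl get daemonset nvidia-device-plugin-daemonset -n kube-system
# Is a plugin pod actually running on each Spot GPU node?
kubectl get pods -n kube-system -o wide | grep nvidia-device-plugin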

sdesai345 avatar Jul 30 '24 16:07 sdesai345

Hi @sdesai345! Thanks for pointing me in the right direction. We use a DaemonSet to install the NVIDIA device plugin, and its affinity rules only matched the regular node pools, so the plugin was never scheduled onto the Spot instances. We will fix it on our end. The issue can be closed.
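For anyone who hits the same thing, the shape of the fix is to let the plugin DaemonSet land on the Spot nodes as well. A minimal sketch, assuming the taints from the node pool definition above: the DaemonSet pod template needs tolerations for them, and any nodeAffinity that matched only the regular pools has to be widened.

# Tolerations added to the device plugin DaemonSet pod template so it
# can schedule onto the tainted Spot GPU nodes:
tolerations:
- key: kubernetes.azure.com/scalesetpriority
  operator: Equal
  value: spot
  effect: NoSchedule
- key: type
  operator: Equal
  value: gpu-spot
  effect: NoSchedule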

vkatrychenko avatar Jul 31 '24 10:07 vkatrychenko

@vkatrychenko That's great to hear, and please follow up if the issue persists.

sdesai345 avatar Aug 01 '24 18:08 sdesai345

@vkatrychenko Can you explain how you fixed this issue? I'm experiencing the same thing right now.

maxiedaniels avatar Feb 06 '25 23:02 maxiedaniels

Spot instance type / node size: Standard_NC8as_T4_v3. Still facing this issue. Can you explain how it was fixed? @vkatrychenko @sdesai345 @allyford

siddarth-devakumar avatar Jul 10 '25 12:07 siddarth-devakumar

Hi @siddarth-devakumar, have you tried adjusting the affinity rules on your NVIDIA Kubernetes device plugin DaemonSet so that they target both the regular and Spot GPU node pools?
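For example, if the plugin's nodeAffinity keys off a pool label, a term along these lines would match both (the gpu value for the regular pool is hypothetical; gpu-spot is the label from the pool defined earlier in this thread):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: type
          operator: In
          values:
          - gpu       # regular GPU pool label (hypothetical)
          - gpu-spot  # Spot GPU pool label from this thread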

sdesai345 avatar Jul 14 '25 15:07 sdesai345