[BUG] Missing nvidia.com/gpu capacity in Spot nodes
Describe the bug Nodes in the Spot GPU node pool are missing the nvidia.com/gpu: 1 capacity.
VM size: beta.kubernetes.io/instance-type=Standard_NC6s_v3
To Reproduce Steps to reproduce the behavior:
- Create a Spot GPU node pool:
resource "azurerm_kubernetes_cluster_node_pool" "gpu_spot_node_pool" {
name = "gpuspot"
kubernetes_cluster_id = module.kubernetes.cluster_id
kubelet_disk_type = "OS"
vm_size = "Standard_NC6s_v3"
priority = "Spot"
zones = ["1", "2", "3"]
node_count = "0"
min_count = "0"
max_count = "3"
enable_auto_scaling = "true"
enable_host_encryption = "true"
vnet_subnet_id = azurerm_subnet.kubeflow_aks_subnet.id
eviction_policy = "Delete"
spot_max_price = "-1"
node_labels = {
type = "gpu-spot",
"kubernetes.azure.com/scalesetpriority" = "spot"
}
node_taints = [
"type=gpu-spot:NoSchedule",
"kubernetes.azure.com/scalesetpriority=spot:NoSchedule",
]
lifecycle {
ignore_changes = [
node_count
]
}
depends_on = [module.kubernetes]
}
- Then deploy a pod/job with the proper affinity/toleration rules and the GPU resource requests/limits set:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: type
                operator: In
                values:
                  - gpu-spot
  containers:
    - name: test-gpu-spot
      resources:
        limits:
          cpu: 1200m
          memory: 2576980377600m
          nvidia.com/gpu: "1"
        requests:
          cpu: "1"
          memory: 2Gi
          nvidia.com/gpu: "1"
  tolerations:
    - effect: NoSchedule
      key: type
      operator: Equal
      value: gpu-spot
    - effect: NoSchedule
      key: kubernetes.azure.com/scalesetpriority
      operator: Equal
      value: spot
Expected behavior The pod should be up and running, but instead it is stuck in the Pending state. Kubernetes events:
Warning  FailedScheduling  0s  default-scheduler  0/4 nodes are available: 1 Insufficient nvidia.com/gpu, 3 node(s) had untolerated taint {CriticalAddonsOnly: true}. preemption: 0/4 nodes are available: 1 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling..
Environment (please complete the following information):
- CLI Version 2.61.0
- Kubernetes version 1.27.9
Any news on this?
Hi @vkatrychenko, have you confirmed that the NVIDIA device plugin is installed and the GPUs in your node pool are schedulable? For reference: https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-ubuntu-gpu-node-pool#nvidia-device-plugin-installation
Hi @sdesai345! Thanks for pointing me in the right direction. We use a DaemonSet to install the NVIDIA device plugin, and its affinity rules only targeted the regular node pools, so the plugin was never scheduled onto the Spot instances. We will fix it on our end. The issue can be closed.
@vkatrychenko That's great to hear. Please follow up if the issue persists.
@vkatrychenko Can you explain how you fixed this issue? I'm experiencing the same thing right now.
Spot instance type / node size: Standard_NC8as_T4_v3. I'm still facing this issue. Can you explain how it was fixed? @vkatrychenko @sdesai345 @allyford
Hi @siddarth-devakumar, have you tried adjusting the affinity rules on your NVIDIA Kubernetes device plugin DaemonSet so that it targets both the reserved-compute and Spot GPU node pools?
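For example, here is a minimal sketch of the relevant part of the device plugin DaemonSet pod template. The type=gpu label for the on-demand GPU pool is an assumption on my side; substitute whatever labels and taints your node pools actually carry:

spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: type
                    operator: In
                    values:
                      - gpu       # assumed label on the regular (on-demand) GPU pool
                      - gpu-spot  # label set on the Spot pool earlier in this issue
      tolerations:
        # Tolerate the taints on the Spot GPU pool so the plugin pod can be scheduled there.
        - key: type
          operator: Equal
          value: gpu-spot
          effect: NoSchedule
        - key: kubernetes.azure.com/scalesetpriority
          operator: Equal
          value: spot
          effect: NoSchedule

Once the plugin pod runs on the Spot nodes, they should start advertising nvidia.com/gpu in their capacity and the pending pod should schedule.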