Unable To Use The GPU Node Pool On Azure AKS
Here is the setup of my AKS cluster:
- AKS version: 1.29.2
- Node pools: 3 (system pool, general node pool, and GPU node pool)
- NVIDIA driver plugins tried: NVIDIA device plugin and GPU Operator
- OS image: Ubuntu 22.04.4 LTS
- Kernel version: 5.15.0-1068-azure
- Container runtime: containerd://1.7.15-1
- NVIDIA plugin versions tried:
  - image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
  - image: nvcr.io/nvidia/k8s-device-plugin:v0.16.0
Documentation used to create the GPU node pool: https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-ubuntu-gpu-node-pool#install-nvidia-device-plugin
Here is the issue:
As per the above document, if the NVIDIA device plugin is installed successfully, the GPU should be listed under the node's Capacity section as nvidia.com/gpu: 1. However, I did not see that when I described my GPU-enabled node.
I also tried the GPU Operator, but that did not help either.
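For reference, the device plugin from that doc runs as a DaemonSet in kube-system, and nvidia.com/gpu only appears under Capacity once that DaemonSet's pod is actually running on the GPU node. If the pool was created with the sku=gpu:NoSchedule taint suggested in the doc, the DaemonSet needs a matching toleration or its pod never gets scheduled there. Below is a minimal sketch of such a manifest; the taint values and the pool name (gpunp) are assumptions, not values read from this cluster:

```yaml
# Sketch only: a minimal nvidia-device-plugin DaemonSet, close to the upstream manifest.
# The sku=gpu:NoSchedule taint and the pool name "gpunp" are assumptions.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
        # Tolerate the taint applied to the GPU pool per the linked AKS doc (if used)
        - key: sku
          operator: Equal
          value: gpu
          effect: NoSchedule
        # Standard toleration shipped with the upstream plugin manifest
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      # Optional: pin the DaemonSet to the GPU pool (assumed pool name)
      nodeSelector:
        kubernetes.azure.com/agentpool: gpunp
      containers:
        - name: nvidia-device-plugin-ctr
          image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
```

A quick way to confirm scheduling is `kubectl -n kube-system get pods -o wide | grep nvidia-device-plugin`: if no plugin pod is running on the GPU node, the capacity entry will never show up.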
I am having a similar issue. The plugin installs and my node has the capacity listed. However, any pod running on the GPU node cannot detect the device. Running kubectl exec into the plugin pod and executing nvidia-smi returns Failed to initialize NVML: Unknown Error, and running a pod with Python and attempting to use torch results in a similar issue:
```python
>>> import torch
>>> torch.cuda.is_available()
False
```
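For what it's worth, the device plugin only hands the GPU to containers that explicitly request it under resources.limits, so a pod scheduled onto the GPU node without an nvidia.com/gpu request will not see the device even though the node advertises capacity. A minimal smoke-test pod sketch, assuming the sku=gpu:NoSchedule taint from the AKS doc and an illustrative CUDA base image:

```yaml
# Sketch of a smoke-test pod; the image tag and taint values are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  tolerations:
    - key: sku                # taint from the linked AKS doc, if the pool was created with it
      operator: Equal
      value: gpu
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative CUDA base image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # without this request the container is not given the GPU device
```

If `kubectl logs gpu-smoke-test` shows the nvidia-smi table, the node-level setup is working and the problem is in the workload pod spec (typically a missing nvidia.com/gpu limit or toleration) rather than in the driver or plugin.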
I have no problems deploying the GPU Operator on an AKS cluster and invoking the GPU. The values I tested are shown below.
```yaml
# These values validated on v24.6.1 of the NVIDIA GPU Operator.
driver:
  enabled: true
toolkit:
  enabled: true
cdi:
  enabled: false
nfd:
  enabled: true
gfd:
  enabled: true
migManager:
  enabled: false
devicePlugin:
  enabled: true
dcgmExporter:
  enabled: true
```
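If the GPU pool is tainted (for example with the sku=gpu:NoSchedule taint from the AKS doc), the operator's operand DaemonSets also need a matching toleration. A hedged values fragment, assuming the chart's daemonsets.tolerations key is propagated to the operands as in recent chart versions:

```yaml
# Assumption: recent gpu-operator chart versions apply these tolerations to the
# operand DaemonSets (device plugin, container toolkit, DCGM exporter, and so on).
daemonsets:
  tolerations:
    - key: sku               # taint assumed from the linked AKS doc
      operator: Equal
      value: gpu
      effect: NoSchedule
    - key: nvidia.com/gpu    # keep the chart's default GPU toleration as well
      operator: Exists
      effect: NoSchedule
```

The values can then be applied with something like `helm upgrade --install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace -f values.yaml`.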