Ubuntu GPU nodepool fails to install nvidia-device-plugin
Describe the bug
When following this guide: https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-ubuntu-gpu-node-pool
The nvidia-device-plugin fails to detect the GPU on the Ubuntu Linux OS. When I kubectl exec into the device plugin pod and manually run the nvidia-device-plugin startup command, I get the following error: NVML: Unknown Error
Additionally, the GPU-enabled workload meant to test the GPU nodes does not work on either the UbuntuLinux or the AzureLinux OS SKUs.
To Reproduce
Steps to reproduce the behavior:
- Create a GPU node pool (node_vm_size: Standard_NC6s_v3, os_sku: UbuntuLinux); example commands are sketched after these steps
- Create the gpu-resources namespace
- Create and apply the nvidia-device-plugin daemonset
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
- Check that GPUs are schedulable
kubectl get nodes
kubectl describe node <node name>
Name: <node name>
Roles: agent
Labels: accelerator=nvidia
[...]
Capacity:
[...]
nvidia.com/gpu: 1
[...]
- Find the nvidia-device-plugin pod with kubectl get pods -n gpu-resources
- Exec into the pod with kubectl exec -it <pod-name> -n gpu-resources -- /bin/bash
- Run nvidia-smi and it will throw an error instead of printing device details
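For reference, the node pool and namespace from the first two repro steps can be created with something like the commands below (a sketch: the resource group, cluster, and node pool names are placeholders; the taint and label match what the DaemonSet tolerates and what the node labels show):

az aks nodepool add \
  --resource-group <resource-group> \
  --cluster-name <cluster-name> \
  --name gpunp \
  --node-count 1 \
  --node-vm-size Standard_NC6s_v3 \
  --os-sku Ubuntu \
  --node-taints sku=gpu:NoSchedule \
  --labels accelerator=nvidia

# Namespace referenced by the kubectl commands in the later steps
kubectl create namespace gpu-resources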
Expected behavior
The nvidia device plugin should work on the UbuntuLinux OS SKU. I have confirmed it is working on the AzureLinux OS SKU, but we require it to function on Ubuntu, and the documentation suggests that it should.
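The GPU-enabled workload referred to above is the TensorFlow sample from the linked guide. Independent of that sample, a minimal pod like the sketch below (pod name and image tag are illustrative, not taken from the guide) is enough to check whether the nvidia.com/gpu resource is schedulable at all:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test                  # illustrative name
spec:
  restartPolicy: OnFailure
  tolerations:
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: cuda-check
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # any CUDA base image that ships nvidia-smi
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1                # the resource advertised in the node Capacity above

If the pod stays Pending, the device plugin never advertised the resource; if it schedules but nvidia-smi fails, it is the same NVML error seen inside the plugin pod.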
Environment (please complete the following information):
- CLI Version 2.56.0
- Kubernetes version 1.29.4
Additional context
I0913 14:57:53.424196 23 main.go:199] Starting FS watcher.
I0913 14:57:53.424264 23 main.go:206] Starting OS watcher.
I0913 14:57:53.424503 23 main.go:221] Starting Plugins.
I0913 14:57:53.424525 23 main.go:278] Loading configuration.
I0913 14:57:53.425286 23 main.go:303] Updating config with default resource matching patterns.
I0913 14:57:53.425456 23 main.go:314]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"mpsRoot": "",
"nvidiaDriverRoot": "/",
"nvidiaDevRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"useNodeFeatureAPI": null,
"deviceDiscoveryStrategy": "auto",
"plugin": {
"passDeviceSpecs": true,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I0913 14:57:53.425487 23 main.go:317] Retrieving plugins.
E0913 14:57:53.433018 23 factory.go:68] Failed to initialize NVML: Unknown Error.
E0913 14:57:53.433038 23 factory.go:69] If this is a GPU node, did you set the docker default runtime to `nvidia`?
E0913 14:57:53.433047 23 factory.go:70] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0913 14:57:53.433054 23 factory.go:71] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0913 14:57:53.433061 23 factory.go:72] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
W0913 14:57:53.433070 23 factory.go:76] nvml init failed: Unknown Error
I0913 14:57:53.433081 23 main.go:346] No devices found. Waiting indefinitely.
Update: After accessing the node through a privileged pod, I was able to run sudo nvidia-ctk runtime configure --runtime=containerd and then sudo systemctl restart containerd.
After all the pods started up again, the nvidia-device-plugin was able to detect the GPU and the TensorFlow example ran successfully.
This isn't a viable workaround, though, because after the node is restarted for any reason the config reverts back to the default.
AKS will need to change the default containerd runtime configuration on GPU-enabled nodes, or enable users to configure it at the node pool level.
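Concretely, nvidia-ctk runtime configure --runtime=containerd writes roughly the following runtime stanza into /etc/containerd/config.toml (containerd config v2; reproduced here only to illustrate what the default config on GPU node pools would need to carry):

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  privileged_without_host_devices = false
  runtime_engine = ""
  runtime_root = ""
  runtime_type = "io.containerd.runc.v2"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"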
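For anyone hitting this without SSH or Bastion access to the nodes, the privileged-pod route above can be reproduced with kubectl debug, which starts a pod on the node with the host filesystem mounted at /host. This is a sketch of the manual workaround, not a supported AKS procedure; the node name is a placeholder and newer kubectl versions may need --profile=sysadmin:

# Attach to the node without SSH/Bastion
kubectl debug node/<node-name> -it --image=mcr.microsoft.com/cbl-mariner/busybox:2.0
chroot /host

# Add the NVIDIA runtime to the containerd config and restart containerd
nvidia-ctk runtime configure --runtime=containerd
systemctl restart containerd

Because the change is lost whenever the node is reimaged or rebooted, one way to persist it is to run those same two commands from a privileged DaemonSet at node startup until AKS ships a permanent fix.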
Action required from @aritraghosh, @julia-yin, @AllenWen-at-Azure
Issue needing attention of @Azure/aks-leads
@edoyon90 Did you find any workaround? I have some restrictions from my company's organization: I can't use Bastion or SSH directly into the GPU nodes from a terminal to update the NVIDIA plugins individually.
We are experiencing very similar problems. Following for updates. Fingers crossed for 2026
Hi @edoyon90, thanks for reporting this issue. We have tried following the repro steps and were only able to reproduce it intermittently. If you are still facing this issue in your GPU nodepools, could you please open a support ticket so we can take a closer look at your cluster? Thank you!
Hi there :wave: AKS bot here. This issue has been tagged as needing a support request so that the AKS support and engineering teams have a look into this particular cluster/issue.
Follow the steps here to create a support ticket for Azure Kubernetes Service and the cluster discussed in this issue.
Please do mention this issue in the case description so our teams can coordinate to help you. When you have created the support ticket, please add the case number as a comment to this issue to help us with tracking.
Thank you!
This issue will now be closed because it hasn't had any activity for 7 days after being marked stale. @edoyon90, feel free to comment again within the next 7 days to reopen it, or open a new issue after that time if you still have a question, issue, or suggestion.