
Ubuntu GPU nodepool fails to install nvidia-device-plugin

Open edoyon90 opened this issue 1 year ago • 12 comments

Describe the bug
When following this guide https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-ubuntu-gpu-node-pool, the nvidia-device-plugin fails to detect the GPU on the Ubuntu Linux OS. When doing a kubectl exec into the device plugin pod and manually running nvidia-device-plugin, I get the following error: NVML: Unknown Error. Additionally, the GPU-enabled workload meant to test the GPU nodes does not work on either the UbuntuLinux or the AzureLinux OS SKU.

To Reproduce
Steps to reproduce the behavior:

  1. Create a gpu nodepool (node_vm_size: Standard_NC6s_v3, os_sku: UbuntuLinux); a CLI sketch of these steps is included after the list
  2. Create the gpu-resources namespace
  3. Create and apply the nvidia-device-plugin daemonset
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

  4. Check that GPUs are schedulable: kubectl get nodes, then kubectl describe node <node name>
Name:               <node name>
Roles:              agent
Labels:             accelerator=nvidia

[...]

Capacity:
[...]
 nvidia.com/gpu:                 1
[...]
  5. Find the nvidia-device-plugin pod with kubectl get pods -n gpu-resources
  6. Exec into the pod with kubectl exec -it <pod-name> -n gpu-resources -- /bin/bash
  7. Run nvidia-smi and it will throw an error instead of printing device details
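
For anyone trying to reproduce, here is a minimal CLI sketch of steps 1-4 above. The resource group, cluster, node pool, and file names are placeholders, and the repro itself uses Terraform-style property names (node_vm_size, os_sku), so treat this as an approximation rather than the exact commands used:

# 1. GPU node pool with the VM size and taint from the AKS guide (names are placeholders)
az aks nodepool add \
  --resource-group <resource-group> \
  --cluster-name <cluster-name> \
  --name gpunp \
  --node-count 1 \
  --node-vm-size Standard_NC6s_v3 \
  --os-sku Ubuntu \
  --node-taints sku=gpu:NoSchedule

# 2. Namespace referenced later in the thread (note the manifest above targets kube-system)
kubectl create namespace gpu-resources

# 3. Apply the DaemonSet manifest above, saved locally as nvidia-device-plugin-ds.yaml
kubectl apply -f nvidia-device-plugin-ds.yaml

# 4. Confirm the node advertises the GPU resource
kubectl get nodes
kubectl describe node <node name> | grep -A3 "nvidia.com/gpu"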

Expected behavior
The nvidia device plugin should work on the UbuntuLinux os sku. I have confirmed it is working on the AzureLinux os sku, but we require it to function on Ubuntu, and the documentation suggests that it should.
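
For completeness, a minimal smoke test for this expected behavior (a sketch only; the pod name and CUDA image tag are illustrative, not the exact tensorflow sample from the guide):

# Request one GPU and run nvidia-smi; on a healthy GPU node this should print
# the device table, while on the affected Ubuntu nodes it fails like the exec above.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  tolerations:
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# once the pod has completed
kubectl logs pod/gpu-smoke-test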

Environment (please complete the following information):

  • CLI Version 2.56.0
  • Kubernetes version 1.29.4

Additional context

I0913 14:57:53.424196      23 main.go:199] Starting FS watcher.
I0913 14:57:53.424264      23 main.go:206] Starting OS watcher.
I0913 14:57:53.424503      23 main.go:221] Starting Plugins.
I0913 14:57:53.424525      23 main.go:278] Loading configuration.
I0913 14:57:53.425286      23 main.go:303] Updating config with default resource matching patterns.
I0913 14:57:53.425456      23 main.go:314] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": true,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0913 14:57:53.425487      23 main.go:317] Retrieving plugins.
E0913 14:57:53.433018      23 factory.go:68] Failed to initialize NVML: Unknown Error.
E0913 14:57:53.433038      23 factory.go:69] If this is a GPU node, did you set the docker default runtime to `nvidia`?
E0913 14:57:53.433047      23 factory.go:70] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0913 14:57:53.433054      23 factory.go:71] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0913 14:57:53.433061      23 factory.go:72] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
W0913 14:57:53.433070      23 factory.go:76] nvml init failed: Unknown Error
I0913 14:57:53.433081      23 main.go:346] No devices found. Waiting indefinitely.
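
For context on the error above: the factory.go:69 line in the log is hinting at the container runtime configuration. A quick way to check this from the node itself (a sketch, assuming kubectl debug node access is allowed and the stock containerd config path on AKS nodes):

# Ephemeral debug pod on the GPU node; the host filesystem is mounted at /host
kubectl debug node/<node name> -it --image=ubuntu
# inside the debug shell:
chroot /host

# A node that is set up for the device plugin should reference an "nvidia" runtime here;
# if nothing matches, NVML init from the plugin container fails as in the log above.
grep -n nvidia /etc/containerd/config.toml || echo "no nvidia runtime configured"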

edoyon90 commented Sep 18 '24

Update: After accessing the node through a privileged pod, I was able to run sudo nvidia-ctk runtime configure --runtime=containerd and then sudo systemctl restart containerd. After all the pods started up again, the nvidia-device-plugin was able to detect the GPU and the tensorflow example ran successfully.
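
For anyone without SSH or Bastion access to the nodes, here is a sketch of the same temporary fix using an ephemeral kubectl debug pod instead of a hand-rolled privileged pod. The node name is a placeholder, newer kubectl versions may need --profile=sysadmin for the privileged parts, and as noted below the change does not survive a node restart or reimage:

kubectl debug node/<node name> -it --image=ubuntu
# inside the debug shell, switch into the host filesystem
chroot /host

# same workaround as above, applied to the host's containerd
nvidia-ctk runtime configure --runtime=containerd
systemctl restart containerd
exit   # leave the chroot (and then the debug pod)

# back on the workstation: recreate the plugin pod so it retries NVML init
kubectl delete pod -n kube-system -l name=nvidia-device-plugin-ds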

This isn't a viable workaround though because after the node is restarted for any reason, the config reverts back to the default.

AKS will need to configure containerd to use the nvidia runtime on GPU-enabled nodes, or enable users to configure it at the nodepool level.

edoyon90 commented Oct 11 '24

Action required from @aritraghosh, @julia-yin, @AllenWen-at-Azure

Issue needing attention of @Azure/aks-leads

@edoyon90 Did you find any workaround? I have some restrictions from my company organization and I can't use Bastion or SSH directly into the nodes from a terminal to update the nvidia plugins on the GPU nodes individually.

WillianVMR commented Jan 22 '25

We are experiencing very similar problems. Following for updates. Fingers crossed for 2026

tscamell commented Feb 11 '25

Hi @edoyon90, thanks for reporting this issue. We have tried following the repro steps and were only able to reproduce it intermittently. If you are still facing this issue in your GPU nodepools, could you please open a support ticket so we can take a closer look at your cluster? Thank you!

julia-yin commented Jun 12 '25

Hi there :wave: AKS bot here. This issue has been tagged as needing a support request so that the AKS support and engineering teams have a look into this particular cluster/issue.

Follow the steps here to create a support ticket for Azure Kubernetes Service and the cluster discussed in this issue.

Please do mention this issue in the case description so our teams can coordinate to help you. When you have created the support ticket, please add the case number as a comment to this issue to help us with tracking.

Thank you!

This issue will now be closed because it hasn't had any activity for 7 days after being marked stale. @edoyon90 feel free to comment again within the next 7 days to reopen, or open a new issue after that time if you still have a question/issue or suggestion.