
nvidia-device-plugin v0.18.0 failing to start

Open gabrielbussolo opened this issue 2 months ago • 7 comments

The nvidia runtime is already set as the default runtime in containerd:

sudo crictl info | jq '.config.containerd.defaultRuntimeName'
"nvidia"

When applying the v0.18.0 manifest with k apply:

k apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.18.0/deployments/static/nvidia-device-plugin.yml

I get this error: error starting plugins: error getting plugins: unable to create plugins: failed to construct resource managers: invalid device discovery strategy

k logs nvidia-device-plugin-daemonset-fzltg -n kube-system
I1022 18:32:39.566652       1 main.go:239] "Starting NVIDIA Device Plugin" version=<
        3c9ffca9
        commit: 3c9ffca9491f0d2d362a7064138dfcd71bb57592
 >
I1022 18:32:39.566674       1 main.go:242] Starting FS watcher for /var/lib/kubelet/device-plugins
I1022 18:32:39.566692       1 main.go:249] Starting OS watcher.
I1022 18:32:39.566840       1 main.go:264] Starting Plugins.
I1022 18:32:39.566851       1 main.go:321] Loading configuration.
I1022 18:32:39.567169       1 main.go:346] Updating config with default resource matching patterns.
I1022 18:32:39.567245       1 main.go:357] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdrcopyEnabled": false,
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  },
  "imex": {}
}
I1022 18:32:39.567250       1 main.go:360] Retrieving plugins.
E1022 18:32:39.567304       1 factory.go:113] Incompatible strategy detected auto
E1022 18:32:39.567309       1 factory.go:114] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E1022 18:32:39.567311       1 factory.go:115] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E1022 18:32:39.567312       1 factory.go:116] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E1022 18:32:39.567314       1 factory.go:117] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E1022 18:32:39.567388       1 main.go:177] error starting plugins: error getting plugins: unable to create plugins: failed to construct resource managers: invalid device discovery strategy

v0.17.4 works fine

$ k apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.4/deployments/static/nvidia-device-plugin.yml
daemonset.apps/nvidia-device-plugin-daemonset created
$ kubectl get pods -n kube-system | grep nvidia                                                                                   
nvidia-device-plugin-daemonset-ss9lm      0/1     ContainerCreating   0          8s
$ kubectl get pods -n kube-system | grep nvidia
nvidia-device-plugin-daemonset-ss9lm      1/1     Running     0          12s

log from v0.17.4:

$ k logs nvidia-device-plugin-daemonset-ss9lm -n kube-system
I1022 18:38:35.316289       1 main.go:235] "Starting NVIDIA Device Plugin" version=<
        fd56a747
        commit: fd56a747defe15333adce40fcd3a06ffb129251b
 >
I1022 18:38:35.316319       1 main.go:238] Starting FS watcher for /var/lib/kubelet/device-plugins
I1022 18:38:35.316336       1 main.go:245] Starting OS watcher.
I1022 18:38:35.316431       1 main.go:260] Starting Plugins.
I1022 18:38:35.316442       1 main.go:317] Loading configuration.
I1022 18:38:35.316944       1 main.go:342] Updating config with default resource matching patterns.
I1022 18:38:35.317034       1 main.go:353] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  },
  "imex": {}
}
I1022 18:38:35.317039       1 main.go:356] Retrieving plugins.
I1022 18:38:35.331421       1 server.go:195] Starting GRPC server for 'nvidia.com/gpu'
I1022 18:38:35.331842       1 server.go:139] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I1022 18:38:35.332662       1 server.go:146] Registered device plugin for 'nvidia.com/gpu' with Kubelet

gabrielbussolo avatar Oct 22 '25 18:10 gabrielbussolo

@gabrielbussolo, just to get more information: what system are you running on, and which version of the NVIDIA Container Toolkit is installed?

elezar avatar Oct 23 '25 10:10 elezar

Hi @elezar

Please check my comment here

I'm observing the same problem and I've written some interesting findings in the comment.

gilgameshfreedom avatar Oct 24 '25 19:10 gilgameshfreedom

I have the same issue.

mahmoudk1000 avatar Oct 28 '25 15:10 mahmoudk1000

mark

I1112 09:53:22.296743 1 main.go:235] "Starting NVIDIA Device Plugin" version=<
        3c378193
        commit: 3c378193fcebf6e955f0d65bd6f2aeed099ad8ea
I1112 09:53:22.296762 1 main.go:238] Starting FS watcher for /var/lib/kubelet/device-plugins
I1112 09:53:22.296776 1 main.go:245] Starting OS watcher.
I1112 09:53:22.296891 1 main.go:260] Starting Plugins.
I1112 09:53:22.296910 1 main.go:317] Loading configuration.
I1112 09:53:22.297200 1 main.go:342] Updating config with default resource matching patterns.
I1112 09:53:22.297341 1 main.go:353] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  },
  "imex": {}
}
I1112 09:53:22.297348 1 main.go:356] Retrieving plugins.
E1112 09:53:22.297442 1 factory.go:112] Incompatible strategy detected auto
E1112 09:53:22.297449 1 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E1112 09:53:22.297452 1 factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E1112 09:53:22.297455 1 factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E1112 09:53:22.297458 1 factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I1112 09:53:22.297463 1 main.go:381] No devices found. Waiting indefinitely.

inyohh avatar Nov 12 '25 10:11 inyohh

Same issue. Has anyone found a workaround or fix?

I would presume we need to set DEVICE_DISCOVERY_STRATEGY to null rather than auto, but that does not seem to be allowed via an env variable?

riccardo32 avatar Nov 26 '25 15:11 riccardo32

Try rebooting the server, @riccardo32.

inyohh avatar Nov 27 '25 02:11 inyohh

DEVICE_DISCOVERY_STRATEGY

UPDATE: It seems my problem was not the plugin but the update to container toolkit 1.18.0. Since that update, null and auto are no longer recognised as DEVICE_DISCOVERY_STRATEGY values. I had to set the env variable to nvml, and that worked.

     env:
        - name: DEVICE_DISCOVERY_STRATEGY
          value: nvml
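
For placement, here is a rough sketch of where that env entry would sit in the DaemonSet pod template. It is a fragment, not a complete manifest. The container name nvidia-device-plugin-ctr is an assumption and does not appear anywhere in this thread, so check your own manifest (for example with k get ds nvidia-device-plugin-daemonset -n kube-system -o yaml) before copying it.

```yaml
# Fragment of the DaemonSet pod template only; merge it into the existing manifest.
spec:
  template:
    spec:
      containers:
        - name: nvidia-device-plugin-ctr  # assumed container name, verify in your manifest
          env:
            # Force NVML-based discovery instead of the default "auto"
            - name: DEVICE_DISCOVERY_STRATEGY
              value: nvml
```

If editing the manifest is inconvenient, k set env daemonset/nvidia-device-plugin-daemonset -n kube-system DEVICE_DISCOVERY_STRATEGY=nvml should apply the same change to the running DaemonSet. Whether nvml is the right strategy for a given node is not confirmed here; it is only the value reported to work above.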

riccardo32 avatar Nov 27 '25 16:11 riccardo32