Incompatible strategy detected auto No devices found. Waiting indefinitely
Hello everyone, I have the k8s-device-plugin running on my second Kubernetes master node, which has a GeForce RTX 2060 GPU. The GPU works fine, and I can run my machine learning training jobs with Docker. I don't understand why the k8s-device-plugin container in my Kubernetes cluster cannot see the GPU. I am using CRI-O as the default runtime and have also set it as the default using --set-as-default. I would appreciate it if someone could assist me. Below are screenshots of the device plugin container issue and of my nvidia-smi output, which shows that the GPU is working fine; my crio service is also running.
I0123 16:54:07.204118 1 main.go:235] "Starting NVIDIA Device Plugin" version=<
d475b2cf
commit: d475b2cfcf12b983a4975d4fc59d91af432cf28e
>
I0123 16:54:07.204992 1 main.go:238] Starting FS watcher for /var/lib/kubelet/device-plugins
I0123 16:54:07.205170 1 main.go:245] Starting OS watcher.
I0123 16:54:07.205480 1 main.go:260] Starting Plugins.
I0123 16:54:07.205513 1 main.go:317] Loading configuration.
I0123 16:54:07.206134 1 main.go:342] Updating config with default resource matching patterns.
I0123 16:54:07.206267 1 main.go:353]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  },
  "imex": {}
}
I0123 16:54:07.206276 1 main.go:356] Retrieving plugins.
E0123 16:54:07.206553 1 factory.go:112] Incompatible strategy detected auto
E0123 16:54:07.206570 1 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0123 16:54:07.206574 1 factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0123 16:54:07.206577 1 factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0123 16:54:07.206580 1 factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0123 16:54:07.206583 1 main.go:381] No devices found. Waiting indefinitely.
@smartlocus could you exec into the device plugin container and confirm that you can run nvidia-smi there? If this works, then the device plugin should be detecting the available devices. If not, then the injection of the driver and devices from the host into the container is not working as expected.
What is your current crio config? How is the NVIDIA Container Toolkit installed?
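The check suggested above can be scripted. The following is a minimal sketch, assuming `kubectl` is on the PATH and already points at the cluster; the function name `nvidia_smi_works`, the `runner` parameter, and the `kube-system` default namespace are illustrative assumptions, not part of the device plugin.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes an argv list and returns a CompletedProcess.
# It is a parameter so the check can be exercised without a cluster.
Runner = Callable[[Sequence[str]], subprocess.CompletedProcess]


def _run(cmd: Sequence[str]) -> subprocess.CompletedProcess:
    # Capture output so the NVML error text can be shown on failure.
    return subprocess.run(cmd, capture_output=True, text=True, check=False)


def nvidia_smi_works(pod: str, namespace: str = "kube-system",
                     runner: Runner = _run) -> bool:
    """Return True if `nvidia-smi` exits with status 0 inside the pod.

    The default namespace is an assumption; pass the namespace the
    device plugin DaemonSet is actually deployed in.
    """
    cmd = ["kubectl", "exec", "-n", namespace, pod, "--", "nvidia-smi"]
    result = runner(cmd)
    if result.returncode != 0:
        # nvidia-smi reports NVML failures on stdout; fall back to stderr.
        print(result.stdout or result.stderr)
    return result.returncode == 0
```

For example, `nvidia_smi_works("nvidia-device-plugin-daemonset-hsxjh")` would run the same command as the manual `kubectl exec` check. A `False` result points at the driver and device injection into the container rather than at the plugin itself.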
I have a similar issue. When I exec into the container, I cannot run nvidia-smi:
[]$ kubectl exec -it nvidia-device-plugin-daemonset-hsxjh -- bash
[root@nvidia-device-plugin-daemonset-hsxjh /]# nvidia-smi
Failed to initialize NVML: Insufficient Permissions
How can I find out which permissions it needs? Previously I ran RHEL 8 and the same settings worked well; after upgrading the host to RHEL 9, the plugin fails. On the host itself, nvidia-smi runs fine.
Here is my CRI-O config:
[crio]
[crio.image]
pause_image = "private-registry.mydomain.com/admin-relay/pause:3.10"
[crio.runtime]
default_runtime = "nvidia"
[crio.runtime.runtimes]
[crio.runtime.runtimes.crun]
allowed_annotations = ["io.containers.trace-syscall"]
monitor_path = "/usr/libexec/crio/conmon"
runtime_path = "/usr/libexec/crio/crun-1.19.1"
runtime_root = "/run/crun"
[crio.runtime.runtimes.nvidia]
monitor_path = "/usr/libexec/crio/conmon"
runtime_path = "/usr/bin/nvidia-container-runtime"
runtime_root = "/run/runc"
runtime_type = "oci"
[crio.runtime.runtimes.runc]
monitor_path = "/usr/libexec/crio/conmon"
runtime_path = "/usr/libexec/crio/runc"
runtime_root = "/run/runc"
The discovery job runs fine and reports the capabilities of my host, but the device plugin fails to recognize the GPU.
I0306 14:33:21.279351 1 main.go:235] "Starting NVIDIA Device Plugin" version=<
d475b2cf
commit: d475b2cfcf12b983a4975d4fc59d91af432cf28e
>
I0306 14:33:21.279463 1 main.go:238] Starting FS watcher for /var/lib/kubelet/device-plugins
I0306 14:33:21.279538 1 main.go:245] Starting OS watcher.
I0306 14:33:21.279776 1 main.go:260] Starting Plugins.
I0306 14:33:21.279817 1 main.go:317] Loading configuration.
I0306 14:33:21.280669 1 main.go:342] Updating config with default resource matching patterns.
I0306 14:33:21.280863 1 main.go:353]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  },
  "imex": {}
}
I0306 14:33:21.280873 1 main.go:356] Retrieving plugins.
E0306 14:33:21.296498 1 factory.go:93] Failed to initialize NVML: Insufficient Permissions.
E0306 14:33:21.296509 1 factory.go:94] If this is a GPU node, did you set the docker default runtime to `nvidia`?
E0306 14:33:21.296513 1 factory.go:95] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0306 14:33:21.296518 1 factory.go:96] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0306 14:33:21.296522 1 factory.go:97] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
W0306 14:33:21.296526 1 factory.go:101] nvml init failed: Insufficient Permissions
I0306 14:33:21.296534 1 main.go:381] No devices found. Waiting indefinitely.
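The two logs in this thread end with the same "No devices found" line but fail with different messages: the first reports "Incompatible strategy detected auto", the second "Failed to initialize NVML: Insufficient Permissions". A small sketch for telling them apart in saved plugin logs follows; the labels and the function name `classify` are illustrative assumptions, and only these two message texts from the thread are recognized.

```python
import re

# Message texts copied from the two logs above; labels are made up here.
SIGNATURES = [
    (re.compile(r"Failed to initialize NVML: (?P<reason>.+?)\.?$"),
     "nvml_init_failed"),
    (re.compile(r"Incompatible strategy detected (?P<reason>\S+)"),
     "incompatible_strategy"),
]


def classify(log_text: str):
    """Return (label, detail) for the first known error line, else None."""
    for line in log_text.splitlines():
        for pattern, label in SIGNATURES:
            match = pattern.search(line)
            if match:
                return label, match.group("reason")
    return None


first = "E0123 16:54:07.206553 1 factory.go:112] Incompatible strategy detected auto"
second = ("E0306 14:33:21.296498 1 factory.go:93] "
          "Failed to initialize NVML: Insufficient Permissions.")
print(classify(first))
print(classify(second))
```

The first case suggests the plugin could not settle on any discovery strategy, which is consistent with the driver libraries not being visible in the container at all. In the second case NVML was found but refused to initialize, which points at access restrictions on the node.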
I found the cause: I had to disable SELinux in crio.conf to make it work.
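The reply above does not say which key was changed. Assuming it was the `selinux` option in the `[crio.runtime]` table, a minimal sketch for reading that setting and `default_runtime` out of a config file could look like this. The function name `crio_runtime_settings` and the sample text are illustrative, and the scanner is deliberately simple, not a full TOML parser.

```python
import re


def crio_runtime_settings(conf_text: str) -> dict:
    """Collect key = value pairs found directly under [crio.runtime].

    Values are returned as raw strings. Comments are stripped naively,
    so a '#' inside a quoted value would be mishandled.
    """
    settings = {}
    table = None
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        header = re.fullmatch(r"\[([^\]]+)\]", line)
        if header:
            table = header.group(1).strip()
            continue
        if table == "crio.runtime" and "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            settings[key] = value.strip('"')
    return settings


# Hypothetical excerpt; `selinux = false` is the assumed fix.
sample = """
[crio.runtime]
default_runtime = "nvidia"
selinux = false
[crio.runtime.runtimes.nvidia]
runtime_path = "/usr/bin/nvidia-container-runtime"
"""
settings = crio_runtime_settings(sample)
print(settings)
```

On a node, the text would come from the active configuration, for example the contents of /etc/crio/crio.conf or a drop-in file, if that is where the settings live on that host.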