
After a node reboot, the k8s-device-plugin pod reports "No devices found"

Open xqlang opened this issue 2 years ago • 6 comments


1. Issue or feature description

GPU model: NVIDIA-3060
k8s-device-plugin status: CrashLoopBackOff
version: 0.14.0

2. Steps to reproduce the issue

Reboot the node, then check the k8s-device-plugin status: CrashLoopBackOff.

Pod logs:

2023/06/12 09:21:24 Loading NVML
2023/06/12 09:21:25 Starting FS watcher.
2023/06/12 09:21:25 Starting OS watcher.
2023/06/12 09:21:25 Retreiving plugins.
2023/06/12 09:21:25 No devices found. Waiting indefinitely.

Recovery: restart the k8s-device-plugin pod, then run nvidia-smi inside that pod:

nvidia-smi

Tue Jun 13 10:53:26 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 470.74       CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:3D:00.0 Off |                  N/A |
| 30%   35C    P8     9W / 170W |      0MiB / 12053MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:5E:00.0 Off |                  N/A |
| 30%   36C    P8     7W / 170W |      0MiB / 12053MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

3. Information to attach (optional if deemed irrelevant)

After the node reboots, the pod cannot detect the GPUs and has to be restarted before it recognizes them.

xqlang, Jun 13 '23 11:06

Temporary workaround: first run nvidia-smi on the host system, then restart the k8s-device-plugin pod to restore functionality.

xqlang, Jun 14 '23 02:06

Thanks a lot for the info. I found "gpus: null" in the plugin log, and it recovered after a restart.

But I have another problem: tfserving does not use the GPU inside the pod.

[image]

[image]

The monitoring command reports "No running processes found".

The environment variable NVIDIA_VISIBLE_DEVICES is set in the container, and tfserving shows no other errors.

KeithTt, Jun 18 '23 06:06

[image]

Tracing with strace showed that, after the system boots, the first execution of nvidia-smi creates /dev/nvidia0. Once this file exists, the GPU can be accessed and used for computation within the pod. The temporary workaround we are using is to run echo 'nvidia-smi -pm 1' >> /etc/rc.local, so that nvidia-smi runs at boot and the device files exist even though they are not created during the initial system load.
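
The check behind this workaround could be sketched as follows. This is only an illustration under stated assumptions, not something from the plugin or the driver: the helper names gpu_device_nodes, run_nvidia_smi, and ensure_gpu_nodes are hypothetical, and the only behavior relied on is the one observed above, that running nvidia-smi creates the /dev/nvidiaN files.

```python
import re
import subprocess
from pathlib import Path

# Matches per-GPU nodes such as nvidia0 or nvidia1 (not nvidiactl or nvidia-uvm).
_GPU_NODE = re.compile(r"^nvidia\d+$")


def gpu_device_nodes(dev_dir="/dev"):
    """Return the sorted names of per-GPU device nodes found in dev_dir."""
    return sorted(p.name for p in Path(dev_dir).iterdir() if _GPU_NODE.match(p.name))


def run_nvidia_smi():
    """Run nvidia-smi once; per the observation above, this creates the nodes."""
    subprocess.run(["nvidia-smi"], check=True)


def ensure_gpu_nodes(dev_dir="/dev", trigger=run_nvidia_smi):
    """Call trigger only if no per-GPU node exists, then return what is present."""
    if not gpu_device_nodes(dev_dir):
        trigger()
    return gpu_device_nodes(dev_dir)
```

Calling ensure_gpu_nodes() from a boot-time script, before the kubelet starts the plugin pod, would have the same effect as the rc.local line. The trigger parameter exists only so the logic can be exercised without a GPU.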

xqlang, Jul 24 '23 03:07

Because the NVIDIA driver is not GPL compatible, the kernel driver cannot create these device files for you, and some mechanism in user-space must always do this.

If you install the NVIDIA driver through standard packages on e.g. Ubuntu, udev rules are put in place to handle this for you. If you run on a different distribution, or install the NVIDIA driver from its .run file, you have to take care of this yourself.

As you point out, running nvidia-smi is another way to trigger creation of these device nodes, as that is something initiated in user-space that has the ability to create the nodes for you.
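
For anyone who wants to create the nodes by hand rather than rely on nvidia-smi, the first step is finding the major numbers the loaded kernel module has registered. A minimal sketch, assuming the module is already loaded and that the relevant entries in /proc/devices have names starting with "nvidia" (the exact names vary by driver version, and nvidia_char_majors is a hypothetical helper):

```python
def nvidia_char_majors(proc_devices_text):
    """Map each nvidia* character-device name in /proc/devices to its major number."""
    majors = {}
    in_char_section = False
    for line in proc_devices_text.splitlines():
        stripped = line.strip()
        # Section headers look like "Character devices:" and "Block devices:".
        if stripped.endswith("devices:"):
            in_char_section = stripped.startswith("Character")
            continue
        parts = stripped.split()
        if in_char_section and len(parts) == 2 and parts[1].startswith("nvidia"):
            majors[parts[1]] = int(parts[0])
    return majors
```

On a real node this would be called as nvidia_char_majors(open("/proc/devices").read()). The resulting major numbers are what a mknod-based script or a custom udev rule would need; an empty result suggests the module is not loaded yet.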

klueska, Jul 24 '23 08:07

Thank you! I installed the driver using the .run file.

xqlang, Jul 24 '23 11:07

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot], Feb 28 '24 04:02