nvidia-device-plugin container crashing on nodes without GPU cards
1. Issue description
The pods on nodes without GPU cards are in CrashLoopBackOff state. The following message is logged.
container_linux.go:247: starting container process caused "process_linux.go:337: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: exec command: [/nvidia/bin/nvidia-container-cli --load-kmods configure --device=all --utility --pid=30846 /var/lib/docker/devicemapper/mnt/94defd8b562a5cb0bc9df1bdb718c9730c116cb959011c8919915be3e59f7d4f/rootfs]\\nnvidia-container-cli: initialization error: driver error: failed to process request\\n\""
The root cause is clear: the NVIDIA driver is not installed on this machine, which is expected since it has no GPU cards.
The question is how to run the k8s-device-plugin in a heterogeneous cluster where some machines have GPUs and others don't. Having half of the DaemonSet pods in CrashLoopBackOff state doesn't cause functional problems, but it looks pretty bad.
I ended up manually labeling the nodes that have GPUs and adding a nodeSelector to the DaemonSet. This is the best workaround I could come up with.
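For reference, this is roughly what that looks like: a minimal sketch of the device-plugin DaemonSet with a nodeSelector added. The label key/value (accelerator: nvidia) and the image tag are placeholders I picked for illustration, not something the plugin itself requires.

```yaml
# Hypothetical label applied by hand to each GPU node, e.g.:
#   kubectl label node gpu01 accelerator=nvidia
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      nodeSelector:
        accelerator: nvidia          # only schedule onto the manually labeled GPU nodes
      containers:
      - name: nvidia-device-plugin-ctr
        image: nvcr.io/nvidia/k8s-device-plugin:v0.9.0   # substitute whatever version you deploy
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
```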
I also evaluated the NVIDIA/gpu-feature-discovery project, but it has the same problem: its daemonset pods crash on non-GPU nodes.
Hi, yes, the best thing to do for the moment is to add a nodeSelector. Instead of labeling nodes manually, you may want to use Node Feature Discovery with the PCI source. Concerning GPU Feature Discovery, we will soon release a new version with a fix for this. Thank you for your report.
Using NFD is the recommended approach. It labels the nodes that have GPUs, and the nodeSelector then prevents nodes without GPUs from running the driver container.
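With NFD's pci source enabled, the nodeSelector can key off an NFD-generated label instead of a hand-applied one. The exact label name depends on the NFD version and its pci source configuration; the form below, based on NVIDIA's PCI vendor ID 0x10de, is my assumption, so check kubectl get node <gpu-node> --show-labels to see what your deployment actually sets.

```yaml
# Same DaemonSet as in the sketch above; only the nodeSelector changes.
# The label name may differ depending on the NFD version/configuration.
spec:
  template:
    spec:
      nodeSelector:
        feature.node.kubernetes.io/pci-10de.present: "true"
```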
Hi! Wondering if there is any update on this issue? I am having the same problem with the (non-GPU) management node in a cluster of 1 management node (a VM instance) and 3 GPU nodes (3x DGX Station), deployed with DeepOps 20.10. The NFD containers appear, but that doesn't seem to stop the management node from accepting the nvidia-device-plugin container. Thanks in advance :)
NAMESPACE                NAME                          READY   STATUS             RESTARTS   AGE   IP              NODE     NOMINATED NODE   READINESS GATES
kube-system              nvidia-device-plugin-2shct    1/1     Running            3315       11d   10.233.113.15   gpu03    <none>           <none>
kube-system              nvidia-device-plugin-gbcdg    1/1     Running            0          11d   10.233.112.2    gpu02    <none>           <none>
kube-system              nvidia-device-plugin-qlsfx    0/1     CrashLoopBackOff   3324       11d   10.233.92.5     mgmt01   <none>           <none>
kube-system              nvidia-device-plugin-s9vv5    1/1     Running            0          11d   10.233.91.2     gpu01    <none>           <none>
node-feature-discovery   gpu-feature-discovery-26d2w   1/1     Running            0          11d   10.233.112.4    gpu02    <none>           <none>
node-feature-discovery   gpu-feature-discovery-b7hqk   1/1     Running            3          57m   10.233.113.14   gpu03    <none>           <none>
node-feature-discovery   gpu-feature-discovery-j2xrl   1/1     Running            0          11d   10.233.91.4     gpu01    <none>           <none>
node-feature-discovery   nfd-master-bc8c476d9-m79xp    1/1     Running            0          11d   10.233.92.3     mgmt01   <none>           <none>
node-feature-discovery   nfd-worker-9bxsp              1/1     Running            46         11d   10.233.113.16   gpu03    <none>           <none>
node-feature-discovery   nfd-worker-f4qhb              1/1     Running            48         11d   10.233.92.4     mgmt01   <none>           <none>
node-feature-discovery   nfd-worker-qnm6p              1/1     Running            22         11d   10.233.91.3     gpu01    <none>           <none>
node-feature-discovery   nfd-worker-s97k8              1/1     Running            32         11d   10.233.112.3    gpu02    <none>           <none>
The recommended method is to use a toleration or a node selector to deploy the plugin only on nodes that actually have GPUs on them.
If you really want to deploy it on all nodes (and just have it sit in a blocked state on non-GPU nodes), then you can set the failOnInitError flag to false when deploying it.
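For recent plugin versions this can be done through the Helm chart's failOnInitError value or, for a raw manifest, by passing the corresponding setting to the container. The env var spelling below is my assumption from current releases; verify it against the version you run (the equivalent CLI flag is documented as --fail-on-init-error).

```yaml
# DaemonSet container fragment: tolerate initialization failures (e.g. no driver)
# instead of exiting, so the pod idles on non-GPU nodes rather than crash-looping.
      containers:
      - name: nvidia-device-plugin-ctr
        image: nvcr.io/nvidia/k8s-device-plugin:v0.9.0
        env:
        - name: FAIL_ON_INIT_ERROR      # assumed env var; the Helm value is failOnInitError
          value: "false"
```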
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
This issue was automatically closed due to inactivity.