
nvidia-device-plugin container crashing on nodes without GPU cards

Open peterableda opened this issue 6 years ago • 5 comments

1. Issue description

The pods on nodes without GPU cards are in CrashLoopBackOff state. The following message is logged.

container_linux.go:247: starting container process caused "process_linux.go:337: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: exec command: [/nvidia/bin/nvidia-container-cli --load-kmods configure --device=all --utility --pid=30846 /var/lib/docker/devicemapper/mnt/94defd8b562a5cb0bc9df1bdb718c9730c116cb959011c8919915be3e59f7d4f/rootfs]\\nnvidia-container-cli: initialization error: driver error: failed to process request\\n\""

The root cause is pretty obvious: the NVIDIA driver is not installed on this specific machine, which is expected since it has no GPU cards.

The question is how to run the k8s-device-plugin in heterogeneous clusters where some machines have GPUs and others don't. Having half of the DaemonSet pods in CrashLoopBackOff state doesn't cause functional issues, but it looks pretty bad.

peterableda avatar May 28 '19 14:05 peterableda

I ended up manually labeling the nodes that have GPUs and adding a nodeSelector to the DaemonSet. This is the best I could come up with.
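Roughly what that looks like (the label key here is just something I made up; any consistent key works):

kubectl label node <gpu-node-name> accelerator=nvidia-gpu

and then in the device plugin DaemonSet spec:

spec:
  template:
    spec:
      nodeSelector:
        accelerator: nvidia-gpu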

I also evaluated the NVIDIA/gpu-feature-discovery project, but it has the same problem: its DaemonSet pods crash on non-GPU nodes.

peterableda avatar May 31 '19 09:05 peterableda

Hi, yes, the best thing to do for the moment is to add a nodeSelector. Instead of adding labels manually, you may want to use Node Feature Discovery with the PCI source. As for GPU Feature Discovery, we will soon release a new version with a fix for this. Thank you for your report.
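For example, once the PCI source is enabled, GPU nodes carry a vendor label you can select on. The exact key depends on your NFD configuration (10de is NVIDIA's PCI vendor ID, and some setups include the device class in the key as well), so something along these lines:

spec:
  template:
    spec:
      nodeSelector:
        feature.node.kubernetes.io/pci-10de.present: "true"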

jjacobelli avatar May 31 '19 16:05 jjacobelli

Using NFD is the recommended approach. It will label the nodes that have GPUs and prevent nodes without GPUs from running the driver container.
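You can check which nodes carry the label before wiring up the nodeSelector, e.g. (the label key is just an example of what the NFD PCI source may produce in your cluster):

kubectl get nodes -L feature.node.kubernetes.io/pci-10de.present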

nvjmayo avatar Jul 20 '20 18:07 nvjmayo

Hi! Wondering if there is any update on this issue? I am having the same problem with the (non-GPU) management node in a cluster with one management node (a VM instance) and three GPU nodes (3x DGX Stations), deployed with DeepOps 20.10. The NFD containers are running, but that doesn't seem to stop the management node from accepting the nvidia-device-plugin container. Thanks in advance :)

kube-system              nvidia-device-plugin-2shct                    1/1     Running            3315       11d   10.233.113.15   gpu03    <none>           <none>
kube-system              nvidia-device-plugin-gbcdg                    1/1     Running            0          11d   10.233.112.2    gpu02    <none>           <none>
kube-system              nvidia-device-plugin-qlsfx                    0/1     CrashLoopBackOff   3324       11d   10.233.92.5     mgmt01   <none>           <none>
kube-system              nvidia-device-plugin-s9vv5                    1/1     Running            0          11d   10.233.91.2     gpu01    <none>           <none>
node-feature-discovery   gpu-feature-discovery-26d2w                   1/1     Running            0          11d   10.233.112.4    gpu02    <none>           <none>
node-feature-discovery   gpu-feature-discovery-b7hqk                   1/1     Running            3          57m   10.233.113.14   gpu03    <none>           <none>
node-feature-discovery   gpu-feature-discovery-j2xrl                   1/1     Running            0          11d   10.233.91.4     gpu01    <none>           <none>
node-feature-discovery   nfd-master-bc8c476d9-m79xp                    1/1     Running            0          11d   10.233.92.3     mgmt01   <none>           <none>
node-feature-discovery   nfd-worker-9bxsp                              1/1     Running            46         11d   10.233.113.16   gpu03    <none>           <none>
node-feature-discovery   nfd-worker-f4qhb                              1/1     Running            48         11d   10.233.92.4     mgmt01   <none>           <none>
node-feature-discovery   nfd-worker-qnm6p                              1/1     Running            22         11d   10.233.91.3     gpu01    <none>           <none>
node-feature-discovery   nfd-worker-s97k8                              1/1     Running            32         11d   10.233.112.3    gpu02    <none>           <none>

aaroncnb avatar Jan 05 '21 09:01 aaroncnb

The recommended method is to use a toleration or a node selector to only deploy the plugin on nodes that actually have GPUs on them.
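A sketch of how the two combine, assuming GPU nodes are labeled as in the earlier comments and tainted with nvidia.com/gpu:NoSchedule (both the label key and the taint key here are illustrative; adjust to your cluster):

spec:
  template:
    spec:
      nodeSelector:
        accelerator: nvidia-gpu
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule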

If you really want to deploy it on all nodes (and just have it sit in a blocked state on non-GPU nodes), then you can set the failOnInitError flag to false when deploying it.
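For example, when deploying with the Helm chart, something like the following should work (verify the value name against the chart version you are using):

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --set failOnInitError=false

If you deploy the static manifest instead, the same setting is exposed as a command-line flag on the plugin container (check the plugin's --help output for the exact name).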

klueska avatar Jan 06 '21 12:01 klueska

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Feb 29 '24 04:02 github-actions[bot]

This issue was automatically closed due to inactivity.

github-actions[bot] avatar Mar 31 '24 04:03 github-actions[bot]