Nodes using the containerd runtime in the cluster can cause nodes running the Docker runtime to break
Issue or feature description
My K8s cluster has 2 nodes with NVIDIA GPUs:
- node1 containerRuntime is docker
- node2 containerRuntime is containerd
```
$ k get node -o wide
NAME     STATUS   ROLES           AGE    VERSION    INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
master   Ready    control-plane   261d   v1.24.10   100.64.4.51    <none>        Ubuntu 20.04.5 LTS   5.4.0-166-generic   docker://20.10.20
node1    Ready    compute         261d   v1.24.10   100.64.4.181   <none>        Ubuntu 20.04.5 LTS   5.4.0-144-generic   docker://20.10.20
node2    Ready    compute         259d   v1.24.10   100.64.4.62    <none>        Ubuntu 20.04.5 LTS   5.4.0-144-generic   containerd://1.6.4
```
Docker on node1 crashes after deploying GPU Operator v23.9.0.
The reason: the GPU Operator sets the runtime to containerd if at least one node is configured with containerd (reference). It then sets RUNTIME=containerd in the nvidia-container-toolkit-daemonset DaemonSet, whose pods run on both node1 and node2.
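For reference, this is how the configured runtime can be checked on the affected cluster (a quick sketch; the gpu-operator namespace is an assumption and depends on where the operator was installed):

```
# Show the RUNTIME env var the operator set on the container-toolkit DaemonSet.
# Namespace "gpu-operator" is assumed; adjust it to your install.
kubectl -n gpu-operator get daemonset nvidia-container-toolkit-daemonset -o yaml \
  | grep -A1 'name: RUNTIME$'
# Per the behaviour described above, on this cluster it reports:
#   - name: RUNTIME
#     value: containerd
```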
Expectation
The GPU Operator should support clusters whose nodes use both containerd and Docker as container runtimes at the same time.
@quanguachong we do not support this configuration currently. You can make this work by installing the container-toolkit packages manually on the node and disabling the toolkit container in the gpu-operator. This scenario is documented here:
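For anyone else hitting this, a minimal sketch of the workaround described above, assuming the operator is installed from the official Helm chart into the gpu-operator namespace (exact package and chart options may differ per version, so follow the linked documentation):

```
# On the Docker node, install and configure the NVIDIA Container Toolkit manually
# (assumes the NVIDIA package repository is already configured on the node):
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Deploy the GPU Operator with the toolkit container disabled, so it no longer
# pushes a single runtime configuration to every node:
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set toolkit.enabled=false
```

Note that with toolkit.enabled=false the operator no longer installs the toolkit on any node, so the containerd node also needs it installed and configured manually (e.g. nvidia-ctk runtime configure --runtime=containerd).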