gpushare-device-plugin
containerd and nvidia-container-runtime instead of nvidia-docker2
Any chance to have the device plugin working on containerd without nvidia-docker2?
I have rebuilt my cluster with containerd, and the following packages are installed on my worker nodes: libnvidia-container, nvidia-container-toolkit, nvidia-container-runtime,
but the device plugin raises the error:
I0425 10:34:29.375414 1 main.go:18] Start gpushare device plugin
I0425 10:34:29.382160 1 gpumanager.go:28] Loading NVML
I0425 10:34:29.382601 1 gpumanager.go:31] Failed to initialize NVML: could not load NVML library.
I0425 10:34:29.382616 1 gpumanager.go:32] If this is a GPU node, did you set the docker default runtime to nvidia?
The default runtime has been set to nvidia-container-runtime:
[plugins."io.containerd.runtime.v1.linux"]
  no_shim = false
  runtime = "nvidia-container-runtime"
  runtime_root = ""
  shim = "containerd-shim"
  shim_debug = false
Has anyone found a workaround? Is there any plan to replace nvidia-docker2 with nvidia-container-runtime?
Thanks
Yes, I followed this guide to get containerd running, but I still have issues: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#containerd
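For comparison, the containerd setup in that guide registers the NVIDIA runtime under the CRI plugin section, which is the section Kubernetes pods go through, rather than under the legacy io.containerd.runtime.v1.linux section. A minimal sketch of that layout follows, assuming containerd config version 2, a runtime entry named "nvidia", and the binary installed at /usr/bin/nvidia-container-runtime. All three are assumptions to adjust for your install, and nothing in this thread confirms that this alone resolves the NVML error with the gpushare plugin.

```toml
# Sketch of /etc/containerd/config.toml (path and values are assumptions).
version = 2

# Make the NVIDIA runtime the default for containers started through CRI.
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

# Register a runtime named "nvidia" that uses the runc v2 shim.
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"

# Point the shim at the NVIDIA wrapper binary (assumed install path).
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
```

containerd has to be restarted after changing this file, and the device plugin pod recreated, before the new default runtime applies to it.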
This is a great topic. Now that Kubernetes has removed support for Docker as a container runtime, has anyone found a workaround to implement GPU sharing with containerd without issues? Thanks