gpushare-device-plugin

containerd and nvidia-container-runtime instead of nvidia-docker2

Open Frank-17 opened this issue 2 years ago • 2 comments

Any chance to have the device plugin working on containerd without nvidia-docker2?

I have rebuilt my cluster with containerd, and the following packages are installed on my worker nodes: libnvidia-container, nvidia-container-toolkit, nvidia-container-runtime.

However, the device plugin raises this error:

I0425 10:34:29.375414 1 main.go:18] Start gpushare device plugin
I0425 10:34:29.382160 1 gpumanager.go:28] Loading NVML
I0425 10:34:29.382601 1 gpumanager.go:31] Failed to initialize NVML: could not load NVML library.
I0425 10:34:29.382616 1 gpumanager.go:32] If this is a GPU node, did you set the docker default runtime to nvidia?

The default runtime has been set to nvidia-container-runtime:

[plugins."io.containerd.runtime.v1.linux"]
  no_shim = false
  runtime = "nvidia-container-runtime"
  runtime_root = ""
  shim = "containerd-shim"
  shim_debug = false

Has anyone found a workaround? Is there any plan to replace nvidia-docker2 with nvidia-container-runtime?

Thanks

Frank-17, Apr 25 '22 11:04

Yes, I followed this guide to get containerd running, but I still have issues: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#containerd

vio-f, Jun 23 '22 13:06
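
For reference, the containerd section of the NVIDIA guide linked above sets the default runtime in the CRI plugin section, not in the legacy `io.containerd.runtime.v1.linux` section quoted in the original post. Kubernetes pods are started through the CRI plugin, so a runtime set only in the legacy section is likely not applied to them. Below is a minimal sketch of the relevant part of `/etc/containerd/config.toml`. It assumes a version 2 containerd config and that the runtime binary is installed at `/usr/bin/nvidia-container-runtime`; both should be checked on the node. It has not been verified against gpushare-device-plugin.

```toml
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd]
  # Make the NVIDIA runtime the default for all pods on this node.
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      # Assumed install path; adjust if the binary lives elsewhere.
      BinaryName = "/usr/bin/nvidia-container-runtime"
```

containerd has to be restarted after the change (for example with `systemctl restart containerd`), and the device plugin pod has to be recreated so that it starts under the new default runtime.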

This is a great topic. Now that Kubernetes has removed support for Docker as a container runtime, has anyone found a way to implement GPU sharing with containerd without issues? Thanks

has-avila, May 02 '24 18:05