intel-device-plugins-for-kubernetes
GPU: Error response from daemon: invalid volume specification
Environment:
- kubernetes 1.27.3
- docker v20.10.20
Steps to reproduce:
- Setup Intel Device Plugins
- Create any pod with gpu.intel.com/i915 resource allocated
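For anyone trying to reproduce this, a minimal sketch of a pod that requests the resource could look like the following. Only the resource name gpu.intel.com/i915 comes from this report; the pod name "i915-test", the busybox image and the sleep command are placeholders I made up for illustration.

```python
import json


def gpu_pod(name: str, image: str, gpus: int = 1) -> dict:
    """Build a minimal Pod manifest requesting gpu.intel.com/i915.

    The pod name, image and command are hypothetical placeholders.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": name,
                "image": image,
                "command": ["sleep", "3600"],
                # Extended resources are requested through limits.
                "resources": {"limits": {"gpu.intel.com/i915": gpus}},
            }],
        },
    }


pod = gpu_pod("i915-test", "busybox")
# kubectl accepts JSON manifests as well as YAML.
print(json.dumps(pod, indent=2))
```

Assuming the script is saved as pod.py (a name chosen here), the output can be piped to `kubectl apply -f -`.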
Expected behaviour: pod running
Actual behaviour:
pod in CreateContainerError state
Warning Failed 2m49s (x12 over 5m3s) kubelet Error: Error response from daemon: invalid volume specification: '/dev/dri/by-path/pci-0000:b7:00.0-card:/dev/dri/by-path/pci-0000:b7:00.0-card:ro'
Likely caused by this commit: https://github.com/intel/intel-device-plugins-for-kubernetes/commit/943e34f3af072929b342a71e8045124a6b32172a
Thanks for reporting this. Did you verify that it happens only with the docker runtime?
The change that is causing this was introduced in version 0.26.1. You can work around it by using 0.26.0 in the meantime.
I remember we have had similar cases with volume mounts where the paths contained colons and docker was used. Is docker mandatory here, or could a proper CRI runtime be used?
@tkatila I can confirm that with containerd it's working fine.
BMRA/VMRA uses docker as a default container runtime.
docker v20.10.20
That's a bit old. The oldest Docker version listed e.g. on the Ubuntu packages site is v20.10.21, and Ubuntu 20.04 LTS updates are already at 24.0.5: https://packages.ubuntu.com/focal-updates/docker.io
Have you tried any newer Docker version?
kubernetes 1.27.3 ... BMRA/VMRA uses docker as a default container runtime.
They could consider updating that default, as Kubernetes deprecated its built-in Docker support (dockershim) in k8s v1.20 and removed it in v1.24: https://kubernetes.io/blog/2020/12/02/dont-panic-kubernetes-and-docker/
I tried a newer version and it reproduces with it:
$ dpkg --list | grep Docker
ii docker-buildx-plugin 0.11.2-1~ubuntu.22.04~jammy amd64 Docker Buildx cli plugin.
ii docker-ce 5:24.0.6-1~ubuntu.22.04~jammy amd64 Docker: the open-source application container engine
ii docker-ce-cli 5:24.0.6-1~ubuntu.22.04~jammy amd64 Docker CLI: the open-source application container engine
ii docker-ce-rootless-extras 5:24.0.6-1~ubuntu.22.04~jammy amd64 Rootless support for Docker.
ii docker-compose-plugin 2.21.0-1~ubuntu.22.04~jammy amd64 Docker Compose (V2) plugin for the Docker CLI.
Pod fails with:
Warning Failed 8s (x2 over 9s) kubelet Error: Error response from daemon: invalid volume specification: '/dev/dri/by-path/pci-0000:00:02.0-card:/dev/dri/by-path/pci-0000:00:02.0-card:ro'
Docker Engine is listed among the container runtimes in the k8s docs: https://kubernetes.io/docs/setup/production-environment/container-runtimes/#docker which would suggest it's still "ok" to use it.
But to me this is a bug with the docker engine as it works fine with containerd and cri-o. My thought process for this is:
- File a bug for the docker engine about it not being able to mount paths with ":".
- https://github.com/intel/container-experience-kits for docker installation, stick with 0.26.0 GPU plugin
- If/when the docker engine bug is resolved, update the GPU plugin to the latest version
I do not want to remove the "by-path" mounting as it's required by distributed training. And adding some cli arg or env variable to temporarily disable it feels icky.
It seems that a colon in volumes/binds is a known issue: https://github.com/docker/docker-py/issues/2041 https://github.com/moby/moby/issues/39293 https://github.com/moby/moby/issues/22825
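To illustrate why the colon is a problem, here is a simplified model of a colon-delimited bind spec ("source:target:mode"). This is not Docker's actual parser, just a naive split that shows the ambiguity; /dev/dri/card0 is an assumed plain device name used for comparison.

```python
def split_bind_spec(spec: str) -> list[str]:
    """Split a bind spec on every colon, as a naive parser would.

    Simplified illustration only, not Docker's real parsing code.
    """
    return spec.split(":")


def is_unambiguous(spec: str) -> bool:
    """'source:target' or 'source:target:mode' gives 2 or 3 fields."""
    return len(split_bind_spec(spec)) in (2, 3)


# Assumed plain device path, for comparison.
plain = "/dev/dri/card0:/dev/dri/card0:ro"
# The spec from the error message in this issue.
by_path = ("/dev/dri/by-path/pci-0000:b7:00.0-card:"
           "/dev/dri/by-path/pci-0000:b7:00.0-card:ro")

print(split_bind_spec(plain))         # ['/dev/dri/card0', '/dev/dri/card0', 'ro']
print(len(split_bind_spec(by_path)))  # 7
print(is_unambiguous(by_path))        # False
```

The PCI address contributes two colons to each path, so a splitter that expects at most three fields cannot tell where the source ends and the target begins.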
Looks like there's a workaround to use --mount arg with Docker but there's no clear way to utilize this from the side of Kubernetes.
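For reference, a sketch of what the --mount form looks like when calling Docker directly. The --mount syntax takes comma-separated key=value pairs, so colons in the path are not used as separators. The busybox image and the ls command are placeholders, and I have not run this on a GPU host, so treat it as an illustration of the syntax only.

```python
import shlex


def mount_arg(source: str, target: str, read_only: bool = True) -> str:
    """Build the value for docker's --mount flag for a bind mount."""
    parts = ["type=bind", f"source={source}", f"target={target}"]
    if read_only:
        parts.append("readonly")
    return ",".join(parts)


path = "/dev/dri/by-path/pci-0000:b7:00.0-card"
arg = mount_arg(path, path)

# Placeholder image and command, chosen only for illustration.
cmd = ["docker", "run", "--rm", "--mount", arg, "busybox", "ls", "-l", path]
print(shlex.join(cmd))
```

This only helps when invoking Docker by hand; as said above, the kubelet builds the volume spec itself, so there is no obvious way to select this form from Kubernetes.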
The most suitable fix for this bug seems to be to avoid the /dev/dri/by-path/xxx paths, as they are basically symlinks to devices in /dev/dri.
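A small sketch of the symlink relationship, using a mock directory tree so it runs without a GPU. The helper name resolve_by_path and the mapping of pci-0000:b7:00.0-card to card0 are assumptions for the demo; on a real host you would pass "/dev/dri" instead.

```python
import os
import tempfile


def resolve_by_path(dri_dir: str) -> dict[str, str]:
    """Map each entry in <dri_dir>/by-path to the path it resolves to."""
    by_path = os.path.join(dri_dir, "by-path")
    return {
        name: os.path.realpath(os.path.join(by_path, name))
        for name in sorted(os.listdir(by_path))
    }


# Mock tree standing in for /dev/dri (assumed layout, for illustration).
with tempfile.TemporaryDirectory() as root:
    dri = os.path.join(root, "dri")
    os.makedirs(os.path.join(dri, "by-path"))
    open(os.path.join(dri, "card0"), "w").close()
    os.symlink("../card0", os.path.join(dri, "by-path", "pci-0000:b7:00.0-card"))

    mapping = resolve_by_path(dri)
    for name, target in mapping.items():
        print(name, "->", os.path.basename(target))
```

The resolved name has no colon in it, which is why mounting the resolved device would sidestep the Docker problem, at the cost of losing the stable PCI-based name.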
Is avoiding docker not an option?
@mythi BMRA/VMRA still uses docker as a "primary" container runtime. The product is built around customers and their needs, so avoiding Docker is not an option for us.
Downgrading Intel DP to 0.26.0 can be considered as a workaround, but not a fix.