intel-device-plugins-for-kubernetes

GPU: Error response from daemon: invalid volume specification

Open mzernovx opened this issue 2 years ago • 12 comments

Environment:

  • kubernetes 1.27.3
  • docker v20.10.20

Steps to reproduce:

  • Setup Intel Device Plugins
  • Create any pod with gpu.intel.com/i915 resource allocated
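For reference, a minimal sketch of a pod that requests the resource named in the steps above. The pod name, image, and command are hypothetical placeholders and not taken from the report; only the `gpu.intel.com/i915` resource name comes from this issue.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a minimal Pod manifest that requests Intel GPU resources."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,  # hypothetical image, any image should do
                    "command": ["sleep", "3600"],
                    # Extended resources are requested through limits.
                    "resources": {"limits": {"gpu.intel.com/i915": str(gpus)}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(gpu_pod_manifest("gpu-test", "busybox"), indent=2))
```

Assuming the script is saved as `pod.py` (a hypothetical file name), the output can be piped to `kubectl apply -f -`.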

Expected behaviour: pod running

Actual behaviour: pod in CreateContainerError state

  Warning  Failed     2m49s (x12 over 5m3s)  kubelet            Error: Error response from daemon: invalid volume specification: '/dev/dri/by-path/pci-0000:b7:00.0-card:/dev/dri/by-path/pci-0000:b7:00.0-card:ro'

Likely caused by this commit: https://github.com/intel/intel-device-plugins-for-kubernetes/commit/943e34f3af072929b342a71e8045124a6b32172a

mzernovx, Oct 12 '23 10:10

Thanks for reporting this. Did you verify that it only happens with the Docker runtime?

tkatila, Oct 12 '23 11:10

The change that causes this was introduced in version 0.26.1. You can work around it by using 0.26.0 in the meantime.

tkatila, Oct 12 '23 11:10

I remember we have had similar cases with volume mounts where the paths contained colons and Docker was used. Is Docker mandatory here, or could a proper CRI runtime be used?

mythi, Oct 12 '23 12:10

@tkatila I can confirm that with containerd it's working fine.

mzernovx, Oct 12 '23 13:10

> I remember we have had similar cases with volume mounts where the paths contained colons and Docker was used. Is Docker mandatory here, or could a proper CRI runtime be used?

BMRA/VMRA uses docker as a default container runtime.

mzernovx, Oct 12 '23 13:10

> docker v20.10.20

That's a bit old. The oldest Docker version listed on, for example, the Ubuntu packages site is v20.10.21, and the Ubuntu 20.04 LTS updates are already at 24.0.5: https://packages.ubuntu.com/focal-updates/docker.io

Have you tried any newer Docker version?

> kubernetes 1.27.3 ... BMRA/VMRA uses docker as a default container runtime.

They could consider updating that default, as Kubernetes deprecated Docker support after k8s v1.20: https://kubernetes.io/blog/2020/12/02/dont-panic-kubernetes-and-docker/

eero-t, Oct 12 '23 13:10

> Have you tried any newer Docker version?

I tried a newer version and it reproduces with it:

$ dpkg --list | grep Docker
ii  docker-buildx-plugin                             0.11.2-1~ubuntu.22.04~jammy                 amd64        Docker Buildx cli plugin.
ii  docker-ce                                        5:24.0.6-1~ubuntu.22.04~jammy               amd64        Docker: the open-source application container engine
ii  docker-ce-cli                                    5:24.0.6-1~ubuntu.22.04~jammy               amd64        Docker CLI: the open-source application container engine
ii  docker-ce-rootless-extras                        5:24.0.6-1~ubuntu.22.04~jammy               amd64        Rootless support for Docker.
ii  docker-compose-plugin                            2.21.0-1~ubuntu.22.04~jammy                 amd64        Docker Compose (V2) plugin for the Docker CLI.

Pod fails with:

  Warning  Failed     8s (x2 over 9s)  kubelet            Error: Error response from daemon: invalid volume specification: '/dev/dri/by-path/pci-0000:00:02.0-card:/dev/dri/by-path/pci-0000:00:02.0-card:ro'

Docker Engine is listed among the container runtimes in the k8s docs (https://kubernetes.io/docs/setup/production-environment/container-runtimes/#docker), which would suggest it's still "ok" to use it.

But to me this is a bug with the docker engine as it works fine with containerd and cri-o. My thought process for this is:

  1. File a bug against the Docker engine for failing to mount paths that contain a colon (:).
  2. For Docker installations of https://github.com/intel/container-experience-kits, stick with the 0.26.0 GPU plugin.
  3. If/when the Docker engine bug is resolved, update the GPU plugin to the latest version.

I do not want to remove the "by-path" mounting as it's required by distributed training. And adding some cli arg or env variable to temporarily disable it feels icky.

tkatila, Oct 13 '23 06:10

It seems that a colon in volumes/binds is a known issue:

  • https://github.com/docker/docker-py/issues/2041
  • https://github.com/moby/moby/issues/39293
  • https://github.com/moby/moby/issues/22825
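To illustrate why a colon in the path is a problem: the legacy bind syntax is `source:destination[:mode]`, so the colon doubles as the field separator. The following is a simplified sketch of a colon-splitting parser, not Docker's actual implementation; the function name `parse_bind_spec` is hypothetical.

```python
from typing import Tuple


def parse_bind_spec(spec: str) -> Tuple[str, str, str]:
    """Naively parse 'source:destination[:mode]' by splitting on ':'.

    Simplified illustration only; this is NOT Docker's real parser.
    """
    parts = spec.split(":")
    if len(parts) == 2:
        return parts[0], parts[1], "rw"
    if len(parts) == 3:
        return parts[0], parts[1], parts[2]
    # Any colon inside a path yields extra fields, so the spec is rejected.
    raise ValueError(f"invalid volume specification: {spec!r}")


if __name__ == "__main__":
    print(parse_bind_spec("/dev/dri/card0:/dev/dri/card0:ro"))
    bad = ("/dev/dri/by-path/pci-0000:b7:00.0-card:"
           "/dev/dri/by-path/pci-0000:b7:00.0-card:ro")
    try:
        parse_bind_spec(bad)
    except ValueError as err:
        print(err)
```

With the spec from this issue, splitting on the colon produces seven fields instead of three, which is why a parser of this kind cannot tell where the source path ends.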

tkatila, Oct 13 '23 06:10

It looks like there's a workaround using the --mount argument with Docker, but there's no clear way to make use of it from the Kubernetes side.
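As a rough sketch of what the Docker-side workaround looks like: `--mount` takes comma-separated `key=value` fields, so colons in the paths no longer act as separators. The helper name `bind_to_mount_arg` is hypothetical, and the sketch assumes the paths contain no commas. It only covers the Docker CLI and does not address the Kubernetes side.

```python
def bind_to_mount_arg(source: str, target: str, read_only: bool = True) -> str:
    """Build the value for `docker run --mount ...` for a bind mount.

    Assumes neither path contains a comma, since --mount is comma-separated.
    """
    fields = ["type=bind", f"source={source}", f"target={target}"]
    if read_only:
        fields.append("readonly")
    return ",".join(fields)


if __name__ == "__main__":
    path = "/dev/dri/by-path/pci-0000:b7:00.0-card"
    # Prints a value that can be passed as: docker run --mount <value> ...
    print(bind_to_mount_arg(path, path))
```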

The most suitable fix for this bug seems to be avoiding using /dev/dri/by-path/xxx as they are basically symlinks to devices in /dev/dri
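A minimal sketch of how the by-path entries map back to device nodes, assuming the usual layout where each entry under /dev/dri/by-path is a symlink. The function name `resolve_by_path` is hypothetical and this is not code from the plugin.

```python
import os
from typing import Dict


def resolve_by_path(by_path_dir: str = "/dev/dri/by-path") -> Dict[str, str]:
    """Map each by-path entry to the real path its symlink resolves to."""
    mapping: Dict[str, str] = {}
    if not os.path.isdir(by_path_dir):
        return mapping  # no DRM devices, or a different layout
    for entry in sorted(os.listdir(by_path_dir)):
        link = os.path.join(by_path_dir, entry)
        # realpath follows the symlink to the underlying device node.
        mapping[link] = os.path.realpath(link)
    return mapping


if __name__ == "__main__":
    for link, device in resolve_by_path().items():
        print(f"{link} -> {device}")
```

The resolved device paths contain no colons, which is why mounting them directly avoids the invalid volume specification error.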

mzernovx, Oct 13 '23 07:10

> The most suitable fix for this bug seems to be avoiding using /dev/dri/by-path/xxx

Is avoiding Docker not an option?

mythi, Oct 13 '23 07:10

> Is avoiding Docker not an option?

@mythi BMRA/VMRA still uses Docker as its "primary" container runtime. The product is built around customers and their needs, so avoiding Docker is not an option for us.

Downgrading the Intel device plugins to 0.26.0 can be considered a workaround, but not a fix.

mzernovx, Oct 13 '23 08:10