coreos-nvidia

nvidia-docker v2?

Open thomas-riccardi opened this issue 8 years ago • 8 comments

Hi, are there plans to use nvidia-docker v2 (now merged into master as the new official version)?

It is simpler to use: https://github.com/NVIDIA/nvidia-docker/wiki/About-version-2.0

thomas-riccardi avatar Nov 13 '17 09:11 thomas-riccardi

The links above are broken. I guess it's because the 2.0 branch was recently merged into master in https://github.com/NVIDIA/nvidia-docker/commit/fe1874942b896df074ca1b5b819bc6a2ca9e8151

rporres avatar Nov 24 '17 11:11 rporres

@rporres indeed, I updated my comment.

thomas-riccardi avatar Nov 24 '17 13:11 thomas-riccardi

Does it require any changes? The current version was built for bare docker, not even nvidia-docker 1.0.

mcuadros avatar Dec 22 '17 10:12 mcuadros

Using nvidia-docker v2 would simplify the docker run part, since there would be no need to add:

--volumes-from nvidia-driver \
    --env PATH=$PATH:/opt/nvidia/bin/ \
    --env LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/nvidia/lib \
    $(for d in /dev/nvidia*; do echo -n "--device $d "; done) \

So the change required is in fact installing nvidia-docker v2 on CoreOS and removing the nvidia-driver container.
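As a rough illustration of the difference, the sketch below prints the two command lines side by side rather than running them, so it works on a machine without Docker or a GPU. The function names, the optional device-directory argument, and the `my-image` placeholder are hypothetical; `--runtime=nvidia` assumes the nvidia runtime has been registered with the Docker daemon.

```shell
#!/bin/sh
# Sketch only: prints command lines instead of executing them.

# Emit one "--device <node>" flag per nvidia device node.
#   $1: directory to scan (hypothetical override, defaults to /dev)
device_flags() {
    dev_dir="${1:-/dev}"
    for d in "$dev_dir"/nvidia*; do
        # Skip the unexpanded glob when no device nodes exist.
        [ -e "$d" ] || continue
        printf -- '--device %s ' "$d"
    done
}

# Current approach: driver volume, PATH/LD_LIBRARY_PATH and explicit devices.
#   $1: image name, $2: optional device directory
legacy_cmd() {
    printf 'docker run --rm --volumes-from nvidia-driver --env PATH=$PATH:/opt/nvidia/bin/ --env LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/nvidia/lib %s%s\n' \
        "$(device_flags "$2")" "$1"
}

# nvidia-docker v2 approach: the runtime injects driver files and devices.
#   $1: image name
v2_cmd() {
    printf 'docker run --rm --runtime=nvidia %s\n' "$1"
}

# "my-image" is a placeholder, not an image named in this thread.
legacy_cmd my-image
v2_cmd my-image
```

If the default runtime is set to nvidia in the daemon configuration, even the `--runtime=nvidia` flag can be dropped.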

thomas-riccardi avatar Dec 22 '17 17:12 thomas-riccardi

I used the following steps to install nvidia-docker v2 (very hacky though):

  1. install nvidia driver
  2. instead of the volume I simply copy the files to the host, e.g.
     /usr/bin/docker run --rm --volume /opt/nvidia/current:/output srcd/coreos-nvidia:${VERSION} cp -a /opt/nvidia/. /output/
  3. install libnvidia-container
  4. (build and) install nvidia-container-runtime
  5. create small bash scripts in /run/torcx/bin for nvidia-container{-runtime,-runtime-hook,-cli} to make sure they are accessible by docker and libraries are in LD_LIBRARY_PATH
  6. create /etc/docker/daemon.json and set default runtime to nvidia
  7. restart docker
  8. add the nvidia-docker bash scripts
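The wrapper-script and daemon.json steps above could look roughly like the sketch below. It is not the exact setup used here: the function names are hypothetical, and the real binary location and library directory in the usage comment are assumptions (the library path is guessed from the copy target in the docker run command above). The `default-runtime` and `runtimes` keys are standard Docker daemon options.

```shell
#!/bin/sh
# Sketch only: every path passed to these helpers is an assumption.

# Create a wrapper that puts the driver libraries on LD_LIBRARY_PATH
# and then execs the real binary with the original arguments.
#   $1: wrapper to create, $2: real binary, $3: driver library directory
write_wrapper() {
    cat > "$1" <<EOF
#!/bin/sh
export LD_LIBRARY_PATH="$3\${LD_LIBRARY_PATH:+:\$LD_LIBRARY_PATH}"
exec "$2" "\$@"
EOF
    chmod +x "$1"
}

# Write a daemon.json that registers the nvidia runtime and makes it the default.
#   $1: file to write, $2: path to the runtime binary (or its wrapper)
write_daemon_json() {
    cat > "$1" <<EOF
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "$2",
            "runtimeArgs": []
        }
    }
}
EOF
}

# Example use as root on the host (hypothetical install location /opt/bin):
#   write_wrapper /run/torcx/bin/nvidia-container-runtime \
#       /opt/bin/nvidia-container-runtime /opt/nvidia/current/lib
#   write_daemon_json /etc/docker/daemon.json /run/torcx/bin/nvidia-container-runtime
#   systemctl restart docker
```

Note that `write_daemon_json` overwrites the target file, so any existing daemon options would need to be merged in by hand.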

There is currently only one issue: nvidia-container-runtime somehow has a regression (even though it is built from the same commit as the installed runc) and fails to run containers with docker run --security-opt=no-new-privileges (https://github.com/coreos/bugs/issues/1796).

trevex avatar Jan 29 '18 09:01 trevex

We have it working as well (nvidia-docker v2 + coreos + k8s device plugin). We will try to clean it up and hopefully be able to share it soonish.

lsjostro avatar Feb 02 '18 15:02 lsjostro

We went for this instead: https://github.com/GoogleCloudPlatform/container-engine-accelerators/pull/54

lsjostro avatar Feb 09 '18 12:02 lsjostro

@lsjostro I would be interested in your previous "nvidia-docker v2 + coreos" version, even if it is not cleaned up or production-ready. nvidia-docker v2 enables sharing GPUs between containers (at the cost of losing k8s scheduling), which device drivers solutions don't support (and won't for the foreseeable future).

In any case, https://github.com/GoogleCloudPlatform/container-engine-accelerators/pull/54 is useful too, thanks for that!

thomas-riccardi avatar Feb 09 '18 13:02 thomas-riccardi