
Support --gpus argument for Docker 19+

Open GuillaumeLeclerc opened this issue 5 years ago • 15 comments

The CLI for running Docker containers with GPUs has changed since Docker version 19. On this line https://github.com/twosixlabs/armory/blob/d275056fa1b8b6c478047a0b1bd3d0a1a14fc73f/armory/eval/evaluator.py#L36 the --runtime=nvidia argument is used. This has been deprecated, and the correct argument from version 19 onwards is --gpus all (all can be replaced by a list of GPUs).

GuillaumeLeclerc avatar Feb 14 '20 14:02 GuillaumeLeclerc

Pending https://github.com/docker/docker-py/pull/2471

seanpmorgan avatar Feb 17 '20 16:02 seanpmorgan

Seeing as the docker-py PR has stagnated, we need to determine the instructions for installing the "nvidia" runtime on Docker 19 so that method can be used.

seanpmorgan avatar Mar 20 '20 18:03 seanpmorgan

@GuillaumeLeclerc Apologies for the slow turnaround on this. It looks as though the runtime argument was deprecated prematurely, since there is no support for the --gpus argument in docker-py or docker-compose (both heavily used methods for launching Docker containers).

You should be able to utilize the runtime argument on Docker 19+ as long as it is installed and configured in the daemon configuration file:

Method 1: Install nvidia-docker2 package https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)#ubuntu-distributions-1

Method 2: Install the container runtime: https://github.com/NVIDIA/nvidia-container-runtime#ubuntu-distributions

Modify the config file: https://github.com/NVIDIA/nvidia-container-runtime#daemon-configuration-file

Please see this docker-compose issue for more details: https://github.com/docker/compose/issues/6691

Please let us know if you're unable to install the runtime from either of these methods.
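To illustrate the daemon configuration step, here is a minimal sketch (not Armory code) that checks whether a daemon configuration registers an "nvidia" runtime. The function name has_nvidia_runtime is hypothetical, and the sample JSON follows the layout described in the nvidia-container-runtime README linked above; on most Linux installs the file lives at /etc/docker/daemon.json, which is an assumption about your system.

```python
import json

# Sample daemon configuration that registers the "nvidia" runtime.
SAMPLE_DAEMON_JSON = """
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
"""


def has_nvidia_runtime(daemon_config_text):
    """Return True if the daemon config text registers an "nvidia" runtime with a path."""
    config = json.loads(daemon_config_text)
    runtime = config.get("runtimes", {}).get("nvidia")
    return bool(runtime and runtime.get("path"))


if __name__ == "__main__":
    print(has_nvidia_runtime(SAMPLE_DAEMON_JSON))
```

This only inspects the configuration text. It does not confirm that the executable at the configured path actually exists or that the daemon was restarted after the change.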

seanpmorgan avatar Mar 25 '20 20:03 seanpmorgan

Method 2 worked for me. I don't know if you want to close this issue or not.

GuillaumeLeclerc avatar Apr 16 '20 03:04 GuillaumeLeclerc

> Method 2 worked for me. I don't know if you want to close this issue or not.

Glad to hear. We'll leave it open because we want to switch to the new method as soon as we're able to. Thanks for reporting.

seanpmorgan avatar Apr 16 '20 03:04 seanpmorgan

https://github.com/docker/docker-py/pull/2471 has been approved. Once this merges we can update armory accordingly.

seanpmorgan avatar Jul 15 '20 16:07 seanpmorgan

Hello, the docker-py PR has been merged. Is there any plan to update this dependency?

Lodour avatar Mar 16 '21 22:03 Lodour

I encountered the same issue. I hope this will be fixed soon @davidslater

mzweilin avatar Jan 13 '22 06:01 mzweilin

@mzweilin can you provide a stack trace for the issue you have?

I am unable to reproduce:

(anaconda3)david.slater@noether:~$ python
Python 3.8.8 (default, Apr 13 2021, 19:58:26) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import docker
>>> docker.__version__
'5.0.0'
>>> exit()
(anaconda3)david.slater@noether:~$ docker --version
Docker version 20.10.2, build 20.10.2-0ubuntu1~20.04.3
(anaconda3)david.slater@noether:~$ cd armory
(anaconda3)david.slater@noether:~/armory$ armory run scenario_configs/mnist_baseline.json --gpus=7
2022-01-13 16:35:51 noether armory.__main__[1032683] INFO --gpus field specified. Setting --use-gpu to True
2022-01-13 16:35:53 noether armory.docker.management[1032683] INFO ARMORY Instance 1130f7b52e created.
2022-01-13 16:35:53 noether armory.eval.evaluator[1032683] INFO Running evaluation script
...

It works fine and uses the GPU.

davidslater avatar Jan 13 '22 16:01 davidslater

@davidslater I think the root cause of my issue is that the Lambda Stack (https://lambdalabs.com/lambda-stack-deep-learning-software), which we use on our Lambda workstation, removed the nvidia-container-runtime package in a recent update, because nvidia-container-toolkit is now used instead to hook the runc runtime in Docker 19+. There's no need for an explicit nvidia Docker runtime any more, but legacy software still relies on the nvidia-container-runtime interface. In the past I had configured an nvidia Docker runtime that Armory's docker-py could use, but the executable "/usr/bin/nvidia-container-runtime" was deleted after the update (it could also be the result of me occasionally running sudo apt-get autoremove), which produced the following error message:

$ armory run --use-gpu --gpus=0 some_scenario.json
2022-01-13 14:56:17 lambda-dual-wx armory.eval.evaluator[30396] ERROR Starting instance failed.
Traceback (most recent call last):
  File "/home/weilinxu/Envs/armory_13/lib/python3.7/site-packages/docker/api/client.py", line 268, in _raise_for_status
    response.raise_for_status()
  File "/home/weilinxu/Envs/armory_13/lib/python3.7/site-packages/requests/models.py", line 943, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: http+docker://localhost/v1.41/containers/ce041c27e02069e6163ae8c403829877c9a24fdd644e26c1bd4b114fc25b6944/start

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/weilinxu/coder/armory/armory/eval/evaluator.py", line 214, in run
    envs=self.extra_env_vars, ports=ports, user=self.get_id(),
  File "/home/weilinxu/coder/armory/armory/docker/management.py", line 132, in start_armory_instance
    self.name, runtime=self.runtime, envs=envs, ports=ports, user=user,
  File "/home/weilinxu/coder/armory/armory/docker/management.py", line 68, in __init__
    image_name, **container_args
  File "/home/weilinxu/Envs/armory_13/lib/python3.7/site-packages/docker/models/containers.py", line 818, in run
    container.start()
  File "/home/weilinxu/Envs/armory_13/lib/python3.7/site-packages/docker/models/containers.py", line 404, in start
    return self.client.api.start(self.id, **kwargs)
  File "/home/weilinxu/Envs/armory_13/lib/python3.7/site-packages/docker/utils/decorators.py", line 19, in wrapped
    return f(self, resource_id, *args, **kwargs)
  File "/home/weilinxu/Envs/armory_13/lib/python3.7/site-packages/docker/api/container.py", line 1111, in start
    self._raise_for_status(res)
  File "/home/weilinxu/Envs/armory_13/lib/python3.7/site-packages/docker/api/client.py", line 270, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/home/weilinxu/Envs/armory_13/lib/python3.7/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.APIError: 400 Client Error for http+docker://localhost/v1.41/containers/ce041c27e02069e6163ae8c403829877c9a24fdd644e26c1bd4b114fc25b6944/start: Bad Request ("failed to create shim: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/moby/ce041c27e02069e6163ae8c403829877c9a24fdd644e26c1bd4b114fc25b6944/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown")
2022-01-13 14:56:17 lambda-dual-wx armory.eval.evaluator[30396] ERROR Is Docker Daemon running?
2022-01-13 14:56:17 lambda-dual-wx armory.eval.evaluator[30396] INFO Deleting tmp_dir /home/weilinxu/.armory/tmp/2022-01-13T225617.481084
2022-01-13 14:56:17 lambda-dual-wx armory.eval.evaluator[30396] INFO Removing output_dir /home/weilinxu/coder/review_gard/results/2022-01-13T225617.481084 if empty

It's lengthy, but the key message is /usr/bin/nvidia-container-runtime: no such file or directory: unknown.

The solution is to install nvidia-container-runtime back to my workstation, as is suggested by @seanpmorgan in https://github.com/twosixlabs/armory/issues/157#issuecomment-604069980
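Since the log above ends with the generic "Is Docker Daemon running?" message, a small helper could surface the real cause. The following is only a sketch: explain_start_failure is a hypothetical name, not an Armory or docker-py function, and it simply matches the text seen in the traceback above.

```python
def explain_start_failure(explanation):
    """Return a hint if the error text points at a missing nvidia runtime binary, else None."""
    marker = "nvidia-container-runtime: no such file or directory"
    if marker in explanation:
        return (
            "The 'nvidia' runtime is configured, but its executable is missing. "
            "Reinstalling nvidia-container-runtime may resolve this."
        )
    return None


if __name__ == "__main__":
    sample = "fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown"
    print(explain_start_failure(sample))
```

In practice the input would be the message of the docker.errors.APIError raised when the container fails to start.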

mzweilin avatar Jan 13 '22 23:01 mzweilin

So in Docker 19+, you would just use runtime "runc" and the GPU would just work?

Do you know if there is an easy way to check for nvidia-container-toolkit?

davidslater avatar Jan 13 '22 23:01 davidslater

> So in Docker 19+, you would just use runtime "runc" and the GPU would just work?

Yes, if you have nvidia-container-toolkit and you specify something like --gpus all to the Docker CLI.

> Do you know if there is an easy way to check for nvidia-container-toolkit?

Both nvidia-container-runtime and nvidia-container-toolkit are executables that can be located with which, for example which nvidia-container-toolkit.
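The same check can be done from Python with shutil.which. This is a minimal sketch; detect_nvidia_tooling is a hypothetical helper name, and the which parameter exists only so the lookup can be swapped out for testing.

```python
import shutil


def detect_nvidia_tooling(which=shutil.which):
    """Map each NVIDIA container executable name to its path, or None if not on PATH."""
    names = ("nvidia-container-runtime", "nvidia-container-toolkit")
    return {name: which(name) for name in names}


if __name__ == "__main__":
    for name, path in detect_nvidia_tooling().items():
        print(name, "->", path or "not found")
```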

If a system has nvidia-container-runtime (and the nvidia runtime is configured correctly), the current Armory should run flawlessly.

If a system has nvidia-container-toolkit but not nvidia-container-runtime, the runtime=nvidia parameter won't work. We should instead use the device_requests argument of client.containers.run() to expose GPUs, as in https://github.com/docker/docker-py/pull/2471. Basically, Armory would need to do some argument translation.
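A rough sketch of that argument translation follows. It is not Armory's implementation: gpu_run_kwargs and to_device_requests are hypothetical names, and treating the GPU string as either "all" or a comma-separated list of device IDs is an assumption based on how --gpus is used in this thread. The only docker-py pieces relied on are the runtime and device_requests keyword arguments of client.containers.run() and docker.types.DeviceRequest.

```python
import shutil


def gpu_run_kwargs(gpus="all", which=shutil.which):
    """Choose keyword arguments for client.containers.run() based on installed tooling."""
    if which("nvidia-container-runtime"):
        # Legacy path: the configured "nvidia" runtime handles GPU access.
        # GPU selection for this path is left to the caller in this sketch.
        return {"runtime": "nvidia"}
    if which("nvidia-container-toolkit"):
        if gpus == "all":
            spec = {"count": -1, "capabilities": [["gpu"]]}  # -1 requests all GPUs
        else:
            ids = [g.strip() for g in gpus.split(",")]
            spec = {"device_ids": ids, "capabilities": [["gpu"]]}
        return {"device_requests": [spec]}
    raise RuntimeError("neither nvidia-container-runtime nor nvidia-container-toolkit found")


def to_device_requests(specs):
    """Convert plain spec dicts into docker.types.DeviceRequest objects."""
    from docker.types import DeviceRequest  # imported lazily; requires docker-py

    return [DeviceRequest(**spec) for spec in specs]
```

A caller would convert any "device_requests" entry with to_device_requests() and then pass the result along, e.g. client.containers.run(image, **kwargs). The device_requests path needs a docker-py release that includes the PR linked above.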

mzweilin avatar Jan 14 '22 06:01 mzweilin

@davidslater I am looking into this, but might need a bit of discussion. Can we chat today?

shenshaw26 avatar Apr 26 '22 15:04 shenshaw26

Ok, so here are my notes so far:

First, let's show that the old way (nvidia-container-runtime=2.0.0+docker18.03.1-1 and docker-ce=5:18.09.9~3-0~ubuntu-bionic, aka Docker 18) works:

sudo apt-get remove docker docker-engine docker.io containerd runc nvidia-container-runtime nvidia-container-toolkit -y
sudo apt autoremove -y
sudo apt-get install nvidia-container-runtime=2.0.0+docker18.03.1-1 -y
sudo apt-get install docker-ce=5:18.09.9~3-0~ubuntu-bionic docker-ce-cli=5:18.09.9~3-0~ubuntu-bionic containerd.io docker-compose-plugin -y
armory run cifar10_baseline.json --use-gpu --gpu all --check

Now, let's try it with a fresh Docker 19 instance and show that it works:

sudo apt-get remove docker docker-engine docker.io containerd runc nvidia-container-runtime nvidia-container-toolkit -y
sudo apt autoremove -y
sudo apt-get install nvidia-container-runtime=2.0.0+docker18.03.1-1 -y
sudo apt-get install docker-ce=5:19.03.15~3-0~ubuntu-bionic docker-ce-cli=5:19.03.15~3-0~ubuntu-bionic containerd.io docker-compose-plugin -y
armory run cifar10_baseline.json --use-gpu --gpu all --check

If you upgrade Docker to version 19 and update nvidia-container-runtime to the most recent version, it also works:

sudo apt-get remove docker docker-engine docker.io containerd runc nvidia-container-runtime nvidia-container-toolkit -y
sudo apt autoremove -y
sudo apt-get install nvidia-container-runtime -y
sudo apt-get install docker-ce=5:19.03.15~3-0~ubuntu-bionic docker-ce-cli=5:19.03.15~3-0~ubuntu-bionic containerd.io docker-compose-plugin -y
armory run cifar10_baseline.json --use-gpu --gpu all --check

So the only way I can get this to fail is to explicitly remove the nvidia-container-runtime package (or the /usr/bin/nvidia-container-runtime wrapper):

sudo apt-get remove docker docker-engine docker.io containerd runc nvidia-container-runtime nvidia-container-toolkit -y
sudo apt autoremove -y
sudo apt-get install docker-ce=5:18.09.9~3-0~ubuntu-bionic docker-ce-cli=5:18.09.9~3-0~ubuntu-bionic containerd.io docker-compose-plugin -y
armory run cifar10_baseline.json --use-gpu --gpu all --check

But in this case even docker run hello-world fails, so I would say that this is just an incorrect installation of Docker.

If I then go and install nvidia-container-toolkit, it works again:

sudo apt-get install nvidia-container-toolkit
docker run hello-world
armory run cifar10_baseline.json --use-gpu --gpu all --check

Note: this installed nvidia-container-toolkit/bionic,now 1.9.0-1 amd64 [installed], which also creates /usr/bin/nvidia-container-runtime.

Doing it fresh with Docker v18 also works:

sudo apt-get remove docker docker-engine docker.io containerd runc nvidia-container-runtime nvidia-container-toolkit -y
sudo apt autoremove -y
sudo apt-get install nvidia-container-toolkit=1.9.0-1 -y
sudo apt-get install docker-ce=5:18.09.9~3-0~ubuntu-bionic docker-ce-cli=5:18.09.9~3-0~ubuntu-bionic containerd.io docker-compose-plugin -y
armory run cifar10_baseline.json --use-gpu --gpu all --check

And with Docker v19 it also works:

sudo apt-get remove docker docker-engine docker.io containerd runc nvidia-container-runtime nvidia-container-toolkit -y
sudo apt autoremove -y
sudo apt-get install nvidia-container-toolkit=1.9.0-1 -y
sudo apt-get install docker-ce=5:19.03.15~3-0~ubuntu-bionic docker-ce-cli=5:19.03.15~3-0~ubuntu-bionic containerd.io docker-compose-plugin -y
armory run cifar10_baseline.json --use-gpu --gpu all --check

So, in the end, I don't think there is an issue. @davidslater, am I missing something?

shenshaw26 avatar Apr 26 '22 21:04 shenshaw26

@mzweilin can you weigh in here?

davidslater avatar Apr 27 '22 00:04 davidslater