armory
Support --gpus argument for Docker 19+
The CLI for running Docker containers with GPUs changed in Docker 19. On this line, https://github.com/twosixlabs/armory/blob/d275056fa1b8b6c478047a0b1bd3d0a1a14fc73f/armory/eval/evaluator.py#L36, the --runtime=nvidia argument is used. This has been deprecated; the correct argument from version 19 onwards is --gpus all (all can be replaced by a list of GPUs).
Pending https://github.com/docker/docker-py/pull/2471
Seeing as the docker-py PR has stagnated, we need to determine the instructions for installing the "nvidia" runtime on Docker 19 so that method can be used.
@GuillaumeLeclerc Apologies for the slow turn around on this. It looks as though the runtime argument was deprecated prematurely since there is no support for --gpus argument in docker-py or docker-compose (both heavily used methods for launching docker containers).
You should be able to utilize the runtime argument on Docker 19+ as long as it is installed and configured in the daemon configuration file:
Method 1: Install nvidia-docker2 package https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)#ubuntu-distributions-1
Method 2: Install the container runtime: https://github.com/NVIDIA/nvidia-container-runtime#ubuntu-distributions
Modify the config file: https://github.com/NVIDIA/nvidia-container-runtime#daemon-configuration-file
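As a rough illustration of the daemon configuration step, here is a minimal Python sketch that merges an nvidia runtime entry into an existing daemon.json dictionary. The runtime path is the default install location mentioned later in this thread and may differ on your system; the `with_nvidia_runtime` helper and the sample existing settings are hypothetical.

```python
import json


def with_nvidia_runtime(config: dict) -> dict:
    """Return a copy of a daemon.json dict with the nvidia runtime registered."""
    merged = dict(config)
    runtimes = dict(merged.get("runtimes", {}))
    # Assumed default install path; adjust if the binary lives elsewhere.
    runtimes["nvidia"] = {
        "path": "/usr/bin/nvidia-container-runtime",
        "runtimeArgs": [],
    }
    merged["runtimes"] = runtimes
    return merged


if __name__ == "__main__":
    existing = {"log-driver": "json-file"}  # hypothetical existing settings
    print(json.dumps(with_nvidia_runtime(existing), indent=4))
```

The printed JSON would typically go in /etc/docker/daemon.json, followed by a restart of the Docker daemon so the runtime is picked up.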
Please see this docker-compose issue for more details: https://github.com/docker/compose/issues/6691
Please let us know if you're unable to install the runtime from either of these methods.
Method 2 worked for me. I don't know if you want to close this issue or not.
Glad to hear. We'll leave it open because we want to switch to the new method as soon as we're able to. Thanks for reporting.
https://github.com/docker/docker-py/pull/2471 has been approved. Once this merges we can update armory accordingly.
Hello, the docker-py PR has been merged. Is there any plan to update this dependency?
I encountered the same issue. I hope this will be fixed soon @davidslater
@mzweilin can you provide a stack trace for the issue you have?
I am unable to reproduce:
(anaconda3)david.slater@noether:~$ python
Python 3.8.8 (default, Apr 13 2021, 19:58:26)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import docker
>>> docker.__version__
'5.0.0'
>>> exit()
(anaconda3)david.slater@noether:~$ docker --version
Docker version 20.10.2, build 20.10.2-0ubuntu1~20.04.3
(anaconda3)david.slater@noether:~$ cd armory
(anaconda3)david.slater@noether:~/armory$ armory run scenario_configs/mnist_baseline.json --gpus=7
2022-01-13 16:35:51 noether armory.__main__[1032683] INFO --gpus field specified. Setting --use-gpu to True
2022-01-13 16:35:53 noether armory.docker.management[1032683] INFO ARMORY Instance 1130f7b52e created.
2022-01-13 16:35:53 noether armory.eval.evaluator[1032683] INFO Running evaluation script
...
It works fine and uses the GPU.
@davidslater
I think the root cause of my issue is that the Lambda Stack (https://lambdalabs.com/lambda-stack-deep-learning-software), which we use on our Lambda workstation, deleted the nvidia-container-runtime package in a recent update because nvidia-container-toolkit is now used to hook the runc runtime in Docker 19+. There's no need for an explicit nvidia Docker runtime any more, but legacy software still relies on the nvidia-container-runtime interface. I had previously configured an nvidia Docker runtime that Armory's docker-py could use, but the executable "/usr/bin/nvidia-container-runtime" was deleted after the update (possibly because I occasionally run sudo apt-get autoremove), which produced the following error message:
$ armory run --use-gpu --gpus=0 some_scenario.json
2022-01-13 14:56:17 lambda-dual-wx armory.eval.evaluator[30396] ERROR Starting instance failed.
Traceback (most recent call last):
File "/home/weilinxu/Envs/armory_13/lib/python3.7/site-packages/docker/api/client.py", line 268, in _raise_for_status
response.raise_for_status()
File "/home/weilinxu/Envs/armory_13/lib/python3.7/site-packages/requests/models.py", line 943, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: http+docker://localhost/v1.41/containers/ce041c27e02069e6163ae8c403829877c9a24fdd644e26c1bd4b114fc25b6944/start
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/weilinxu/coder/armory/armory/eval/evaluator.py", line 214, in run
envs=self.extra_env_vars, ports=ports, user=self.get_id(),
File "/home/weilinxu/coder/armory/armory/docker/management.py", line 132, in start_armory_instance
self.name, runtime=self.runtime, envs=envs, ports=ports, user=user,
File "/home/weilinxu/coder/armory/armory/docker/management.py", line 68, in __init__
image_name, **container_args
File "/home/weilinxu/Envs/armory_13/lib/python3.7/site-packages/docker/models/containers.py", line 818, in run
container.start()
File "/home/weilinxu/Envs/armory_13/lib/python3.7/site-packages/docker/models/containers.py", line 404, in start
return self.client.api.start(self.id, **kwargs)
File "/home/weilinxu/Envs/armory_13/lib/python3.7/site-packages/docker/utils/decorators.py", line 19, in wrapped
return f(self, resource_id, *args, **kwargs)
File "/home/weilinxu/Envs/armory_13/lib/python3.7/site-packages/docker/api/container.py", line 1111, in start
self._raise_for_status(res)
File "/home/weilinxu/Envs/armory_13/lib/python3.7/site-packages/docker/api/client.py", line 270, in _raise_for_status
raise create_api_error_from_http_exception(e)
File "/home/weilinxu/Envs/armory_13/lib/python3.7/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
raise cls(e, response=response, explanation=explanation)
docker.errors.APIError: 400 Client Error for http+docker://localhost/v1.41/containers/ce041c27e02069e6163ae8c403829877c9a24fdd644e26c1bd4b114fc25b6944/start: Bad Request ("failed to create shim: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/moby/ce041c27e02069e6163ae8c403829877c9a24fdd644e26c1bd4b114fc25b6944/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown")
2022-01-13 14:56:17 lambda-dual-wx armory.eval.evaluator[30396] ERROR Is Docker Daemon running?
2022-01-13 14:56:17 lambda-dual-wx armory.eval.evaluator[30396] INFO Deleting tmp_dir /home/weilinxu/.armory/tmp/2022-01-13T225617.481084
2022-01-13 14:56:17 lambda-dual-wx armory.eval.evaluator[30396] INFO Removing output_dir /home/weilinxu/coder/review_gard/results/2022-01-13T225617.481084 if empty
It's lengthy, but the key message is /usr/bin/nvidia-container-runtime: no such file or directory: unknown.
The solution is to install nvidia-container-runtime back to my workstation, as is suggested by @seanpmorgan in https://github.com/twosixlabs/armory/issues/157#issuecomment-604069980
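The failure mode above (a runtime still registered with the daemon while its executable is gone) can be checked for up front. The following is only a sketch: `missing_runtime_binaries` is a hypothetical helper, and it assumes the dictionary returned by docker-py's `client.info()`, whose "Runtimes" entry maps each runtime name to a dict containing a "path".

```python
import shutil


def missing_runtime_binaries(info: dict) -> dict:
    """Map runtime name -> configured path for runtimes whose binary is absent.

    `info` is expected to look like the output of docker-py's client.info().
    """
    missing = {}
    for name, spec in info.get("Runtimes", {}).items():
        path = spec.get("path", "")
        # shutil.which resolves bare names (e.g. "runc") through PATH and also
        # accepts absolute paths; it returns None if nothing executable is found.
        # Caveat: this uses the caller's PATH, which may differ from the daemon's.
        if shutil.which(path) is None:
            missing[name] = path
    return missing
```

With docker-py installed and the daemon running, `missing_runtime_binaries(docker.from_env().info())` would report an nvidia entry in the situation described here.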
So in Docker 19+, you would just use runtime "runc" and the GPU would just work?
Do you know if there is an easy way to check for nvidia-container-toolkit?
So in Docker 19+, you would just use runtime "runc" and the GPU would just work?
Yes, if you have nvidia-container-toolkit and you specify something like --gpus all to the Docker CLI.
Do you know if there is an easy way to check for nvidia-container-toolkit?
Both nvidia-container-runtime and nvidia-container-toolkit are executables, so they can be located with which.
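The same lookup can be done from Python with the standard library, which is roughly what running which on each name does. This is a sketch only; the helper name `locate_nvidia_binaries` is hypothetical.

```python
import shutil

# Executable names taken from this thread.
NVIDIA_BINARIES = ("nvidia-container-runtime", "nvidia-container-toolkit")


def locate_nvidia_binaries() -> dict:
    """Return {binary name: resolved path, or None if it is not on PATH}."""
    return {name: shutil.which(name) for name in NVIDIA_BINARIES}


if __name__ == "__main__":
    for name, path in locate_nvidia_binaries().items():
        print(f"{name}: {path or 'not found'}")
```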
If a system has nvidia-container-runtime (and the nvidia runtime is configured correctly), the current Armory should run flawlessly.
If a system doesn't have nvidia-container-runtime but does have nvidia-container-toolkit, the runtime=nvidia parameter won't work. We should instead use the device_requests argument in client.containers.run() to expose GPUs, as in https://github.com/docker/docker-py/pull/2471. Basically, Armory would need to do some argument translation.
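A minimal sketch of what such argument translation might look like, assuming a docker-py version that includes `docker.types.DeviceRequest` (added by the PR linked above). The helpers `parse_gpus` and `gpu_run_kwargs` are hypothetical and not part of Armory, and passing the GPU list through `NVIDIA_VISIBLE_DEVICES` on the legacy path is an illustrative assumption.

```python
def parse_gpus(gpus: str):
    """Split a --gpus value such as "all" or "0,1" into (count, device_ids)."""
    if gpus == "all":
        return -1, []  # a count of -1 asks Docker for every available GPU
    return 0, [g.strip() for g in gpus.split(",") if g.strip()]


def gpu_run_kwargs(gpus: str, has_nvidia_runtime: bool) -> dict:
    """Build keyword arguments for client.containers.run() that expose GPUs."""
    if has_nvidia_runtime:
        # Legacy path, equivalent to `docker run --runtime=nvidia`.
        return {
            "runtime": "nvidia",
            "environment": {"NVIDIA_VISIBLE_DEVICES": gpus},
        }
    # Imported lazily so the legacy path works without a recent docker-py.
    from docker.types import DeviceRequest

    count, device_ids = parse_gpus(gpus)
    request = DeviceRequest(
        count=count, device_ids=device_ids, capabilities=[["gpu"]]
    )
    # Rough equivalent of `docker run --gpus ...` on Docker 19+.
    return {"device_requests": [request]}
```

A caller could then write something like `client.containers.run(image, **gpu_run_kwargs("all", has_nvidia_runtime))`, with the boolean derived from whether nvidia-container-runtime is found on the system.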
@davidslater I am looking into this, but might need a bit of discussion....can we chat today?
Ok, so here are my notes so far:
First, let's show that the old way (nvidia-container-runtime=2.0.0+docker18.03.1-1 and docker-ce=5:18.09.9~3-0~ubuntu-bionic, aka Docker 18) works:
sudo apt-get remove docker docker-engine docker.io containerd runc nvidia-container-runtime nvidia-container-toolkit -y
sudo apt autoremove -y
sudo apt-get install nvidia-container-runtime=2.0.0+docker18.03.1-1 -y
sudo apt-get install docker-ce=5:18.09.9~3-0~ubuntu-bionic docker-ce-cli=5:18.09.9~3-0~ubuntu-bionic containerd.io docker-compose-plugin -y
armory run cifar10_baseline.json --use-gpu --gpu all --check
Now, let's try it with a fresh Docker 19 install and show that it works:
sudo apt-get remove docker docker-engine docker.io containerd runc nvidia-container-runtime nvidia-container-toolkit -y
sudo apt autoremove -y
sudo apt-get install nvidia-container-runtime=2.0.0+docker18.03.1-1 -y
sudo apt-get install docker-ce=5:19.03.15~3-0~ubuntu-bionic docker-ce-cli=5:19.03.15~3-0~ubuntu-bionic containerd.io docker-compose-plugin -y
armory run cifar10_baseline.json --use-gpu --gpu all --check
If you upgrade Docker to version 19 and update nvidia-container-runtime to the most recent version, it also works:
sudo apt-get remove docker docker-engine docker.io containerd runc nvidia-container-runtime nvidia-container-toolkit -y
sudo apt autoremove -y
sudo apt-get install nvidia-container-runtime -y
sudo apt-get install docker-ce=5:19.03.15~3-0~ubuntu-bionic docker-ce-cli=5:19.03.15~3-0~ubuntu-bionic containerd.io docker-compose-plugin -y
armory run cifar10_baseline.json --use-gpu --gpu all --check
So the only way I can get this to fail is to explicitly remove the nvidia-container-runtime package (or the /usr/bin/nvidia-container-runtime wrapper):
sudo apt-get remove docker docker-engine docker.io containerd runc nvidia-container-runtime nvidia-container-toolkit -y
sudo apt autoremove -y
sudo apt-get install docker-ce=5:18.09.9~3-0~ubuntu-bionic docker-ce-cli=5:18.09.9~3-0~ubuntu-bionic containerd.io docker-compose-plugin -y
armory run cifar10_baseline.json --use-gpu --gpu all --check
but in this case even docker run hello-world fails...so I would say that this is just an incorrect installation of Docker.
If I then go and install nvidia-container-toolkit, it works again:
sudo apt-get install nvidia-container-toolkit
docker run hello-world
armory run cifar10_baseline.json --use-gpu --gpu all --check
Note: this installed nvidia-container-toolkit/bionic,now 1.9.0-1 amd64 [installed], which also creates /usr/bin/nvidia-container-runtime.
Doing it fresh with Docker v18 also works:
sudo apt-get remove docker docker-engine docker.io containerd runc nvidia-container-runtime nvidia-container-toolkit -y
sudo apt autoremove -y
sudo apt-get install nvidia-container-toolkit=1.9.0-1 -y
sudo apt-get install docker-ce=5:18.09.9~3-0~ubuntu-bionic docker-ce-cli=5:18.09.9~3-0~ubuntu-bionic containerd.io docker-compose-plugin -y
armory run cifar10_baseline.json --use-gpu --gpu all --check
And with Docker v19 it also works:
sudo apt-get remove docker docker-engine docker.io containerd runc nvidia-container-runtime nvidia-container-toolkit -y
sudo apt autoremove -y
sudo apt-get install nvidia-container-toolkit=1.9.0-1 -y
sudo apt-get install docker-ce=5:19.03.15~3-0~ubuntu-bionic docker-ce-cli=5:19.03.15~3-0~ubuntu-bionic containerd.io docker-compose-plugin -y
armory run cifar10_baseline.json --use-gpu --gpu all --check
So, in the end, I don't think there is an issue. @davidslater, am I missing something?
@mzweilin can you weigh in here?