clearml-agent
Agent with specific GPU is spawning docker container with NVIDIA_VISIBLE_DEVICES=all
The agent was created with this command:
clearml-agent daemon --gpus 1 --queue default --docker
When a job is run by it, it executes the following command:
Executing: ['docker', 'run', '-t', '--gpus', '"device=1"', '--privileged', '-e', 'CLEARML_WORKER_ID=694d7fc1852a:gpu1', ..., 'bash', '-c', 'echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; apt-get update ; apt-get install -y git libsm6 libxext6 libxrender-dev libglib2.0-0 ; declare LOCAL_PYTHON ; for i in {10..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || apt-get install -y python3-pip ; [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pip<20.2" ; $LOCAL_PYTHON -m pip install -U clearml-agent ; cp /root/clearml.conf /root/default_clearml.conf ; NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring --id a3857511e46b4063aba159f00fde9d4a']
I would expect it to run the command with NVIDIA_VISIBLE_DEVICES set to the correct value, or even better, leave it as set by the docker runtime.
Thanks @AzaelCicero !
I would expect it to run the command with NVIDIA_VISIBLE_DEVICES set to the correct value,
Yes one would expect that but life is strange ;)
The way the nvidia docker runtime works is that inside the docker (if executed with the --gpus flag), only the selected GPUs are available.
For example, let's assume we have a dual-GPU machine, GPU_0 and GPU_1.
We run a docker on GPU_1 with docker run --gpus device=1 ..., which means only GPU_1 is assigned to the container. Inside the container we see a single GPU, but its index will be "0" (because the GPU index always starts at zero). By setting NVIDIA_VISIBLE_DEVICES=all inside the docker, we are basically telling the process it can use all the GPUs the nvidia docker runtime environment allocated for the container instance.
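A quick way to see that remapping (a sketch, assuming a dual-GPU host with the nvidia container runtime installed; the image tag follows the one used later in this thread):
# On the host, both GPUs are listed with indices 0 and 1
nvidia-smi -L
# Attach only the host's GPU_1 to a container (note: no --privileged here)
docker run --rm --gpus device=1 nvidia/cuda:11.2.0-devel nvidia-smi -L
# Inside, a single GPU is reported as "GPU 0" (with GPU_1's UUID), so
# NVIDIA_VISIBLE_DEVICES=all simply means "everything that was attached"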
Make sense ?
It makes sense and it's even in line with the docker documentation. However, it contradicts my observations on the latest Ubuntu 20.04 software stack and images (nvidia/cuda:11.2.0-devel).
Let me walk through those commands and their outputs:
╭─2021-02-25 08:26:01 kpawelczyk@AS-PC007 ~
╰─$ docker pull nvidia/cuda:11.2.0-devel
11.2.0-devel: Pulling from nvidia/cuda
Digest: sha256:68f7fbf7c6fb29340f4351c94b2309c43c98a5ffe46db1d6fa4f7c262fc223cb
Status: Image is up to date for nvidia/cuda:11.2.0-devel
docker.io/nvidia/cuda:11.2.0-devel
╭─2021-02-25 08:26:11 kpawelczyk@AS-PC007 ~
╰─$ nvidia-smi -L
GPU 0: GeForce RTX 2070 (UUID: GPU-862380e5-b56b-9042-9a3a-012482d99792)
GPU 1: GeForce RTX 2060 SUPER (UUID: GPU-6310f70f-d10e-2753-d17c-51a0fd440cf3)
╭─2021-02-25 08:26:15 kpawelczyk@AS-PC007 ~
╰─$ ll /proc/driver/nvidia/gpus
total 0
dr-xr-xr-x 2 root root 0 lut 24 08:05 0000:06:00.0
dr-xr-xr-x 2 root root 0 lut 24 08:05 0000:07:00.0
╭─2021-02-25 08:26:17 kpawelczyk@AS-PC007 ~
╰─$ docker run --privileged --gpus device=1 -it nvidia/cuda:11.2.0-devel bash
root@5d4b12d55688:/# nvidia-smi -L
GPU 0: GeForce RTX 2070 (UUID: GPU-862380e5-b56b-9042-9a3a-012482d99792)
GPU 1: GeForce RTX 2060 SUPER (UUID: GPU-6310f70f-d10e-2753-d17c-51a0fd440cf3)
root@5d4b12d55688:/# ll /proc/driver/nvidia/gpus
total 0
drwxr-xr-x 3 root root 60 Feb 25 07:27 ./
dr-xr-xr-x 3 root root 120 Feb 25 07:27 ../
dr-xr-xr-x 2 root root 0 Feb 24 07:05 0000:07:00.0/
root@5d4b12d55688:/# cat /proc/driver/nvidia/gpus/0000:07:00.0/information
Model: GeForce RTX 2060 SUPER
IRQ: 79
GPU UUID: GPU-6310f70f-d10e-2753-d17c-51a0fd440cf3
Video BIOS: 90.06.44.80.56
Bus Type: PCIe
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:07:00.0
Device Minor: 1
Blacklisted: No
root@5d4b12d55688:/# echo $NVIDIA_VISIBLE_DEVICES
1
root@5d4b12d55688:/# exit
╭─2021-02-25 08:32:36 kpawelczyk@AS-PC007 ~
╰─$ docker --version
Docker version 19.03.13, build 4484c46d9d
I didn't change anything in the Docker configuration. My assumption is that NVIDIA changed the behaviour of their docker runtime.
Why does clearml-agent set NVIDIA_VISIBLE_DEVICES=all in the command? Is there any edge case that requires such an explicit setting?
@AzaelCicero I see the issue: --privileged will cause the docker to see all the GPUs, see the issue here.
You can quickly verify: run
docker run --gpus device=1 -it nvidia/cuda:11.2.0-devel bash
then inside the docker run nvidia-smi -L (you should see a single GPU). On the contrary, if you run
docker run --privileged --gpus device=1 -it nvidia/cuda:11.2.0-devel bash
then inside the docker nvidia-smi -L will list both GPUs.
Why does clearml-agent set NVIDIA_VISIBLE_DEVICES=all in the command? Is there any edge case that requires such an explicit setting?
Good question. I think there was: dockers with leftover environment variables (I think). If NVIDIA_VISIBLE_DEVICES is not set or is set to all, it is basically the same. It also tells the agent running inside the docker that it has GPU support (it needs that information to look for the correct pytorch package based on the cuda version, for example).
Is there a reason you have the --privileged flag on?
Maybe we should detect it and set NVIDIA_VISIBLE_DEVICES accordingly?
Thanks @bmartinn for pointing out the impact of --privileged; I was not aware of this behaviour.
My environment requires --privileged. I am leveraging Docker in Docker as a clearml-agent runner image, as I am running software which is spread across multiple Docker containers. Currently I have a workaround: the first step in the run script detects which devices are available based on the content of /proc/driver/nvidia/gpus/ and corrects the value of NVIDIA_VISIBLE_DEVICES.
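For reference, a minimal shell version of that detection step (a sketch, assuming the /proc/driver/nvidia/gpus layout shown in the transcript above and GNU grep/paste):
# Collect the device minors the driver actually exposed to this container, e.g. "1" or "0,1"
NVIDIA_VISIBLE_DEVICES=$(grep -h "Device Minor" /proc/driver/nvidia/gpus/*/information | grep -o '[0-9]\+$' | paste -sd, -)
export NVIDIA_VISIBLE_DEVICES
A Python version of the same idea appears further down in this thread.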
I don't think this is a normal use case for clearml, but I like the idea of detecting --privileged mode. I think it should be easy, as it can only be introduced by an entry in the agent.default_docker.arguments setting. I will try to propose a solution if I find time to spare.
@AzaelCicero I think you are right; even though it is not a "traditional" setup, I think clearml-agent should properly handle it.
In order for the agent to pass the correct NVIDIA_VISIBLE_DEVICES (i.e. understanding that --gpus is ignored), it needs to know it is also passing the --privileged flag.
Maybe we should add a configuration / flag saying all Tasks' dockers should always run with --privileged (then we know we need to change the NVIDIA_VISIBLE_DEVICES behavior), WDYT?
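A rough sketch of the behaviour being discussed, written as a plain docker launch rather than actual clearml-agent code (the GPU id, extra arguments and image are placeholders):
GPUS="1"
EXTRA_ARGS="--privileged"
if [[ " $EXTRA_ARGS " == *" --privileged "* ]]; then
  # --gpus no longer restricts what the container sees, so pass the explicit
  # host device list for the process inside to pick up
  docker run $EXTRA_ARGS -e NVIDIA_VISIBLE_DEVICES="$GPUS" -it nvidia/cuda:11.2.0-devel bash
else
  # the runtime already limits the devices, so "all" is the correct value inside
  docker run $EXTRA_ARGS --gpus "device=$GPUS" -e NVIDIA_VISIBLE_DEVICES=all -it nvidia/cuda:11.2.0-devel bash
fi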
@bmartinn @AzaelCicero I think I'm running into a similar issue here, except within a k8s cluster: https://github.com/NVIDIA/nvidia-docker/issues/1686
Have you done anything further with any wrapper scripts to make this work smoother / as intended with the combination of a GPU limit and --privileged mode?
I'm trying to get rootless dind into the k8s cluster to get around this as well, to no avail so far.
@dcarrion87 I have ended up with a "temporary" workaround. At the start of the script run by the ClearML agent I discover which GPUs are available to the runner.
import os
import re

def discover_gpus():
    """
    Scan the /proc/driver/nvidia/gpus directory and identify the active GPU ids.
    """
    # The nvidia runtime masks this directory to the selected devices,
    # even when the container is started with --privileged.
    available = os.popen(
        'cat /proc/driver/nvidia/gpus/*/information | grep "Device Minor"'
    ).read()
    # Capture the full minor number from each "Device Minor:" line (e.g. "0", "1", "10").
    matches = re.findall(r"^Device Minor:\s*([0-9]+)", available, re.MULTILINE)
    return ",".join(matches)

os.environ["NVIDIA_VISIBLE_DEVICES"] = discover_gpus()
It is good enough, and I have not found time to propose a permanent solution.
Awesome, thanks for sharing. Not sure if this will help, but we ended up putting together something that avoids privileged mode and bundles a rootless dind, thanks to hints from other projects:
https://github.com/harrison-ai/cobalt-docker-rootless-nvidia-dind
@dcarrion87 just to be sure I understand: the issue only exists when running the clearml-agent with --privileged (because without --privileged, the visible devices inside the docker are only the ones specified with --gpus, hence "all" is the correct value).
Is that accurate?
If you define CLEARML_DOCKER_SKIP_GPUS_FLAG=1 when running the clearml-agent, it will essentially skip setting the --gpus flag when launching the docker and set the correct NVIDIA_VISIBLE_DEVICES based on the selected GPUs.
Basically try:
CLEARML_DOCKER_SKIP_GPUS_FLAG=1 clearml-agent daemon ...
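For an agent started with --gpus 1, the container launch then conceptually shifts from passing --gpus "device=1" plus NVIDIA_VISIBLE_DEVICES=all to something along these lines (a rough sketch, not the literal command the agent generates):
# --gpus is skipped; the selected GPU (host index 1) is passed via the environment instead
docker run --privileged -e NVIDIA_VISIBLE_DEVICES=1 -it nvidia/cuda:11.2.0-devel bash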
Hi @bmartinn, we're not actually using clearml; I just came across this from another similar issue. Sorry for hijacking this discussion!