clearml-agent
Splitting GPUs across multiple trains agents on the same machine
Hi,
I am trying to run two trains-agents in daemon docker mode on a 4-GPU machine and split the GPU allocation, 2 GPUs for each.
I get the following error:
docker: Error response from daemon: cannot set both Count and DeviceIDs on device request.
The command I'm running is:
trains-agent daemon --queue <NAME> --docker "<ECR> -v /root:/root --ipc=host --privileged" --gpus 0,1
(I am using an internal docker image which is based on one of Nvidia's images)
I also tried:
- using the CUDA_VISIBLE_DEVICES flag
- passing -e NVIDIA_VISIBLE_DEVICES=0,1 in the docker command
- dropping the --privileged flag
but I either get the same error or all 4 GPUs end up allocated to the agent.
Hi @bomri
Seems like a quoting issue running nvidia-docker, see here
What OS / docker / nvidia driver versions do you have on the machine running the trains-agent?
Could you try quoting the GPU list (i.e. --gpus "0,1" instead of --gpus 0,1)?
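For example, a sketch of the same command with the GPU list quoted (keeping your <ECR> image and <NAME> queue placeholders as-is):
trains-agent daemon --queue <NAME> --docker "<ECR> -v /root:/root --ipc=host --privileged" --gpus "0,1"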
Do you see the agent registered as "myname:gpu0,1" ?
BTW: --docker sets the default docker image + arguments; note that these arguments will not be used if an experiment specifies its own. If you always want to add arguments to the docker execution, use extra_docker_arguments in your trains.conf (see the sketch below).
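A minimal sketch of such a trains.conf entry, reusing the mounts/flags from the command above as example arguments (adjust to your setup):
agent {
    # always appended to the docker run command, for every experiment
    extra_docker_arguments: ["--ipc=host", "-v", "/root:/root"]
}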
Hello, I have a similar problem.
I have used trains-agent daemon --detached --gpus 0 --queue default --docker nvcr.io/nvidia/pytorch:20.08-py3
to successfully create an agent called "username:gpu0". Now, I would like to:
- create a new agent/worker on the same GPU.
- create a new queue (some custom name) on a specific agent.
I fail at both.
I get: trains_agent: ERROR: Instance with the same WORKER_ID [username:gpu0] is already running
I feel I'm missing a fundamental step/concept. How can this be achieved?
Thanks a lot in advance!
[EDIT: If it makes any difference, we are trying to run HPO with Optuna on a 2-GPU machine, and therefore trying to distribute the tasks across different/same GPUs.]
Hi @majdzr
Yes, by default trains-agent will tell you there is already an agent with the same name/GPU, which makes sense as sharing a GPU is risky. Basically, the compute resource can easily be shared, but memory allocation cannot, meaning if one process/agent allocates all the GPU memory, the second process/agent will fail on GPU memory allocation.
All that aside, you can still achieve what you are after:
TRAINS_WORKER_ID=username:gpu0a trains-agent daemon --detached --gpus 0 --queue secondary --create-queue --docker nvcr.io/nvidia/pytorch:20.08-py3
Notes:
- --create-queue will make sure that if a queue named "secondary" does not exist, it will be created.
- TRAINS_WORKER_ID allows you to override the agent's unique ID, thus allowing you to spin two agents on the same GPU.
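For the 2-GPU Optuna setup mentioned above, a sketch of running one agent per GPU with distinct worker IDs and queues (the names here are just examples) could be:

# agent on GPU 0, pulling from the default queue
TRAINS_WORKER_ID=username:gpu0 trains-agent daemon --detached --gpus 0 --queue default --docker nvcr.io/nvidia/pytorch:20.08-py3

# agent on GPU 1, pulling from a new "secondary" queue
TRAINS_WORKER_ID=username:gpu1 trains-agent daemon --detached --gpus 1 --queue secondary --create-queue --docker nvcr.io/nvidia/pytorch:20.08-py3

Enqueue the HPO trials to either queue and each agent will execute its tasks on its own GPU.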