
Splitting GPUs across multiple trains agents on the same machine

bomri opened this issue 4 years ago • 3 comments

Hi,

I am trying to run two trains agents in daemon docker mode on a 4-GPU machine and split the GPU allocation, 2 for each. I get the following error:

docker: Error response from daemon: cannot set both Count and DeviceIDs on device request

The command I'm running is:

trains-agent daemon --queue <NAME> --docker "<ECR> -v /root:/root --ipc=host --privileged" --gpus 0,1

(I am using an internal docker image which is based on one of Nvidia's images)

I also tried using:

  • the CUDA_VISIBLE_DEVICES environment variable
  • -e NVIDIA_VISIBLE_DEVICES=0,1 in the docker cmd
  • not using the --privileged flag

But I either get the same error or all 4 GPUs are allocated to the agent.

bomri (Oct 22 '20 09:10)

Hi @bomri

Seems like a quoting issue running nvidia-docker, see here

What OS / docker / nvidia driver versions do you have on the machine running the trains-agent? Could you try quoting the GPU list (i.e. --gpus "0,1" instead of --gpus 0,1)? Do you see the agent registered as "myname:gpu0,1"?
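For reference, a rough, untested sketch of what the two daemon commands could look like with the quoting applied, splitting the 4 GPUs 2/2 (reusing the <NAME> queue and <ECR> image placeholders from the command above):

trains-agent daemon --queue <NAME> --docker "<ECR> -v /root:/root --ipc=host --privileged" --gpus "0,1"
trains-agent daemon --queue <NAME> --docker "<ECR> -v /root:/root --ipc=host --privileged" --gpus "2,3"

If the quoting fix works, the two agents should register following the same naming pattern, e.g. "myname:gpu0,1" and "myname:gpu2,3".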

BTW: --docker sets the default docker image + arguments; note that these arguments will not be used if an experiment specifies its own. If you always want to add arguments to the docker execution, use extra_docker_arguments in your trains.conf.
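As a rough sketch, the relevant part of trains.conf would look something like this (the mount and --ipc=host are just example values taken from your command; extra_docker_arguments takes a list of individual arguments):

agent {
    # arguments appended to every docker run the agent launches
    extra_docker_arguments: ["-v", "/root:/root", "--ipc=host"]
}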

bmartinn (Oct 22 '20 15:10)

Hello, I have a similar problem.

I have used trains-agent daemon --detached --gpus 0 --queue default --docker nvcr.io/nvidia/pytorch:20.08-py3 to successfully create an agent called "username:gpu0". Now, I would like to:

  1. create a new agent/worker on the same GPU.
  2. create a new queue (some custom name) on a specific agent.

I fail at both and get:

trains_agent: ERROR: Instance with the same WORKER_ID [username:gpu0] is already running

I feel I'm missing a fundamental step/concept. How can this be achieved?

Thanks a lot in advance!

[EDIT: If it makes any difference, we are trying to run HPO with Optuna on a 2-GPU machine, and therefore trying to distribute the tasks across different/same GPUs.]

majdzr (Nov 23 '20 12:11)

Hi @majdzr

Yes, by default trains-agent will tell you there is already an agent with the same name/GPU, which makes sense, as sharing a GPU is risky: the compute resource can easily be shared, but memory cannot, meaning that if one process/agent allocates all the GPU memory, the second process/agent will fail on GPU memory allocation.

All that aside, you can still achieve what you are after:

TRAINS_WORKER_ID=username:gpu0a trains-agent daemon --detached --gpus 0 --queue secondary --create-queue --docker nvcr.io/nvidia/pytorch:20.08-py3

Notes:

  • --create-queue makes sure that if a queue named "secondary" does not exist, it will be created.
  • TRAINS_WORKER_ID lets you override the agent's unique ID, allowing you to spin up two agents on the same GPU (see the sketch below).
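Putting it together for the 2-GPU machine mentioned above, a rough (untested) sketch could be one agent per GPU on the default queue plus a second agent sharing GPU 0 on the new "secondary" queue (worker and queue names are just examples):

trains-agent daemon --detached --gpus 0 --queue default --docker nvcr.io/nvidia/pytorch:20.08-py3
trains-agent daemon --detached --gpus 1 --queue default --docker nvcr.io/nvidia/pytorch:20.08-py3
TRAINS_WORKER_ID=username:gpu0a trains-agent daemon --detached --gpus 0 --queue secondary --create-queue --docker nvcr.io/nvidia/pytorch:20.08-py3

Just keep the memory caveat above in mind if two agents end up running jobs on GPU 0 at the same time.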

bmartinn (Nov 23 '20 23:11)