ray_lightning
num-workers clarification
Hi,
I am trying to use ray-lightning with Tune in a distributed environment, but I am not sure how to define `num_workers`. Could you please help me clarify that? Let's say I have 2 machines, each with 2 GPUs, and I want to run each trial on 12 CPUs and 2 GPUs. What would be the correct definition of `num_workers`, `num_cpus_per_worker`, and `use_gpu`?
Thanks!
Hey Roger, the number of GPUs in data parallel training with PyTorch Lightning is defined by the number of workers you're starting. Thus, you would use 2 workers here, with 1 GPU and 6 CPUs each:
`num_workers=2`, `num_cpus_per_worker=6`, `use_gpu=True`
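For reference, a minimal sketch of how those settings would be passed to the plugin (assuming the `RayPlugin` API from `ray_lightning`; newer releases rename it `RayStrategy`):

```python
import pytorch_lightning as pl
from ray_lightning import RayPlugin  # RayStrategy in newer releases

# 2 workers -> 2-way data parallel training, with 6 CPUs and 1 GPU per worker
plugin = RayPlugin(
    num_workers=2,
    num_cpus_per_worker=6,
    use_gpu=True,
)

trainer = pl.Trainer(plugins=[plugin], max_epochs=10)
```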
Thanks @krfricke! So in this case, would EACH trial use 6 CPUs and 1 GPU, or would each trial use 12 CPUs and 2 GPUs?
@rogertrullo in this case each trial runs one data parallel training run, and that training run itself is parallelized across 2 workers.
So it would look like this:
- Trial 1: 12 CPUs, 2 GPUs
  - Worker 1: 6 CPUs, 1 GPU
  - Worker 2: 6 CPUs, 1 GPU
- Trial 2: 12 CPUs, 2 GPUs
  - Worker 1: 6 CPUs, 1 GPU
  - Worker 2: 6 CPUs, 1 GPU
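To have Tune actually reserve those 12 CPUs and 2 GPUs for each trial, something along these lines should work. This is a sketch assuming the `get_tune_resources` and `TuneReportCallback` helpers from `ray_lightning.tune`; `MyLightningModule` is a placeholder for your own module:

```python
import pytorch_lightning as pl
from ray import tune
from ray_lightning import RayPlugin
from ray_lightning.tune import TuneReportCallback, get_tune_resources

def train_fn(config):
    model = MyLightningModule(config)  # placeholder for your LightningModule
    trainer = pl.Trainer(
        max_epochs=10,
        callbacks=[TuneReportCallback({"loss": "val_loss"}, on="validation_end")],
        # 2 workers, each with 6 CPUs and 1 GPU
        plugins=[RayPlugin(num_workers=2, num_cpus_per_worker=6, use_gpu=True)],
    )
    trainer.fit(model)

analysis = tune.run(
    train_fn,
    # reserve 12 CPUs + 2 GPUs per trial, matching the plugin settings above
    resources_per_trial=get_tune_resources(
        num_workers=2, num_cpus_per_worker=6, use_gpu=True
    ),
    num_samples=4,
)
```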