
num-workers clarification

Open rogertrullo opened this issue 2 years ago • 3 comments

Hi, I am trying to use ray-lightning with Tune in a distributed environment, but I am not sure how to define `--num-workers`. Could you please help me clarify that? Let's say I have 2 machines, each with 2 GPUs, and I want to run each trial on 12 CPUs and 2 GPUs. What would be the correct definition of `num_workers`, `num_cpus_per_worker`, and `use_gpu`? Thanks!

rogertrullo avatar Mar 28 '22 19:03 rogertrullo

Hey Roger, the number of GPUs used in data-parallel training with PyTorch Lightning is determined by the number of workers you start. So here you would use 2 workers, each with 1 GPU and 6 CPUs:

`num_workers=2, num_cpus_per_worker=6, use_gpu=True`

krfricke avatar Mar 28 '22 20:03 krfricke
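
For reference, here is a minimal sketch of how those arguments map onto ray_lightning's `RayPlugin` and a PyTorch Lightning `Trainer`. This assumes the plugin-based API that was current around the time of this issue (newer releases may use a different class name), and the model setup is left out:

```python
# Minimal sketch, assuming ray_lightning's RayPlugin API (check your installed version).
import pytorch_lightning as pl
from ray_lightning import RayPlugin

# One data-parallel training run, split across 2 Ray workers.
# Each worker reserves 6 CPUs and 1 GPU.
plugin = RayPlugin(num_workers=2, num_cpus_per_worker=6, use_gpu=True)

trainer = pl.Trainer(max_epochs=10, plugins=[plugin])
# trainer.fit(model)  # `model` is your LightningModule (not shown here)
```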

Thanks @krfricke! So in this case, would EACH trial use 6 CPUs and 1 GPU, or would each trial use 12 CPUs and 2 GPUs?

rogertrullo avatar Mar 29 '22 05:03 rogertrullo

@rogertrullo in this case each trial runs one data-parallel training run, and that training run itself is parallelized across 2 workers.

So it would look like this:

  • Trial 1: 12 CPUs, 2 GPUs
    • Worker 1: 6 CPUs, 1 GPU
    • Worker 2: 6 CPUs, 1 GPU
  • Trial 2: 12 CPUs, 2 GPUs
    • Worker 1: 6 CPUs, 1 GPU
    • Worker 2: 6 CPUs, 1 GPU

amogkam avatar Apr 08 '22 19:04 amogkam
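
To make each trial actually reserve that 12 CPU / 2 GPU footprint, Tune needs to be told about the worker bundles. A minimal sketch of the Tune side is below, assuming ray_lightning's `get_tune_resources` helper mirrors the plugin arguments (verify the exact name and signature for your installed version); the training function and search space are placeholders:

```python
# Minimal sketch, assuming ray_lightning's get_tune_resources helper
# (check the exact signature in your installed version).
from ray import tune
from ray_lightning.tune import get_tune_resources

def train_fn(config):
    # Build the LightningModule and a Trainer with the RayPlugin shown
    # earlier in this thread, then call trainer.fit(model).
    ...

analysis = tune.run(
    train_fn,
    config={"lr": tune.loguniform(1e-4, 1e-1)},  # hypothetical search space
    num_samples=2,
    # Reserves the worker bundles (2 workers x 6 CPUs, 1 GPU each) for every
    # trial, matching the per-trial breakdown above.
    resources_per_trial=get_tune_resources(
        num_workers=2, num_cpus_per_worker=6, use_gpu=True),
)
```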