Why use torch.multiprocessing.spawn for distributed training

Open hw-ju opened this issue 2 years ago • 9 comments

Hi there,

In the Swin UNETR scripts, e.g., https://github.com/Project-MONAI/research-contributions/blob/main/SwinUNETR/BRATS21/main.py, torch.multiprocessing.spawn is used to launch distributed training. Is there any reason you didn't use torch.distributed.launch? Did torch.multiprocessing.spawn give better performance than torch.distributed.launch for the BraTS/BTCV-based Swin UNETR training?
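
For context, the spawn-based launch I'm referring to follows roughly this pattern (a simplified sketch of the general mp.spawn idiom, not the exact code or arguments of main.py):

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def main_worker(gpu, world_size):
    # each spawned process initializes its own process group and picks its GPU
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:23456",
        world_size=world_size,
        rank=gpu,
    )
    torch.cuda.set_device(gpu)
    # ... build the model, wrap it in DistributedDataParallel, run the training loop ...
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    # spawn one worker process per GPU; the process index is passed as the first argument
    mp.spawn(main_worker, nprocs=world_size, args=(world_size,))
```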

Thanks!

hw-ju avatar Aug 14 '22 17:08 hw-ju

Hi @tangy5 ,

Could you please help share more information?

Thanks in advance.

Nic-Ma avatar Aug 23 '22 08:08 Nic-Ma

Hi @hw-ju, SwinUNETR multi-GPU training has been tested with both the DDP launcher and mp.spawn. Both work well, and we saw no performance difference between the two multi-GPU launch methods. You can safely use DDP via torch.distributed.launch. Thank you!
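
For comparison, a torch.distributed.launch / torchrun style entry point reads the rank from environment variables set by the launcher instead of spawning the worker processes itself. A rough sketch (illustrative only, not the exact script):

```python
# launched with e.g.:  torchrun --nproc_per_node=8 main.py
# (or the older:  python -m torch.distributed.launch --use_env --nproc_per_node=8 main.py)
import os

import torch
import torch.distributed as dist


def main():
    # the launcher sets RANK, WORLD_SIZE and LOCAL_RANK in the environment,
    # so the script just reads them instead of spawning processes itself
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(local_rank)
    # ... build the model, wrap it in DistributedDataParallel, run the training loop ...
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```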

tangy5 avatar Sep 02 '22 17:09 tangy5

@tangy5 Thanks for the clarification!

hw-ju avatar Sep 02 '22 19:09 hw-ju

@tangy5 Hi, thanks for your great work. Could you please give some hints about an issue? Each training step takes more time when I run the model in distributed mode. What do you think?

Jamshidhsp avatar Dec 08 '22 14:12 Jamshidhsp

Thanks, happy to help. Can you provide more details or logs of the issue? Is the problem that training in distributed mode takes more time than on a single GPU?

tangy5 avatar Dec 08 '22 15:12 tangy5

Yes, that's the issue. I run the pretraining stage with the same command as mentioned (batch size = 1); a single GPU runs faster than multiple GPUs. The single GPU is utilized at 100%, but the multiple GPUs never reach full utilization.

Jamshidhsp avatar Dec 08 '22 16:12 Jamshidhsp

GPU utilization is another story. The utilization percentages are not directly comparable, since there is a synchronization step within each minibatch in distributed mode. But overall, multi-GPU training should take less time over the entire dataset, because the effective batch size is N x bs. If that is not the case, there is a problem. Please paste some logs if DDP training takes longer than single-GPU training over the entire dataset.
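
To make the step count concrete: with a DistributedSampler each rank only iterates over its own shard of the dataset, so the number of steps per epoch shrinks as GPUs are added. A minimal, self-contained sketch with a dummy dataset (illustrative numbers, not the SwinUNETR pipeline):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# dummy dataset standing in for the real training data (1000 samples, illustrative only)
train_dataset = TensorDataset(torch.zeros(1000, 1))

world_size, rank, batch_size = 2, 0, 1
sampler = DistributedSampler(train_dataset, num_replicas=world_size, rank=rank)
loader = DataLoader(train_dataset, batch_size=batch_size, sampler=sampler)

# each rank sees ~len(dataset) / world_size samples, so steps per epoch drop accordingly:
print(len(loader))  # 500 instead of the 1000 steps a single GPU would run with batch_size=1
# each step also pays a gradient all-reduce cost, which is why per-step time and GPU
# utilization are not directly comparable to the single-GPU run
```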

tangy5 avatar Dec 08 '22 16:12 tangy5

Thank you for the clarification. Here are the initial logs.

Single GPU, batch_size=1: [training log screenshot]

2 GPUs, batch_size=2: [training log screenshot]

Multi-GPU training keeps taking longer as the number of GPUs increases, and it is even worse when running with batch_size=1 on multiple GPUs.

Jamshidhsp avatar Dec 08 '22 16:12 Jamshidhsp

I mean, yes: when training on a single GPU the batch size is 1, and on 2 GPUs the batch size is 2, so each step/iteration is expected to take longer, but it should take less than 2x the single-GPU step time. You can see that 2-GPU training is faster here overall, just not exactly 2x faster; it's about 1.7x faster.
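
A quick back-of-the-envelope check (step times are hypothetical; only the ~1.7x figure comes from your logs):

```python
# hypothetical per-step times, chosen only to illustrate the arithmetic
single_gpu_step = 1.0            # s per step, batch_size=1 -> 1 sample/s
two_gpu_step = 2 / 1.7           # ~1.18 s per step, global batch_size=2 -> ~1.7 samples/s

speedup = (2 / two_gpu_step) / (1 / single_gpu_step)
print(round(speedup, 2))         # ~1.7x higher throughput even though each step is slower
```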

tangy5 avatar Dec 08 '22 17:12 tangy5