Why use torch.multiprocessing.spawn for distributed training

Open hw-ju opened this issue 2 years ago • 9 comments

Hi there,

In the Swin UNETR scripts, e.g., https://github.com/Project-MONAI/research-contributions/blob/main/SwinUNETR/BRATS21/main.py, torch.multiprocessing.spawn is used to launch distributed training. Is there any reason you didn't use torch.distributed.launch? Did torch.multiprocessing.spawn give better performance than torch.distributed.launch for the BraTS/BTCV-based Swin UNETR training?
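
For context, the spawn-based launch I'm referring to follows roughly this pattern (a simplified sketch of the general mp.spawn idiom, not the exact code or arguments of main.py):

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def main_worker(gpu, world_size):
    # each spawned process initializes its own process group and picks its GPU
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:23456",
        world_size=world_size,
        rank=gpu,
    )
    torch.cuda.set_device(gpu)
    # ... build the model, wrap it in DistributedDataParallel, run the training loop ...
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    # spawn one worker process per GPU; the process index is passed as the first argument
    mp.spawn(main_worker, nprocs=world_size, args=(world_size,))
```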

Thanks!

hw-ju avatar Aug 14 '22 17:08 hw-ju

Hi @tangy5 ,

Could you please help share more information?

Thanks in advance.

Nic-Ma avatar Aug 23 '22 08:08 Nic-Ma

Hi @hw-ju, SwinUNETR multi-GPU training has been tested with both the DDP launcher and mp.spawn. Both work well, and we saw no performance difference between the two multi-GPU launch methods. You can safely use DDP via torch.distributed.launch. Thank you!
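
For comparison, a torch.distributed.launch / torchrun style entry point reads the rank from environment variables set by the launcher instead of spawning the worker processes itself. A rough sketch (illustrative only, not the exact script):

```python
# launched with e.g.:  torchrun --nproc_per_node=8 main.py
# (or the older:  python -m torch.distributed.launch --use_env --nproc_per_node=8 main.py)
import os

import torch
import torch.distributed as dist


def main():
    # the launcher sets RANK, WORLD_SIZE and LOCAL_RANK in the environment,
    # so the script just reads them instead of spawning processes itself
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(local_rank)
    # ... build the model, wrap it in DistributedDataParallel, run the training loop ...
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```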

tangy5 avatar Sep 02 '22 17:09 tangy5

@tangy5 Thanks for the clarification!

hw-ju avatar Sep 02 '22 19:09 hw-ju

@tangy5 Hi, thanks for your great work. Could you please give some hints about an issue? Each training step takes more time when I run the model in distributed mode. What do you think?

Jamshidhsp avatar Dec 08 '22 14:12 Jamshidhsp

Thanks, happy to help. Can you provide more details or logs of the issue? Is the problem that training in distributed mode takes more time than on a single GPU?

tangy5 avatar Dec 08 '22 15:12 tangy5

Yes, that's the issue. I run the pretraining stage with the same command as mentioned (batch size = 1); a single GPU runs faster than multiple GPUs. The single GPU is utilized at 100%, but the multiple GPUs never reach full utilization.

Jamshidhsp avatar Dec 08 '22 16:12 Jamshidhsp

GPU utilization is another story. The utilization percentages are not directly comparable, since there is a synchronization step within each minibatch in distributed mode. But overall, multi-GPU training should take less time over the entire dataset, because the effective batch size is N x bs. If that is not the case, there is a problem. Please paste some logs if DDP training takes longer than single-GPU training over the entire dataset.
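
To make the step count concrete: with a DistributedSampler each rank only iterates over its own shard of the dataset, so the number of steps per epoch shrinks as GPUs are added. A minimal, self-contained sketch with a dummy dataset (illustrative numbers, not the SwinUNETR pipeline):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# dummy dataset standing in for the real training data (1000 samples, illustrative only)
train_dataset = TensorDataset(torch.zeros(1000, 1))

world_size, rank, batch_size = 2, 0, 1
sampler = DistributedSampler(train_dataset, num_replicas=world_size, rank=rank)
loader = DataLoader(train_dataset, batch_size=batch_size, sampler=sampler)

# each rank sees ~len(dataset) / world_size samples, so steps per epoch drop accordingly:
print(len(loader))  # 500 instead of the 1000 steps a single GPU would run with batch_size=1
# each step also pays a gradient all-reduce cost, which is why per-step time and GPU
# utilization are not directly comparable to the single-GPU run
```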

tangy5 avatar Dec 08 '22 16:12 tangy5

Thank you for the clarification. Here are the initial logs.

Single GPU, batch_size=1: [training log screenshot]

2 GPUs, batch_size=2: [training log screenshot]

Multi-GPU training keeps taking longer as the number of GPUs increases, and it is even worse when running with batch_size=1 on multiple GPUs.

Jamshidhsp avatar Dec 08 '22 16:12 Jamshidhsp

I mean, yes: when training on a single GPU the batch size is 1, and on 2 GPUs the batch size is 2, so each step/iteration is expected to take longer, but it should take less than 2x the single-GPU step time. You can see that 2-GPU training is faster here overall, just not exactly 2x faster; it's about 1.7x faster.
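
A quick back-of-the-envelope check (step times are hypothetical; only the ~1.7x figure comes from your logs):

```python
# hypothetical per-step times, chosen only to illustrate the arithmetic
single_gpu_step = 1.0            # s per step, batch_size=1 -> 1 sample/s
two_gpu_step = 2 / 1.7           # ~1.18 s per step, global batch_size=2 -> ~1.7 samples/s

speedup = (2 / two_gpu_step) / (1 / single_gpu_step)
print(round(speedup, 2))         # ~1.7x higher throughput even though each step is slower
```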

tangy5 avatar Dec 08 '22 17:12 tangy5