Why use torch.multiprocessing.spawn for distributed training
Hi there,
In the Swin UNETR scripts, e.g., https://github.com/Project-MONAI/research-contributions/blob/main/SwinUNETR/BRATS21/main.py, torch.multiprocessing.spawn is used for launching distributed training. Any reason why you didn't use torch.distributed.launch? Did torch.multiprocessing.spawn give better performance than torch.distributed.launch for BraTS/BTCV-based Swin UNETR training?
Thanks!
Hi @tangy5 ,
Could you please share more information?
Thanks in advance.
Hi @hw-ju , SwinUNETR has been tested for multi-GPU training with both DDP and mp.spawn. Both work well, and there is no performance preference between the two multi-GPU frameworks, so you can safely use DDP. Thank you!
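For reference, a minimal sketch of how the same DDP worker can be driven by either launcher (not the repository's code; the worker function, port, and placeholder model below are illustrative):

```python
# Minimal sketch, assuming a DDP training script; the worker function, port,
# and model below are placeholders, not the repository's actual code.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank: int, world_size: int) -> None:
    # When launched via mp.spawn, the script sets up rendezvous itself;
    # torch.distributed.launch would instead pass rank/world size via env vars.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(8, 2).cuda(rank)   # placeholder for the real model
    model = DDP(model, device_ids=[rank])
    # ... training loop goes here ...

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    # Option A: spawn one process per GPU from within the script.
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
    # Option B: launch the same kind of worker from the shell instead, e.g.
    #   python -m torch.distributed.launch --nproc_per_node=<N> main.py
    # and read rank/world size from the environment inside the script.
```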
@tangy5 Thanks for the clarification!
@tangy5 Hi, thanks for your great work. Could you please give some hints about an issue? Each training step takes more time when I run the model in distributed mode. What do you think?
Thanks, happy to help. Can you provide more details or logs of the issue? Is the issue that, when training in distributed mode, it takes more time than on a single GPU?
Yes, that's the issue. I run the pretraining stage using the same command as mentioned (batch size = 1), and a single GPU runs faster than multiple GPUs. The single GPU is utilized at 100%, but the multiple GPUs don't reach full utilization.
GPU utilization is another story. The utilization percentages aren't directly comparable, since there is a synchronization step within each minibatch when training in distributed mode. Overall, though, multi-GPU training should take less time to go through the entire dataset, since the effective batch size is N x bs. If that is not the case, there is an issue. Please paste some logs if the DDP training takes longer than single-GPU training for the entire dataset.
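As a back-of-the-envelope illustration of that point (all numbers below are made up, not taken from any logs):

```python
# Illustrative arithmetic only; the dataset size and step times are made up.
dataset_size = 1000      # samples per epoch
per_gpu_batch = 1
num_gpus = 2

single_gpu_step = 1.0    # seconds per step on 1 GPU
multi_gpu_step = 1.2     # seconds per step on 2 GPUs, including gradient sync

steps_1gpu = dataset_size / per_gpu_batch
steps_2gpu = dataset_size / (num_gpus * per_gpu_batch)

print(steps_1gpu * single_gpu_step)   # 1000 s for the epoch on 1 GPU
print(steps_2gpu * multi_gpu_step)    # 600 s for the epoch on 2 GPUs
```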
Thank you for the clarification. Here are the initial logs:
single GPU, batch_size=1
2 GPUs, batch_size=2
Multi-GPU training keeps taking longer as the number of GPUs increases. It gets even worse when running with batch_size=1 on multiple GPUs.
I mean, yes: when training with a single GPU the batch size is 1, and when training on 2 GPUs the batch size is 2, so each step/iteration is expected to take longer, but less than 2x the single-GPU step time. You can see that 2-GPU training is faster here, just not exactly 2x faster; it's about 1.7x faster.
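In throughput terms, with made-up step times just to illustrate where a figure like ~1.7x can come from:

```python
# Made-up step times, only to show how a per-step slowdown can still yield
# roughly a 1.7x overall speedup once the global batch size doubles.
step_time_1gpu = 1.00                      # s per step, global batch size 1
step_time_2gpu = 1.18                      # s per step, global batch size 2

throughput_1gpu = 1 / step_time_1gpu       # samples per second
throughput_2gpu = 2 / step_time_2gpu       # samples per second

print(throughput_2gpu / throughput_1gpu)   # ~1.69x
```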