
One question about DDP

cswwp opened this issue Nov 21 '20 · 2 comments

@richardkxu Nice repo. One question: is there a difference between Single node, multiple GPUs with torch.distributed.launch (①) and Single node, multiple GPUs with multi-processes (②), or are they equivalent and just two different ways of launching?

[screenshots of the two setups, ① and ②]

cswwp avatar Nov 21 '20 15:11 cswwp

The main difference is which distributed training library you use. The 1st one uses the NVIDIA Apex library. The 2nd one uses torch.nn.parallel.DistributedDataParallel. The 1st one gives better performance and works better with NVIDIA GPUs. It has also become the default approach in newer versions of PyTorch (> 1.6.0). Hope this is helpful!

richardkxu avatar Nov 21 '20 17:11 richardkxu
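
For anyone landing here later, a minimal sketch of what the two single-node, multi-GPU setups typically look like with the native torch.nn.parallel.DistributedDataParallel wrapper. The script name, model, and hyperparameters are placeholders, not taken from this repo; the Apex-based variant mainly swaps in apex.parallel.DistributedDataParallel (and Apex AMP for mixed precision).

```python
# Minimal sketch of both single-node, multi-GPU launch styles with native DDP.
# Script name, model, and hyperparameters are placeholders, not from this repo.
import argparse
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def train(rank, world_size):
    # One process per GPU; both launch styles end up running this function.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(10, 10).cuda(rank)
    model = DDP(model, device_ids=[rank])  # gradients are all-reduced across ranks

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(10):
        inputs = torch.randn(32, 10, device=rank)
        loss = model(inputs).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # ① torch.distributed.launch passes --local_rank to every copy of the script:
    #    python -m torch.distributed.launch --nproc_per_node=4 ddp_example.py
    parser.add_argument("--local_rank", type=int, default=-1)
    args = parser.parse_args()

    world_size = torch.cuda.device_count()
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    if args.local_rank >= 0:
        # ① the launcher already created one process per GPU
        train(args.local_rank, world_size)
    else:
        # ② plain `python ddp_example.py`: spawn the worker processes ourselves
        mp.spawn(train, args=(world_size,), nprocs=world_size)
```

The practical difference between the two launch styles is who creates the per-GPU processes: the launcher utility (①) or the script itself via mp.spawn (②); the DDP training loop is the same either way.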

Thank you, very helpful

cswwp avatar Nov 22 '20 07:11 cswwp