
One question about DDP

cswwp opened this issue Nov 21 '20 · 2 comments

@richardkxu Nice repo. One question: is there a difference between Single node, multiple GPUs with torch.distributed.launch (①) and Single node, multiple GPUs with multi-processes (②), or are they equivalent and just two different ways of launching?

[screenshots of the two setups, ① and ②]

cswwp avatar Nov 21 '20 15:11 cswwp

The main difference is which distributed training library you use. The 1st one uses the NVIDIA Apex library. The 2nd one uses torch.nn.parallel.DistributedDataParallel. The 1st one gives better performance and works better with NVIDIA GPUs. It has also become the default approach in newer versions of PyTorch (> 1.6.0). Hope this is helpful!

richardkxu avatar Nov 21 '20 17:11 richardkxu
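
For anyone landing here later, a minimal sketch of what the two single-node, multi-GPU setups typically look like with the native torch.nn.parallel.DistributedDataParallel wrapper. The script name, model, and hyperparameters are placeholders, not taken from this repo; the Apex-based variant mainly swaps in apex.parallel.DistributedDataParallel (and Apex AMP for mixed precision).

```python
# Minimal sketch of both single-node, multi-GPU launch styles with native DDP.
# Script name, model, and hyperparameters are placeholders, not from this repo.
import argparse
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def train(rank, world_size):
    # One process per GPU; both launch styles end up running this function.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(10, 10).cuda(rank)
    model = DDP(model, device_ids=[rank])  # gradients are all-reduced across ranks

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(10):
        inputs = torch.randn(32, 10, device=rank)
        loss = model(inputs).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # ① torch.distributed.launch passes --local_rank to every copy of the script:
    #    python -m torch.distributed.launch --nproc_per_node=4 ddp_example.py
    parser.add_argument("--local_rank", type=int, default=-1)
    args = parser.parse_args()

    world_size = torch.cuda.device_count()
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    if args.local_rank >= 0:
        # ① the launcher already created one process per GPU
        train(args.local_rank, world_size)
    else:
        # ② plain `python ddp_example.py`: spawn the worker processes ourselves
        mp.spawn(train, args=(world_size,), nprocs=world_size)
```

The practical difference between the two launch styles is who creates the per-GPU processes: the launcher utility (①) or the script itself via mp.spawn (②); the DDP training loop is the same either way.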

Thank you, very helpful

cswwp avatar Nov 22 '20 07:11 cswwp