Error distributed run
Hi, thanks for the easy-to-follow tutorial on distributed processing. I followed your example and it works fine on a single multi-GPU system, but when I run it on multiple nodes with 2 GPUs each, I get an error at runtime.
```
Traceback (most recent call last):
  File "conv_dist.py", line 117, in
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/work/codebase/torch_dist/conv_dist.py", line 74, in train
    model = DDP(model, device_ids=[gpu])
  File "/dine2/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 285, in __init__
    self.broadcast_bucket_size)
  File "/dine2/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 496, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1591914838379/work/torch/lib/c10d/ProcessGroupNCCL.cpp:514, unhandled system error, NCCL version 2.4.8
```
I'm not able to figure out the cause of the error.
Please help, thanks.
Setting NCCL_SOCKET_IFNAME solved this issue for me.
What value did you set it to?
I set it to my machine's network interface name.
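For anyone hitting the same error: NCCL_SOCKET_IFNAME tells NCCL which network interface to use for inter-node communication, and it has to be set on every node before the process group is initialized. Below is a minimal sketch, assuming the interface is called `eth0` (check `ip addr` or `ifconfig` on your nodes for the real name); the master address and port are placeholders and must match your launch setup.

```python
import os
import torch.distributed as dist

# Tell NCCL which interface to use for cross-node traffic.
# "eth0" is an assumption -- replace it with the interface name
# reported by `ip addr` / `ifconfig` on your machines.
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"

def init_distributed(rank, world_size):
    # Hypothetical master address/port; use the values from your own cluster.
    os.environ.setdefault("MASTER_ADDR", "10.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
```

Equivalently, you can export it in the shell on each node before launching the script, e.g. `export NCCL_SOCKET_IFNAME=eth0`.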