Error distributed run
Hi, thanks for the easy-to-follow tutorial on distributed processing. I followed your example and it works fine on a single multi-GPU system, but when I run it on multiple nodes with 2 GPUs each, I get an error at runtime.
```
Traceback (most recent call last):
  File "conv_dist.py", line 117, in
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/work/codebase/torch_dist/conv_dist.py", line 74, in train
    model = DDP(model, device_ids=[gpu])
  File "/dine2/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 285, in __init__
    self.broadcast_bucket_size)
  File "/dine2/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 496, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1591914838379/work/torch/lib/c10d/ProcessGroupNCCL.cpp:514, unhandled system error, NCCL version 2.4.8
```
I'm not able to figure out the cause of the error.
Please help, thanks.
Setting NCCL_SOCKET_IFNAME solved this issue for me.
What value did you set it to?
I set it to my machine's network interface name.
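For anyone hitting the same error: NCCL_SOCKET_IFNAME tells NCCL which network interface to use for inter-node communication, and it has to be set on every node before the process group is initialized. Below is a minimal sketch, assuming the interface is called `eth0` (check `ip addr` or `ifconfig` on your nodes for the real name); the master address and port are placeholders and must match your launch setup.

```python
import os
import torch.distributed as dist

# Tell NCCL which interface to use for cross-node traffic.
# "eth0" is an assumption -- replace it with the interface name
# reported by `ip addr` / `ifconfig` on your machines.
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"

def init_distributed(rank, world_size):
    # Hypothetical master address/port; use the values from your own cluster.
    os.environ.setdefault("MASTER_ADDR", "10.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
```

Equivalently, you can export it in the shell on each node before launching the script, e.g. `export NCCL_SOCKET_IFNAME=eth0`.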