distributed_tutorial
Error with distributed mp
Hi, I tried running my code following your example, and I got this error:
File "artGAN512_impre_v8.py", line 286, in main
mp.spawn(train, nprocs=args.gpus, args=(args,))
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn
while not spawn_context.join():
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 114, in join
raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/ubuntu/dcgan/artGAN512_impre_v8.py", line 167, in train
world_size=args.world_size, rank=rank)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 406, in init_process_group
store, rank, world_size = next(rendezvous(url))
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 143, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon)
RuntimeError: Connection timed out
Under my train function, I have:
rank = args.nr * args.gpus + gpu
dist.init_process_group(backend='nccl', init_method='env://',
                        world_size=args.world_size, rank=rank)
torch.manual_seed(0)
torch.cuda.set_device(gpu)
I think it has something to do with os.environ['MASTER_ADDR']. Can you explain how you chose the value for that parameter? I'm using an AWS instance.
Thanks.
@jyu-theartofml From the PyTorch tutorial: MASTER_ADDR is the address of the rank 0 node, and MASTER_PORT is a free port on that machine.
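In practice that means setting both environment variables in main() before calling mp.spawn, so that init_method='env://' can find them. Below is a minimal sketch (not the tutorial's exact code) that assumes the same args fields as your snippet (args.nodes, args.gpus, args.nr) and a placeholder address: on a single AWS instance 127.0.0.1 is enough, while with several instances you would use the private IP of the rank 0 node and make sure the security group allows traffic on MASTER_PORT (otherwise the TCPStore rendezvous times out exactly as in your traceback).

import os
import argparse
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def train(gpu, args):
    # Global rank = node index * GPUs per node + local GPU index.
    rank = args.nr * args.gpus + gpu
    # init_method='env://' reads MASTER_ADDR / MASTER_PORT set in main().
    dist.init_process_group(backend='nccl', init_method='env://',
                            world_size=args.world_size, rank=rank)
    torch.cuda.set_device(gpu)
    # ... build the model, wrap it in DistributedDataParallel, train ...

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--nodes', default=1, type=int)
    parser.add_argument('--gpus', default=1, type=int, help='GPUs per node')
    parser.add_argument('--nr', default=0, type=int, help='index of this node')
    args = parser.parse_args()
    args.world_size = args.gpus * args.nodes

    # MASTER_ADDR must be reachable from every node. 127.0.0.1 only works for
    # a single machine; for multi-node training, use the rank 0 node's private
    # IP and open MASTER_PORT in the AWS security group.
    os.environ['MASTER_ADDR'] = '127.0.0.1'  # placeholder address
    os.environ['MASTER_PORT'] = '8888'       # any free port on the rank 0 node

    mp.spawn(train, nprocs=args.gpus, args=(args,))

if __name__ == '__main__':
    main()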