distributed_tutorial
Error with distributed mp
Hi, I tried running my code following your example, and I got this error:
File "artGAN512_impre_v8.py", line 286, in main
mp.spawn(train, nprocs=args.gpus, args=(args,))
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn
while not spawn_context.join():
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 114, in join
raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/ubuntu/dcgan/artGAN512_impre_v8.py", line 167, in train
world_size=args.world_size, rank=rank)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 406, in init_process_group
store, rank, world_size = next(rendezvous(url))
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 143, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon)
RuntimeError: Connection timed out
Under my train function, I have:
rank = args.nr * args.gpus + gpu
dist.init_process_group(backend='nccl', init_method='env://',
                        world_size=args.world_size, rank=rank)
torch.manual_seed(0)
torch.cuda.set_device(gpu)
I think it has something to do with os.environ['MASTER_ADDR']. Can you explain how you chose the value for that parameter? I'm using an AWS instance.
Thanks.
@jyu-theartofml From the PyTorch tutorial: MASTER_ADDR is the address of the rank 0 node, and MASTER_PORT is a free port on that machine.
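In practice that means setting both environment variables in main() before calling mp.spawn, so that init_method='env://' can find them. Below is a minimal sketch (not the tutorial's exact code) that assumes the same args fields as your snippet (args.nodes, args.gpus, args.nr) and a placeholder address: on a single AWS instance 127.0.0.1 is enough, while with several instances you would use the private IP of the rank 0 node and make sure the security group allows traffic on MASTER_PORT (otherwise the TCPStore rendezvous times out exactly as in your traceback).

import os
import argparse
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def train(gpu, args):
    # Global rank = node index * GPUs per node + local GPU index.
    rank = args.nr * args.gpus + gpu
    # init_method='env://' reads MASTER_ADDR / MASTER_PORT set in main().
    dist.init_process_group(backend='nccl', init_method='env://',
                            world_size=args.world_size, rank=rank)
    torch.cuda.set_device(gpu)
    # ... build the model, wrap it in DistributedDataParallel, train ...

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--nodes', default=1, type=int)
    parser.add_argument('--gpus', default=1, type=int, help='GPUs per node')
    parser.add_argument('--nr', default=0, type=int, help='index of this node')
    args = parser.parse_args()
    args.world_size = args.gpus * args.nodes

    # MASTER_ADDR must be reachable from every node. 127.0.0.1 only works for
    # a single machine; for multi-node training, use the rank 0 node's private
    # IP and open MASTER_PORT in the AWS security group.
    os.environ['MASTER_ADDR'] = '127.0.0.1'  # placeholder address
    os.environ['MASTER_PORT'] = '8888'       # any free port on the rank 0 node

    mp.spawn(train, nprocs=args.gpus, args=(args,))

if __name__ == '__main__':
    main()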