vl-merging RuntimeError: Address already in use

Hi YiLin, when I tried to run the fine-tuning code on VQA (4 GPUs):

python run.py with data_root=${data_dir} num_gpus=4 num_nodes=1 task_finetune_vqa_square_randaug_base_image384_ufo \
    exp_name=ma_vqa_finetuning per_gpu_batchsize=4 batch_size=16 image_size=480 learning_rate=3e-5 \
    load_path=${load_path} log_dir=${log_dir} drop_rate=0.15 max_epoch=10 ufo

I got the following error:

ERROR - VLMo - Failed after 0:00:20!
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/cpfs/29cd2992fe666f2a/user/wangzekun/miniconda3/envs/vlmerge/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/cpfs/29cd2992fe666f2a/user/wangzekun/miniconda3/envs/vlmerge/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 138, in _wrapping_function
    results = function(*args, **kwargs)
  File "/cpfs/29cd2992fe666f2a/user/wangzekun/miniconda3/envs/vlmerge/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 621, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/cpfs/29cd2992fe666f2a/user/wangzekun/miniconda3/envs/vlmerge/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 994, in _run
    self.strategy.setup_environment()
  File "/cpfs/29cd2992fe666f2a/user/wangzekun/miniconda3/envs/vlmerge/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 138, in setup_environment
    self.setup_distributed()
  File "/cpfs/29cd2992fe666f2a/user/wangzekun/miniconda3/envs/vlmerge/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 174, in setup_distributed
    _init_dist_connection(
  File "/cpfs/29cd2992fe666f2a/user/wangzekun/miniconda3/envs/vlmerge/lib/python3.8/site-packages/lightning_lite/utilities/distributed.py", line 237, in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "/cpfs/29cd2992fe666f2a/user/wangzekun/miniconda3/envs/vlmerge/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/cpfs/29cd2992fe666f2a/user/wangzekun/miniconda3/envs/vlmerge/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 229, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/cpfs/29cd2992fe666f2a/user/wangzekun/miniconda3/envs/vlmerge/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 157, in _create_c10d_store
    return TCPStore(
RuntimeError: Address already in use

I'm sure no other process is using the port. So is the problem in my config or in the code?

Oct 19 '23 07:10 Aris-z

Err... the problem is that I'm using an H800 (sm_90), so I can't use CUDA unless I upgrade PyTorch to 2.0.0+. But if I do, there will be compatibility problems between that PyTorch version and your code (PyTorch 1.10.1)... What should I do? Thanks.

Oct 19 '23 11:10 Aris-z

Could you try running the code directly with PyTorch 2.0+ and see if there are any errors?

Nov 12 '23 18:11 ylsung
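
For anyone hitting the same `Address already in use` error: the traceback goes through `_env_rendezvous_handler`, which takes the rendezvous address and port from the `MASTER_ADDR` and `MASTER_PORT` environment variables. Below is a minimal sketch, using only the Python standard library, for double-checking whether a given port can actually be bound and for asking the OS for an unused one. The helper names `port_is_free` and `find_free_port` are hypothetical and not part of this repository, and whether `run.py` honors a `MASTER_PORT` set before launch is an assumption that would need to be verified.

```python
import os
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: something else already holds the port.
            return False
        return True


def find_free_port(host="127.0.0.1"):
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    # Report on the port currently configured in the environment, if any.
    configured = os.environ.get("MASTER_PORT")
    if configured is not None:
        state = "free" if port_is_free(int(configured)) else "in use"
        print(f"MASTER_PORT={configured} is {state}")
    # Suggest a port that was free at the time of the check.
    print(f"suggested free port: {find_free_port()}")
```

A port reported as free can still be taken by another process before training starts, so this is a diagnostic aid rather than a guarantee. If the configured port shows up as in use, a leftover worker from an earlier crashed run on the same node is one possible cause worth ruling out.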