vl-merging
RuntimeError: Address already in use
Hi YiLin, when I tried to run the fine-tuning code on VQA (4 GPUs):
python run.py with data_root=${data_dir} num_gpus=4 num_nodes=1 task_finetune_vqa_square_randaug_base_image384_ufo \
exp_name=ma_vqa_finetuning per_gpu_batchsize=4 batch_size=16 image_size=480 learning_rate=3e-5 \
load_path=${load_path} log_dir=${log_dir} drop_rate=0.15 max_epoch=10 ufo
I got this error:
ERROR - VLMo - Failed after 0:00:20!
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/cpfs/29cd2992fe666f2a/user/wangzekun/miniconda3/envs/vlmerge/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/cpfs/29cd2992fe666f2a/user/wangzekun/miniconda3/envs/vlmerge/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 138, in _wrapping_function
    results = function(*args, **kwargs)
  File "/cpfs/29cd2992fe666f2a/user/wangzekun/miniconda3/envs/vlmerge/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 621, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/cpfs/29cd2992fe666f2a/user/wangzekun/miniconda3/envs/vlmerge/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 994, in _run
    self.strategy.setup_environment()
  File "/cpfs/29cd2992fe666f2a/user/wangzekun/miniconda3/envs/vlmerge/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 138, in setup_environment
    self.setup_distributed()
  File "/cpfs/29cd2992fe666f2a/user/wangzekun/miniconda3/envs/vlmerge/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 174, in setup_distributed
    _init_dist_connection(
  File "/cpfs/29cd2992fe666f2a/user/wangzekun/miniconda3/envs/vlmerge/lib/python3.8/site-packages/lightning_lite/utilities/distributed.py", line 237, in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "/cpfs/29cd2992fe666f2a/user/wangzekun/miniconda3/envs/vlmerge/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/cpfs/29cd2992fe666f2a/user/wangzekun/miniconda3/envs/vlmerge/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 229, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/cpfs/29cd2992fe666f2a/user/wangzekun/miniconda3/envs/vlmerge/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 157, in _create_c10d_store
    return TCPStore(
RuntimeError: Address already in use
I'm sure no other process is using the port. Is the problem in my config or in the code?
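For anyone hitting the same traceback: the failure occurs in `_env_rendezvous_handler`, which takes the rendezvous address and port from the `MASTER_ADDR` and `MASTER_PORT` environment variables. One way to double-check the "port is free" assumption is to try binding to the port directly. The snippet below is a minimal sketch using only the Python standard library; `port_is_free` and `find_free_port` are hypothetical helper names, not part of PyTorch or this repository, and exporting the chosen port as `MASTER_PORT` before launching is an assumption about how this setup selects its port.

```python
import os
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: something already holds the port.
            return False
        return True


def find_free_port(host="127.0.0.1"):
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    # MASTER_PORT may be unset; in that case only suggest a free port.
    current = os.environ.get("MASTER_PORT")
    if current is not None:
        print(f"MASTER_PORT={current} free: {port_is_free(int(current))}")
    # Assumption: exporting this value as MASTER_PORT before the run
    # makes the rendezvous use it.
    print(f"suggested free port: {find_free_port()}")
```

Note that a port reported as free can still be taken by another process before training starts, so this only narrows down the cause rather than ruling it out.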
Err... the real issue is that I'm using H800 GPUs (sm_90), so I can't use CUDA unless I update PyTorch to 2.0.0+. But if I do, there will be a compatibility problem between my PyTorch version and your code (PyTorch 1.10.1)... What should I do? Thanks.
Could you try running the code directly with PyTorch 2.0+ and see whether any errors occur?