Hi, I am training PandaGPT on 8 V100 GPUs. When I run ./scripts/train.sh, I get the following error:
Traceback (most recent call last):
  File "user/test_panda/PandaGPT/code/train_sft.py", line 97, in <module>
    main(**args)
  File "user/test_panda/PandaGPT/code/train_sft.py", line 55, in main
    config_env(args)
  File "user/test_panda/PandaGPT/code/train_sft.py", line 45, in config_env
    initialize_distributed(args)
  File "user/test_panda/PandaGPT/code/train_sft.py", line 29, in initialize_distributed
    deepspeed.init_distributed(dist_backend='nccl')
  File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 624, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
  File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 60, in __init__
    self.init_process_group(backend, timeout, init_method, rank, world_size)
  File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 86, in init_process_group
    torch.distributed.init_process_group(backend,
  File "/home/tiger/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 754, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/tiger/.local/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/home/tiger/.local/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 177, in _create_c10d_store
    return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:28457 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:28457 (errno: 98 - Address already in use).
[2023-08-10 15:50:31,171] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15743
[2023-08-10 15:50:31,172] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15744
[2023-08-10 15:50:31,180] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15745
[2023-08-10 15:50:31,187] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15746
[2023-08-10 15:50:31,239] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15747
[2023-08-10 15:50:31,291] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15748
[2023-08-10 15:50:31,344] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15749
[2023-08-10 15:50:31,396] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15750
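The RuntimeError above says that port 28457 could not be bound because something on the machine already holds it (errno 98). A minimal sketch for checking this independently of DeepSpeed is below. The helper names port_is_free and find_free_port are hypothetical, not part of PandaGPT, DeepSpeed, or PyTorch, and the sketch only probes IPv4 on the local host:

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently bind to (host, port).

    Hypothetical helper: it only tests the bind step that failed in the
    traceback and does not start any distributed job.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically errno 98 (EADDRINUSE) on Linux when the port is taken.
            return False
        return True


def find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("", 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    # 28457 is the port named in the error message above.
    print("port 28457 free:", port_is_free(28457))
    print("a currently unused port:", find_free_port())
```

If the check reports that the port is taken, a likely cause (an assumption, not confirmed by the log) is a leftover process from an earlier run or another job on the same machine. In that case the options are to stop that process or to launch with a different rendezvous port, for example through the DeepSpeed launcher's --master_port option if ./scripts/train.sh sets the port that way.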
Do you have any idea how to solve this? Thank you so much!