LMFlow
Running app.py reports that the port is occupied, but the port is not in use
- Restarting with stat
[2023-04-20 08:22:35,933] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[W socket.cpp:426] [c10d] The server socket has failed to bind to [::]:10086 (errno: 98 - Address already in use).
[W socket.cpp:426] [c10d] The server socket has failed to bind to 0.0.0.0:10086 (errno: 98 - Address already in use).
[E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
  File "/LMFlow/service/app.py", line 34, in
    model = AutoModel.get_model(model_args, tune_strategy='none', ds_config=ds_config)
  File "/LMFlow/src/lmflow/models/auto_model.py", line 16, in get_model
    return HFDecoderModel(model_args, *args, **kwargs)
  File "/LMFlow/src/lmflow/models/hf_decoder_model.py", line 237, in init
    deepspeed.init_distributed()
  File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 656, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
  File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 36, in init
    self.init_process_group(backend, timeout, init_method, rank, world_size)
  File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 40, in init_process_group
    torch.distributed.init_process_group(backend,
  File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 888, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 245, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 176, in _create_c10d_store
    return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:10086 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:10086 (errno: 98 - Address already in use).
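As the error message states, the port is in use: the traceback shows the distributed TCPStore failing to bind to port 10086 (errno 98, "Address already in use"). Please try a different master port. For example: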
deepspeed --num_gpus=1 --master_port=11000 examples/finetune.py
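If it is unclear which port to pick, a quick check from Python can help. The snippet below is only a sketch, not part of LMFlow or DeepSpeed: `port_is_free` and `find_free_port` are hypothetical helper names, and the check assumes that a plain TCP bind on the given host is a fair proxy for what the distributed backend will attempt.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically errno 98 (EADDRINUSE) on Linux when the port is taken.
            return False
        return True


def find_free_port(host: str = "0.0.0.0") -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    # 10086 is the port from the log above.
    print("10086 free:", port_is_free(10086))
    print("suggested master port:", find_free_port())
```

The suggested port can then be passed to `--master_port`. Another process could still claim the port between the check and the launch, so treat the result as a hint rather than a guarantee.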
This issue has been marked as stale because it has not had recent activity. If you think this still needs to be addressed please feel free to reopen this issue. Thanks