
How can multiple programs interact with it simultaneously?

Open LDY911 opened this issue 10 months ago • 0 comments

After deploying a large model locally, how can multiple programs interact with it simultaneously? I encountered an error message:

```
[W socket.cpp:464] [c10d] The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use).
[W socket.cpp:464] [c10d] The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
  File "/home/lidongyang/anaconda3/envs/llama3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/lidongyang/anaconda3/envs/llama3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/lidongyang/anaconda3/envs/llama3/lib/python3.9/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/lidongyang/anaconda3/envs/llama3/lib/python3.9/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/lidongyang/anaconda3/envs/llama3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/lidongyang/anaconda3/envs/llama3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    result = agent.run()
  File "/home/lidongyang/anaconda3/envs/llama3/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/home/lidongyang/anaconda3/envs/llama3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 727, in run
    result = self._invoke_run(role)
  File "/home/lidongyang/anaconda3/envs/llama3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 862, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/home/lidongyang/anaconda3/envs/llama3/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/home/lidongyang/anaconda3/envs/llama3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 699, in _initialize_workers
    self._rendezvous(worker_group)
  File "/home/lidongyang/anaconda3/envs/llama3/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/home/lidongyang/anaconda3/envs/llama3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 542, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "/home/lidongyang/anaconda3/envs/llama3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 55, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
```
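For context: 29500 is the default master port that `torchrun` binds for rendezvous, so `errno: 98` on `[::]:29500` suggests a second `torchrun` launch was started while an earlier one still held that port. One possible workaround, sketched below under the assumption that each program starts its own `torchrun` job, is to ask the OS for a free port and pass it via `--master_port`. The helper names and the script name `chat.py` are hypothetical and used only for illustration. Note that each launch would still load its own copy of the model.

```python
import socket
from typing import List


def find_free_port() -> int:
    """Ask the OS for a TCP port that is currently unused."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 lets the OS pick any free port
        return s.getsockname()[1]


def build_torchrun_command(script: str, port: int, nproc: int = 1) -> List[str]:
    """Build a torchrun command line that uses `port` instead of the default 29500.

    `script` is a placeholder for whatever entry point is being launched.
    """
    return [
        "torchrun",
        f"--nproc_per_node={nproc}",
        f"--master_port={port}",
        script,
    ]


if __name__ == "__main__":
    # "chat.py" is a hypothetical script name, not taken from the issue.
    cmd = build_torchrun_command("chat.py", find_free_port())
    print(" ".join(cmd))
```

The port is released before `torchrun` starts, so another process could in principle claim it in between; for a handful of local launches this is usually acceptable, but it is not a guarantee.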

LDY911 · Apr 28 '24 06:04