After deploying a large model locally, how can multiple programs interact with it simultaneously?
I encountered an error message:
```
[W socket.cpp:464] [c10d] The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use).
[W socket.cpp:464] [c10d] The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
  File "/home/lidongyang/anaconda3/envs/llama3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/lidongyang/anaconda3/envs/llama3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/lidongyang/anaconda3/envs/llama3/lib/python3.9/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/lidongyang/anaconda3/envs/llama3/lib/python3.9/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/lidongyang/anaconda3/envs/llama3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/lidongyang/anaconda3/envs/llama3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    result = agent.run()
  File "/home/lidongyang/anaconda3/envs/llama3/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/home/lidongyang/anaconda3/envs/llama3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 727, in run
    result = self._invoke_run(role)
  File "/home/lidongyang/anaconda3/envs/llama3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 862, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/home/lidongyang/anaconda3/envs/llama3/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/home/lidongyang/anaconda3/envs/llama3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 699, in _initialize_workers
    self._rendezvous(worker_group)
  File "/home/lidongyang/anaconda3/envs/llama3/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/home/lidongyang/anaconda3/envs/llama3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 542, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "/home/lidongyang/anaconda3/envs/llama3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 55, in next_rendezvous
    self._store = TCPStore( # type: ignore[call-arg]
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
```
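
For context, errno 98 means that something on the machine is already bound to port 29500, which is the default port torchrun uses for its rendezvous store. Below is a minimal sketch for checking that from Python and for picking an unused port instead. The helper names `port_is_free` and `find_free_port` are hypothetical, and I am only assuming that the conflict comes from a second torchrun launch on the same machine:

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically errno 98 (EADDRINUSE) on Linux when the port is taken.
            return False
        return True


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    # 29500 is the port named in the error message above.
    print("29500 free:", port_is_free(29500))
    print("candidate port:", find_free_port())
```

If the check reports that 29500 is taken, the candidate port could be passed to the second launch through torchrun's `--master_port` option. The port might still be claimed by another process between the check and the launch, so this is a diagnostic aid and not a guarantee.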