llama
The client socket has timed out after 900s while trying to connect to (127.0.0.1, 29500)
Windows 10 Pro, Nvidia GeForce GTX 1080 (8 GB VRAM), Intel Core i7 CPU at 4 GHz, 64 GB RAM
I run this command for the 7B model:
torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir llama-2-7b/ --tokenizer_path tokenizer.model --max_seq_len 128 --max_batch_size 4
I get this error:
NOTE: Redirects are currently not supported in Windows or MacOs.
[E ..\torch\csrc\distributed\c10d\socket.cpp:860] [c10d] The client socket has timed out after 900s while trying to connect to (127.0.0.1, 29500).
Traceback (most recent call last):
  File "E:\ai\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "E:\ai\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "E:\ai\llama\env\Scripts\torchrun.exe\__main__.py", line 7, in <module>
  File "E:\ai\llama\env\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "E:\ai\llama\env\lib\site-packages\torch\distributed\run.py", line 794, in main
    run(args)
  File "E:\ai\llama\env\lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "E:\ai\llama\env\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "E:\ai\llama\env\lib\site-packages\torch\distributed\launcher\api.py", line 241, in launch_agent
    result = agent.run()
  File "E:\ai\llama\env\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "E:\ai\llama\env\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 723, in run
    result = self._invoke_run(role)
  File "E:\ai\llama\env\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 858, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "E:\ai\llama\env\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "E:\ai\llama\env\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 692, in _initialize_workers
    self._rendezvous(worker_group)
  File "E:\ai\llama\env\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "E:\ai\llama\env\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 546, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "E:\ai\llama\env\lib\site-packages\torch\distributed\elastic\rendezvous\static_tcp_rendezvous.py", line 55, in next_rendezvous
    self._store = TCPStore( # type: ignore[call-arg]
TimeoutError: The client socket has timed out after 900s while trying to connect to (127.0.0.1, 29500).
Thank you
Hello, I have run into this problem too. Have you solved it?
Not yet. Did you have this problem on your own computer, or on a company computer with a private network and security restrictions?
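To help narrow that down, here is a minimal sketch (my own, not something from the llama repo or PyTorch) that checks whether plain TCP connections over loopback work on the machine at all. The address 127.0.0.1 and port 29500 are taken from the error message; the helper names can_connect and loopback_roundtrip are hypothetical. It uses only Python's standard socket module, and it assumes that a firewall or security policy blocking loopback traffic would show up as a failed connect.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and blocked traffic.
        return False


def loopback_roundtrip(host: str = "127.0.0.1", port: int = 0) -> bool:
    """Open a temporary listener on host:port and try to connect to it.

    port=0 lets the operating system pick any free port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind((host, port))
        server.listen(1)
        bound_port = server.getsockname()[1]
        return can_connect(host, bound_port)


if __name__ == "__main__":
    # Is anything already listening on the port from the error message?
    print("listener already on 29500:", can_connect("127.0.0.1", 29500))

    # Can this machine talk to itself over loopback on some free port?
    print("round trip on a free port:", loopback_roundtrip())

    # Same check on 29500 itself; bind can fail if the port is taken or blocked.
    try:
        print("round trip on 29500:", loopback_roundtrip(port=29500))
    except OSError as exc:
        print("could not bind 29500:", exc)
```

If the round trip fails, a firewall or security policy is a plausible suspect. If it succeeds, the timeout more likely comes from how the rendezvous is set up by torchrun on this system, though this check alone cannot confirm that.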
You need to update your Windows 10 system.
fuck fuck