
The client socket has timed out after 900s while trying to connect to (127.0.0.1, 29500)

Open atigm opened this issue 2 years ago • 4 comments

Windows 10 Pro, NVIDIA GeForce GTX 1080 (8 GB VRAM), Intel Core i7 CPU at 4 GHz, 64 GB RAM

I run the 7B model with: torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir llama-2-7b/ --tokenizer_path tokenizer.model --max_seq_len 128 --max_batch_size 4

I get this error:

NOTE: Redirects are currently not supported in Windows or MacOs.
[E ..\torch\csrc\distributed\c10d\socket.cpp:860] [c10d] The client socket has timed out after 900s while trying to connect to (127.0.0.1, 29500).
Traceback (most recent call last):
  File "E:\ai\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "E:\ai\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "E:\ai\llama\env\Scripts\torchrun.exe\__main__.py", line 7, in <module>
  File "E:\ai\llama\env\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "E:\ai\llama\env\lib\site-packages\torch\distributed\run.py", line 794, in main
    run(args)
  File "E:\ai\llama\env\lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "E:\ai\llama\env\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "E:\ai\llama\env\lib\site-packages\torch\distributed\launcher\api.py", line 241, in launch_agent
    result = agent.run()
  File "E:\ai\llama\env\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "E:\ai\llama\env\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 723, in run
    result = self._invoke_run(role)
  File "E:\ai\llama\env\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 858, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "E:\ai\llama\env\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "E:\ai\llama\env\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 692, in _initialize_workers
    self._rendezvous(worker_group)
  File "E:\ai\llama\env\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "E:\ai\llama\env\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 546, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "E:\ai\llama\env\lib\site-packages\torch\distributed\elastic\rendezvous\static_tcp_rendezvous.py", line 55, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
TimeoutError: The client socket has timed out after 900s while trying to connect to (127.0.0.1, 29500).

Thank you

atigm avatar Aug 03 '23 12:08 atigm

Hello, I have run into this problem too. Have you solved it?

JinChow avatar Aug 05 '23 06:08 JinChow

Not yet. Did you hit this problem on your own computer, or on a company computer with a private network and security restrictions?
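
A quick way to tell those two situations apart might be to check whether a plain loopback TCP connection works at all on the machine. Below is a minimal sketch using only the Python standard library. The function names are made up for illustration, 29500 is simply the port from the error message, and the script does not involve torch in any way, so it can only hint at a network or firewall cause, not confirm one.

```python
import socket


def loopback_roundtrip(host="127.0.0.1", timeout=5.0):
    """Listen on an OS-chosen port and connect to it over loopback.

    Returns True if the connection succeeds, False on any socket error.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind((host, 0))  # port 0 lets the OS pick a free port
        server.listen(1)
        port = server.getsockname()[1]
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False


def port_accepts_connections(port, host="127.0.0.1", timeout=2.0):
    """Return True if something is currently accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
        client.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise
        return client.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print("loopback round trip works:", loopback_roundtrip())
    # 29500 is the port named in the timeout message of this issue.
    print("something is listening on 29500:", port_accepts_connections(29500))
```

If the loopback round trip itself fails, a local firewall or security policy is a plausible suspect. If it succeeds, the cause is more likely somewhere in the torchrun rendezvous setup than in the network.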

atigm avatar Aug 05 '23 08:08 atigm

You need to update your Windows 10 system.

xushaungchunzhu avatar May 11 '24 12:05 xushaungchunzhu

fuck fuck

xushaungchunzhu avatar May 11 '24 12:05 xushaungchunzhu