
"Redirects are currently not supported" on Windows and macOS: running torchrun raises RuntimeError "unmatched '}' in format string"

Open ericoder960803 opened this issue 1 year ago • 17 comments

Description: When running the command below, a RuntimeError is encountered with the message "unmatched '}' in format string."

torchrun --nproc_per_node 1 example.py --ckpt_dir ./models/7B --tokenizer_path ./models/tokenizer.model

I encountered this while running a script whose output torchrun tries to redirect. It seems that redirects are currently not supported in Windows environments, and the run fails with the following traceback:

NOTE: Redirects are currently not supported in Windows or macOS.
Traceback (most recent call last):
  File "C:\Users\[username]\anaconda3\Scripts\torchrun-script.py", line 34, in <module>
    sys.exit(load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')())
  File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\run.py", line 794, in main
    run(args)
  File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\launcher\api.py", line 241, in launch_agent
    result = agent.run()
  File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 723, in run
    result = self._invoke_run(role)
  File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 858, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 692, in _initialize_workers
    self._rendezvous(worker_group)
  File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 546, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\elastic\rendezvous\static_tcp_rendezvous.py", line 55, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
RuntimeError: unmatched '}' in format string

Environment: Operating System: Windows 10; Python Version: 3.9.13; Torch Version: 2.0.1

Please let me know if any further information is required to address this issue.

ericoder960803 avatar Jul 12 '23 14:07 ericoder960803

The same problem occurs on macOS.

Operating System: macOS 11.5.2; Python Version: 3.11.3; Torch Version: 2.0.1

Delagardi avatar Jul 18 '23 22:07 Delagardi

Same issue on Windows 11.

torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir llama-2-7b/ --tokenizer_path tokenizer.model --max_seq_len 128 --max_batch_size 4

NOTE: Redirects are currently not supported in Windows or MacOs.
[W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [alienware]:29500 (system error: 10049 - The requested address is not valid in its context.). (repeated four times)
Traceback (most recent call last):
  File "I:\projects\llama\example_text_completion.py", line 55, in <module>
    fire.Fire(main)
  File "i:\apps\miniconda3\lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "i:\apps\miniconda3\lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "i:\apps\miniconda3\lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "I:\projects\llama\example_text_completion.py", line 18, in main
    generator = Llama.build(
  File "I:\projects\llama\llama\generation.py", line 62, in build
    torch.distributed.init_process_group("nccl")
  File "i:\apps\miniconda3\lib\site-packages\torch\distributed\distributed_c10d.py", line 907, in init_process_group
    default_pg = _new_process_group_helper(
  File "i:\apps\miniconda3\lib\site-packages\torch\distributed\distributed_c10d.py", line 1013, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 96416) of binary: i:\apps\miniconda3\python.exe
Traceback (most recent call last):
  File "i:\apps\miniconda3\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "i:\apps\miniconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "i:\apps\miniconda3\Scripts\torchrun.exe\__main__.py", line 7, in <module>
  File "i:\apps\miniconda3\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "i:\apps\miniconda3\lib\site-packages\torch\distributed\run.py", line 794, in main
    run(args)
  File "i:\apps\miniconda3\lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "i:\apps\miniconda3\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "i:\apps\miniconda3\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

example_text_completion.py FAILED

sfcheng avatar Jul 19 '23 16:07 sfcheng

I figured this out on Windows: in llama/generation.py (line 62, in build), change torch.distributed.init_process_group("nccl") to torch.distributed.init_process_group("gloo").

It actually works for me.
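The workaround above can be sketched as a small backend-selection helper. This is a hypothetical illustration, not code from the llama repo: NCCL is only compiled into Linux CUDA builds of PyTorch, while the Windows and macOS wheels ship without it, so those platforms need the CPU-only "gloo" backend.

```python
import platform

def pick_backend() -> str:
    """Choose a torch.distributed backend likely to exist on this platform.

    Hypothetical helper: NCCL ships only in Linux CUDA builds of PyTorch;
    Windows and macOS wheels lack it, so fall back to the CPU-only "gloo"
    backend there.
    """
    return "nccl" if platform.system() == "Linux" else "gloo"

# The hardcoded call in llama/generation.py would then become:
#     torch.distributed.init_process_group(pick_backend())
print(pick_backend())
```

Note that gloo runs inference on CPU, so it avoids the crash but will be much slower than a CUDA/NCCL setup.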

HelixNGC7293 avatar Jul 20 '23 04:07 HelixNGC7293

I figured this out on Windows: in llama/generation.py (line 62, in build), change torch.distributed.init_process_group("nccl") to torch.distributed.init_process_group("gloo").

I tried this, but nothing changed for me. Still trying to resolve this issue.

MaximilianDueppe avatar Jul 20 '23 16:07 MaximilianDueppe

Same issue with model llama-2-7b-chat. What I tried (I will update this):

  1. Changing the backend to torch.distributed.init_process_group("gloo") => doesn't work
  2. Setting os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo" (after import os) => doesn't work
  3. Different --max_batch_size values (1, 3, 6, etc.) => doesn't work
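One more avenue worth trying (my assumption, not something confirmed in this thread): the crash happens inside torchrun's TCP rendezvous, so a single-process run can bypass torchrun entirely by launching the script with plain python after setting the environment variables torchrun would normally provide. A minimal sketch:

```python
import os

# Hypothetical single-process setup so torch.distributed.init_process_group
# can succeed without torchrun's rendezvous (which is what fails on Windows).
# These are the standard variables torch.distributed reads at init time.
for key, value in {
    "MASTER_ADDR": "127.0.0.1",  # rendezvous endpoint on the local machine
    "MASTER_PORT": "29500",      # torchrun's default port
    "RANK": "0",                 # single process => global rank 0
    "LOCAL_RANK": "0",           # single process => local rank 0
    "WORLD_SIZE": "1",           # single process => world size 1
}.items():
    os.environ.setdefault(key, value)

# After this, `python example.py ...` could replace
# `torchrun --nproc_per_node 1 example.py ...` for one-process runs.
```

This only makes sense for --nproc_per_node 1; multi-process launches still need a launcher.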

MirunaClinciu avatar Aug 24 '23 10:08 MirunaClinciu

I figured this out on Windows: in llama/generation.py (line 62, in build), change torch.distributed.init_process_group("nccl") to torch.distributed.init_process_group("gloo").

It actually works for me.

It did not work on Windows for me either.

ajithkumar666 avatar Sep 11 '23 14:09 ajithkumar666

Same here. Any solution?

ghost avatar Oct 13 '23 05:10 ghost

Same error here while trying to run example_chat_completion.py. System: macOS 14.0 (M1).

  File "/opt/homebrew/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1268, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in
[2023-10-08 20:52:17,432] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 70815) of binary: /opt/homebrew/opt/[email protected]/bin/python3.11

mateury avatar Oct 13 '23 06:10 mateury

I got past this problem by going to generation.py and changing "nccl" to "gloo": find it and just replace it.

I have been trying to run it for two days, but I think 8 GB of VRAM and 32 GB of RAM are not enough. I suggest using one of TheBloke's quantized models from Hugging Face and running it with llama.cpp on your Mac; you can even run it on your Mac's GPU.

But if you want to train on your own data, first learn how to train, then convert the result to a quantized model runnable with llama.cpp.

ghost avatar Oct 13 '23 09:10 ghost

The model does initialize for me, but this error appears:

[2023-11-08 12:23:44,484] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[W socket.cpp:663] [c10d] The client socket has failed to connect to [CC]:29500 (system error: 10049 - The requested address is not valid in this context.).
[W socket.cpp:663] [c10d] The client socket has failed to connect to [CC]:29500 (system error: 10049 - The requested address is not valid in this context.).

initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
Traceback (most recent call last):
  File "C:\Users\CC\Documents\INTELIGENCIA_ARTIFICIAL\prueba\llama\example_chat_completion.py", line 104, in <module>
    fire.Fire(main)
  File "C:\Python311\Lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "C:\Python311\Lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "C:\Python311\Lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "C:\Users\CC\Documents\INTELIGENCIA_ARTIFICIAL\prueba\llama\example_chat_completion.py", line 35, in main
    generator = Llama.build(
  File "C:\Users\CC\Documents\INTELIGENCIA_ARTIFICIAL\prueba\llama\llama\generation.py", line 92, in build
    torch.cuda.set_device(local_rank)
  File "C:\Python311\Lib\site-packages\torch\cuda\__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'
[2023-11-08 12:23:49,537] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 22516) of binary: C:\Python311\python.exe
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Python311\Scripts\torchrun.exe\__main__.py", line 7, in <module>
  File "C:\Python311\Lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "C:\Python311\Lib\site-packages\torch\distributed\run.py", line 806, in main
    run(args)
  File "C:\Python311\Lib\site-packages\torch\distributed\run.py", line 797, in run
    elastic_launch(
  File "C:\Python311\Lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "C:\Python311\Lib\site-packages\torch\distributed\launcher\api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_chat_completion.py FAILED


Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time       : 2023-11-08_12:23:49
  host       : CC
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 22516)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Nehe12 avatar Nov 08 '23 19:11 Nehe12

I figured this out on Windows: in llama/generation.py (line 62, in build), change torch.distributed.init_process_group("nccl") to torch.distributed.init_process_group("gloo").

it actually works for me.

Looks like this resolved my issue, even though another issue comes up... but it is a different one now.

baokexu avatar Dec 03 '23 00:12 baokexu

Has anyone managed to resolve this issue?

bravelyi avatar Mar 02 '24 13:03 bravelyi

Has anyone managed to resolve this issue?

naget avatar Mar 06 '24 01:03 naget

I am able to run it in a Debian environment. Is there any solution for this issue on Windows? In generation.py I tried changing torch.distributed.init_process_group("nccl") to torch.distributed.init_process_group("gloo"), but it does not work.

shailenderjain avatar Mar 14 '24 07:03 shailenderjain

I figured this out on Windows: in llama/generation.py (line 62, in build), change torch.distributed.init_process_group("nccl") to torch.distributed.init_process_group("gloo").

it actually works for me.

Worked fine for me

alperinugur avatar Apr 23 '24 13:04 alperinugur

I figured this out on Windows: in llama/generation.py (line 62, in build), change torch.distributed.init_process_group("nccl") to torch.distributed.init_process_group("gloo").

it actually works for me.

Adding on to this: if you are on a Mac and don't have CUDA support, comment out the line torch.cuda.set_device(local_rank) in ./llama/generate.py.
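Rather than commenting the line out, the two fixes in this thread (gloo backend plus skipping the CUDA device call) can be combined into one guard. A hypothetical sketch, not the actual llama code; the cuda_available flag is passed in explicitly here so the logic can be shown without a GPU or even torch installed:

```python
def choose_device(local_rank: int, cuda_available: bool) -> str:
    """Hypothetical guard replacing an unconditional torch.cuda.set_device().

    On CPU-only machines (Windows without CUDA, Apple Silicon Macs) the
    CUDA call fails, so only select a CUDA device when CUDA support is
    actually present; otherwise stay on CPU.
    """
    if cuda_available:
        # Real code would call: torch.cuda.set_device(local_rank)
        return f"cuda:{local_rank}"
    return "cpu"

# In generation.py one would pass torch.cuda.is_available() as the flag.
print(choose_device(0, False))
```

With this guard, the same script runs unmodified on both CUDA and CPU-only machines.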

Robert-Jia00129 avatar Jun 18 '24 18:06 Robert-Jia00129

./llama/generate.py

montyc123 avatar Jun 18 '24 19:06 montyc123