llama
Issue: "Redirects are currently not supported" on Windows and macOS, followed by RuntimeError: unmatched '}' in format string
Description: When running the command below with torchrun, a RuntimeError is raised with the message "unmatched '}' in format string." Run command:
torchrun --nproc_per_node 1 -- rd example.py --ckpt_dir ./models/7B --tokenizer_path ./models/tokenizer.model
I encountered an issue while running a script that involves redirecting output. It seems that redirects are currently not supported in Windows environments. This causes a runtime error with the following traceback:
NOTE: Redirects are currently not supported in Windows or macOS.
Traceback (most recent call last):
File "C:\Users\[username]\anaconda3\Scripts\torchrun-script.py", line 34, in <module>
sys.exit(load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')())
File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\run.py", line 794, in main
run(args)
File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\run.py", line 785, in run
elastic_launch(
File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\launcher\api.py", line 241, in launch_agent
result = agent.run()
File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 723, in run
result = self._invoke_run(role)
File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 858, in _invoke_run
self._initialize_workers(self._worker_group)
File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 692, in _initialize_workers
self._rendezvous(worker_group)
File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 546, in _rendezvous
store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\elastic\rendezvous\static_tcp_rendezvous.py", line 55, in next_rendezvous
self._store = TCPStore( # type: ignore[call-arg]
RuntimeError: unmatched '}' in format string
Environment: Operating System: Windows 10, Python Version: 3.9.13, Torch Version: 2.0.1
Please let me know if any further information is required to address this issue.
The same problem occurs on macOS.
Operating System: macOS 11.5.2, Python Version: 3.11.3, Torch Version: 2.0.1
Same issue on Windows 11.
torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir llama-2-7b/ --tokenizer_path tokenizer.model --max_seq_len 128 --max_batch_size 4
NOTE: Redirects are currently not supported in Windows or MacOs.
[W C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [alienware]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [alienware]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [alienware]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [alienware]:29500 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
File "I:\projects\llama\example_text_completion.py", line 55, in <module>
fire.Fire(main)
File "i:\apps\miniconda3\lib\site-packages\fire\core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "i:\apps\miniconda3\lib\site-packages\fire\core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "i:\apps\miniconda3\lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "I:\projects\llama\example_text_completion.py", line 18, in main
generator = Llama.build(
File "I:\projects\llama\llama\generation.py", line 62, in build
torch.distributed.init_process_group("nccl")
File "i:\apps\miniconda3\lib\site-packages\torch\distributed\distributed_c10d.py", line 907, in init_process_group
default_pg = _new_process_group_helper(
File "i:\apps\miniconda3\lib\site-packages\torch\distributed\distributed_c10d.py", line 1013, in _new_process_group_helper
raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 96416) of binary: i:\apps\miniconda3\python.exe
Traceback (most recent call last):
File "i:\apps\miniconda3\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "i:\apps\miniconda3\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "i:\apps\miniconda3\Scripts\torchrun.exe\__main__.py", line 7, in <module>
File "i:\apps\miniconda3\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "i:\apps\miniconda3\lib\site-packages\torch\distributed\run.py", line 794, in main
run(args)
File "i:\apps\miniconda3\lib\site-packages\torch\distributed\run.py", line 785, in run
elastic_launch(
File "i:\apps\miniconda3\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "i:\apps\miniconda3\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
example_text_completion.py FAILED
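Before editing anything, it can help to confirm which distributed backends your local torch build actually ships with (a quick diagnostic sketch; NCCL is typically absent from the Windows and macOS wheels, which is exactly what the error above reports):

```python
import torch.distributed as dist

# NCCL is usually not compiled into Windows/macOS torch wheels, which is
# why init_process_group("nccl") raises "Distributed package doesn't have
# NCCL built in" on those platforms.
print("nccl available:", dist.is_nccl_available())
print("gloo available:", dist.is_gloo_available())
print("mpi available: ", dist.is_mpi_available())
```

If nccl prints False and gloo prints True, the gloo workaround below is the relevant one.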
I figured this out on Windows: in llama\generation.py, line 62 (in build), change torch.distributed.init_process_group("nccl") to torch.distributed.init_process_group("gloo").
It actually works for me.
I tried the gloo change, but nothing changed for me. Still trying to resolve this issue.
Same issue with model llama-2-7b-chat. What I tried (I will update this):
- changing to torch.distributed.init_process_group("gloo") => doesn't work
- import os; os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo" => doesn't work
- different --max_batch_size values (1, 3, 6, etc.) => doesn't work
The gloo change (init_process_group("gloo")) did not work on Windows for me either.
Same here, any solution?
Same error here while trying to run example_chat_completion.py. System: macOS 14.0 (M1)
File "/opt/homebrew/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1268, in _new_process_group_helper
raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in
[2023-10-08 20:52:17,432] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 70815) of binary: /opt/homebrew/opt/python@3.11/bin/python3.11
I got past this problem by going into generation.py and changing nccl to gloo: find it and just replace it.
I have been trying to run it for 2 days, but I think 8 GB of VRAM and 32 GB of RAM are not enough. I suggest you use one of TheBloke's quantized models from Hugging Face and run it with llama.cpp on your Mac. You can even run it on your Mac's GPU.
But if you want to train on your own data, first learn how to train, then convert the result to a quantized model runnable with llama.cpp.
The model does initialize for me, but this error appears:
[2023-11-08 12:23:44,484] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[W socket.cpp:663] [c10d] The client socket has failed to connect to [CC]:29500 (system error: 10049 - The requested address is not valid in this context.).
[W socket.cpp:663] [c10d] The client socket has failed to connect to [CC]:29500 (system error: 10049 - The requested address is not valid in this context.).
initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
Traceback (most recent call last):
File "C:\Users\CC\Documents\INTELIGENCIA_ARTIFICIAL\prueba\llama\example_chat_completion.py", line 104, in <module>
fire.Fire(main)
File "C:\Python311\Lib\site-packages\fire\core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "C:\Python311\Lib\site-packages\fire\core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "C:\Python311\Lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "C:\Users\CC\Documents\INTELIGENCIA_ARTIFICIAL\prueba\llama\example_chat_completion.py", line 35, in main
generator = Llama.build(
File "C:\Users\CC\Documents\INTELIGENCIA_ARTIFICIAL\prueba\llama\llama\generation.py", line 92, in build
torch.cuda.set_device(local_rank)
File "C:\Python311\Lib\site-packages\torch\cuda\__init__.py", line 404, in set_device
torch._C._cuda_setDevice(device)
AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'
[2023-11-08 12:23:49,537] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 22516) of binary: C:\Python311\python.exe
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "C:\Python311\Scripts\torchrun.exe\__main__.py", line 7, in <module>
File "C:\Python311\Lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "C:\Python311\Lib\site-packages\torch\distributed\run.py", line 806, in main
run(args)
File "C:\Python311\Lib\site-packages\torch\distributed\run.py", line 797, in run
elastic_launch(
File "C:\Python311\Lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "C:\Python311\Lib\site-packages\torch\distributed\launcher\api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_chat_completion.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2023-11-08_12:23:49
host : CC
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 22516)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Looks like the gloo change resolved my issue, even though another issue comes up... but it is a different one now.
Has anyone resolved this issue?
I am able to run it in a Debian environment. Is there any solution for this issue on Windows? I tried updating generation.py from torch.distributed.init_process_group("nccl") to torch.distributed.init_process_group("gloo"), but it does not work.
The gloo change worked fine for me.
Adding on to this: if you are on a Mac and don't have CUDA support, comment out the line
torch.cuda.set_device(local_rank)
in ./llama/generation.py
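Rather than deleting the line, a guarded version keeps the same file working on CUDA machines too (a sketch; local_rank is hard-coded here, whereas the real script gets it from torchrun):

```python
import torch

# Sketch: only pin a CUDA device when CUDA actually exists, so the same
# code runs on Macs and CPU-only boxes without commenting anything out.
local_rank = 0  # hypothetical value; the real script derives this from torchrun's LOCAL_RANK
if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)
else:
    print("CUDA not available; skipping torch.cuda.set_device")
```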