Distributed package doesn't have NCCL / The requested address is not valid in its context.
(venv) D:\Downloads\LLaMA>torchrun --nproc_per_node 2 example.py --ckpt_dir models/13B --tokenizer_path models/tokenizer.model
NOTE: Redirects are currently not supported in Windows or MacOs.
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [ChrisPC]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [ChrisPC]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [ChrisPC]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [ChrisPC]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [ChrisPC]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [ChrisPC]:29500 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
Traceback (most recent call last):
File "D:\Downloads\LLaMA\example.py", line 119, in <module>
File "D:\Downloads\LLaMA\example.py", line 119, in <module>
fire.Fire(main)
File "D:\Downloads\LLaMA\venv\lib\site-packages\fire\core.py", line 141, in Fire
fire.Fire(main)
File "D:\Downloads\LLaMA\venv\lib\site-packages\fire\core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "D:\Downloads\LLaMA\venv\lib\site-packages\fire\core.py", line 475, in _Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "D:\Downloads\LLaMA\venv\lib\site-packages\fire\core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "D:\Downloads\LLaMA\venv\lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
component, remaining_args = _CallAndUpdateTrace(
File "D:\Downloads\LLaMA\venv\lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "D:\Downloads\LLaMA\example.py", line 74, in main
component = fn(*varargs, **kwargs)
File "D:\Downloads\LLaMA\example.py", line 74, in main
local_rank, world_size = setup_model_parallel()
File "D:\Downloads\LLaMA\example.py", line 23, in setup_model_parallel
local_rank, world_size = setup_model_parallel()
File "D:\Downloads\LLaMA\example.py", line 23, in setup_model_parallel
torch.distributed.init_process_group("nccl")
File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 761, in init_process_group
torch.distributed.init_process_group("nccl")
File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 761, in init_process_group
default_pg = _new_process_group_helper(
File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 886, in _new_process_group_helper
default_pg = _new_process_group_helper(
File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 886, in _new_process_group_helper
raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 9708) of binary: D:\Downloads\LLaMA\venv\Scripts\python.exe
Traceback (most recent call last):
File "C:\Users\chris\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\chris\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\Downloads\LLaMA\venv\Scripts\torchrun.exe\__main__.py", line 7, in <module>
File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\run.py", line 762, in main
run(args)
File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\run.py", line 753, in run
elastic_launch(
File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\launcher\api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\launcher\api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-03-04_18:21:06
host : ChrisPC
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2288)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-04_18:21:06
host : ChrisPC
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 9708)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
NCCL is not available on Windows. Either switch to Linux or change "nccl" to "gloo" in the torch.distributed.init_process_group("nccl") call in example.py (line 23 in the traceback above).
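A minimal sketch of that change, assuming you would rather pick the backend at runtime than hard-code it. The helper name pick_backend is hypothetical and not part of example.py; the torch.distributed calls are only attempted if PyTorch can be imported.

```python
import sys


def pick_backend(platform: str = sys.platform, nccl_available: bool = False) -> str:
    """Return "nccl" only where it can work, otherwise fall back to "gloo".

    pick_backend is a hypothetical helper for illustration, not part of example.py.
    """
    if platform.startswith("win") or not nccl_available:
        return "gloo"
    return "nccl"


# Ask the installed PyTorch build whether NCCL was compiled in.
try:
    import torch.distributed as dist

    have_nccl = dist.is_available() and dist.is_nccl_available()
except ImportError:
    # PyTorch is not installed in this environment.
    have_nccl = False

backend = pick_backend(nccl_available=have_nccl)
print(f"backend to use: {backend}")

# In setup_model_parallel() the hard-coded call would then become:
#     torch.distributed.init_process_group(backend)
```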
Won't that use CPU instead of GPU?
NCCL is a pain. I'm assuming you are running this on Windows in conda or a similar environment? The easiest way is to just use the HPC SDK, as it includes NCCL. However, you will most likely have to download the tar from NVIDIA and extract it yourself. Ensure you have full privileges or it won't work. https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html
@Inserian I encounter the same error on Ubuntu 20.04 with the nvidia-hpc-sdk module enabled. Do you know if there might be another error preventing llama from using NCCL?
I assumed we would just be running the smaller models on our own GPU without distributed training. Any chance an RTX 4080 can run 13B if we trade off VRAM for generation time?
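For a rough sense of scale, here is a back-of-the-envelope estimate of the weight memory alone. It assumes about 13 billion parameters stored at 2 bytes each (fp16) and a 16 GB RTX 4080; activations and the cache are ignored, so treat it as a lower bound rather than a measurement.

```python
# Rough weight-only memory estimate; all figures are approximations.
PARAMS = 13e9           # assumed parameter count for the "13B" model
BYTES_PER_PARAM = 2     # assumed fp16 storage
GPU_VRAM_GIB = 16       # RTX 4080 memory size

weights_gib = PARAMS * BYTES_PER_PARAM / 2**30
fits_on_one_gpu = weights_gib <= GPU_VRAM_GIB

print(f"weights alone: {weights_gib:.1f} GiB, fits in {GPU_VRAM_GIB} GiB: {fits_on_one_gpu}")
```

Under these assumptions the fp16 weights alone exceed a single 16 GB card, so they would have to be split across devices, offloaded, or stored at lower precision.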
I had the same issues y'all described. I tried everything I could find and finally found my problem: if you install PyTorch via conda, the standard package is CPU-only. Here is a link with further information on how to download the GPU variant of PyTorch:
https://pytorch.org/get-started/locally/
I hope this helps at least some of you.
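To check which variant you actually have, something like the sketch below can help. describe_torch_build is a hypothetical helper name; it only reads attributes that PyTorch exposes, and it reports "installed: False" if PyTorch cannot be imported at all.

```python
def describe_torch_build() -> dict:
    """Summarize whether the installed PyTorch build has CUDA and NCCL support.

    describe_torch_build is a hypothetical helper for illustration.
    """
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        return {"installed": False}

    return {
        "installed": True,
        "version": torch.__version__,
        # None on CPU-only builds, otherwise the CUDA version the build targets.
        "cuda_build": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
        # False when the distributed package was built without NCCL.
        "nccl_built": dist.is_available() and dist.is_nccl_available(),
    }


info = describe_torch_build()
for key, value in info.items():
    print(f"{key}: {value}")
```

If cuda_build prints None, the installed package is a CPU-only build and should be replaced with a GPU build from the link above. If nccl_built prints False, init_process_group("nccl") will fail with the error in the title.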
Seems like the issue is resolved by suggestions above. Please re-open as needed with more detail.