Distributed package doesn't have NCCL / The requested address is not valid in its context.

Tophness opened this issue · 5 comments

(venv) D:\Downloads\LLaMA>torchrun --nproc_per_node 2 example.py --ckpt_dir models/13B --tokenizer_path models/tokenizer.model
NOTE: Redirects are currently not supported in Windows or MacOs.
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [ChrisPC]:29500 (system error: 10049 - The requested address is not valid in its context.).
(the warning above is printed six times)
(both worker processes print the same traceback; the interleaved copies are deduplicated below)
Traceback (most recent call last):
  File "D:\Downloads\LLaMA\example.py", line 119, in <module>
    fire.Fire(main)
  File "D:\Downloads\LLaMA\venv\lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "D:\Downloads\LLaMA\venv\lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "D:\Downloads\LLaMA\venv\lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "D:\Downloads\LLaMA\example.py", line 74, in main
    local_rank, world_size = setup_model_parallel()
  File "D:\Downloads\LLaMA\example.py", line 23, in setup_model_parallel
    torch.distributed.init_process_group("nccl")
  File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 761, in init_process_group
    default_pg = _new_process_group_helper(
  File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 886, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 9708) of binary: D:\Downloads\LLaMA\venv\Scripts\python.exe
Traceback (most recent call last):
  File "C:\Users\chris\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\chris\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\Downloads\LLaMA\venv\Scripts\torchrun.exe\__main__.py", line 7, in <module>
  File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\run.py", line 762, in main
    run(args)
  File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\run.py", line 753, in run
    elastic_launch(
  File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\launcher\api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\launcher\api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-03-04_18:21:06
  host      : ChrisPC
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2288)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-04_18:21:06
  host      : ChrisPC
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 9708)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Tophness · Mar 04 '23

NCCL is not available on Windows. Switch to Linux, or change "nccl" to "gloo" in example.py.
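
For reference, a minimal sketch of that change. The function below mirrors setup_model_parallel in the repo's example.py; treat it as an illustration rather than a verbatim copy:

```python
import os
from typing import Tuple

import torch
from fairscale.nn.model_parallel.initialize import initialize_model_parallel


def setup_model_parallel() -> Tuple[int, int]:
    # torchrun sets these environment variables for each worker process.
    local_rank = int(os.environ.get("LOCAL_RANK", -1))
    world_size = int(os.environ.get("WORLD_SIZE", -1))

    # "gloo" instead of "nccl": gloo is the backend that Windows builds
    # of torch.distributed actually ship with.
    torch.distributed.init_process_group("gloo")
    initialize_model_parallel(world_size)
    torch.cuda.set_device(local_rank)  # still requires a CUDA-enabled torch build

    # The seed must be the same in all processes.
    torch.manual_seed(1)
    return local_rank, world_size
```

With that change, the torchrun command from the top of the issue should at least get past init_process_group on Windows.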

neuhaus · Mar 04 '23

Won't that use CPU instead of GPU?

Tophness · Mar 04 '23

NCCL is a pain. I'm assuming you are running this on Windows in conda or a similar environment? The easiest way is to just use the HPC SDK, since it includes NCCL. However, you will most likely have to download the tar from NVIDIA and extract it yourself. Make sure you have full privileges or it won't work. https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html

Inserian · Mar 04 '23

@Inserian I encounter the same error on Ubuntu 20.04 with the nvidia-hpc-sdk module enabled. Do you know if there might be another error preventing LLaMA from using NCCL?
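
One way to narrow this down: the prebuilt PyTorch packages bundle their own NCCL when they have it at all, so a system-wide NCCL (from the HPC SDK or elsewhere) does not help if the installed torch build was compiled without it. A small diagnostic sketch:

```python
import torch
import torch.distributed as dist

# Plain introspection calls. If is_nccl_available() prints False, the
# installed torch build has no NCCL support compiled in, and installing
# NCCL at the system level will not change that.
print("torch:", torch.__version__)       # a "+cpu" suffix indicates a CPU-only build
print("CUDA build:", torch.version.cuda)  # None on CPU-only builds
print("NCCL available:", dist.is_nccl_available())
```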

TanaroSch · Mar 06 '23

I assumed we would just be running the smaller models on our own GPUs without distributed training. Any chance an RTX 4080 can run 13B if we trade off VRAM for generation time?

Tophness · Mar 07 '23

I had the same issues y'all described. So I tried everything I could find, and finally I found my problem: if you install PyTorch via conda, the standard package is CPU-only. The link below has further information on how to download the GPU variant of PyTorch.

https://pytorch.org/get-started/locally/
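
Once you have reinstalled a CUDA-enabled build (the selector on that page generates the exact install command for your OS and CUDA version), a quick sanity check might look like this:

```python
import torch

# On the CPU-only conda package these report False / None; after
# installing a CUDA build they should show your CUDA version and GPU.
print(torch.cuda.is_available())
print(torch.version.cuda)
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```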

I hope this helps at least some of you.

MaximilianDueppe · Jul 23 '23

Seems like the issue is resolved by the suggestions above. Please re-open as needed with more detail.

WuhanMonkey · Sep 06 '23