RuntimeError: Distributed package doesn't have NCCL built in
I was able to download the 7B weights on macOS Monterey. I get the following errors when I try to run the example from the README in my terminal:
torchrun --nproc_per_node 1 example.py --ckpt_dir download/model_size --tokenizer_path download/tokenizer.model
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 51512) of binary: /Users/username/opt/anaconda3/envs/pytorch/bin/python
.
.
.
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-04_14:30:38
host : COMPUTER.tld
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 51512)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
You will have to manually add NCCL. Make sure you have full privileges before choosing your install from NVIDIA. HPC-SDK is easiest, but downloading the tar and extracting it to /usr/local works just as well. https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html
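Before going down that path, it can help to check which distributed backends your PyTorch build actually ships with; a minimal sketch:

import torch.distributed as dist

# NCCL is only built into Linux/CUDA builds of PyTorch; macOS and CPU-only
# builds typically report False here, which is what triggers the RuntimeError.
print("nccl:", dist.is_nccl_available())
print("gloo:", dist.is_gloo_available())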
I am on a Mac Pro with M1 Max using 20 GPU cores - any idea how to resolve the NCCL issue given that no NVIDIA cards are installed?
I run print(torch.backends.mps.is_built()) and it returns True, but when I set torch.distributed.init_process_group("mps") in example.py and run it, it complains that mps cannot be found.
The error: ValueError: Invalid backend: 'mps'
Any ideas for getting the backend to run on an M1?
You can't resolve NCCL issues without NVIDIA hardware. There are other backends that can be used instead of NCCL, and there are other libraries that allow parallel workarounds, but I haven't tried them yet. As for torchrun, it looks like you didn't pass your MP value. If that still doesn't work, try python -m torch.distributed.run --nproc_per_node MP example.py --ckpt_dir $TARGET_FOLDER/model_size --tokenizer_path $TARGET_FOLDER/tokenizer.model (substituting your MP value, of course).
Hello guys, I am also interested in seeing how to run LLaMA (e.g. the 7B model) on a Mac M1 or M2, any solution?
same issue
me too!
I'm on a Macbook Pro M1 2022 and have the same problem.
Did anyone find out how to solve this error? I am having the same issue here.
For macOS, we have to use the C++ implementation: https://github.com/ggerganov/llama.cpp
Works like a charm on my side, with the 3 models that fit in my RAM ✌️
is it utilizing MPS acceleration from the M1 / M2 chip?
I also have the NCCL error: raise RuntimeError("Distributed package doesn't have NCCL " "built in") RuntimeError: Distributed package doesn't have NCCL built in
I have the same problem ... I use an M1 Pro
Same issue here on a MacBook Pro M1, 16 GB
raise RuntimeError("Distributed package doesn't have NCCL " "built in") RuntimeError: Distributed package doesn't have NCCL built in
For macOS, we have to use the C++ implementation: https://github.com/ggerganov/llama.cpp Works like a charm on my side, with the 3 models that fit in my RAM ✌️
is it utilizing MPS acceleration from the M1 / M2 chip?
It utilizes my iGPU to its fullest, and not much CPU, if that is your question.
There is a bit of customisation required to the newer model.py and generation.py files at minimum.
You need to register the MPS device with device = torch.device('mps') and then reference that in a few places, as well as changing .cuda() to .to(device)
Another change is switching torch.distributed.init_process_group from "nccl" to "gloo"
There are also a number of other CUDA references in torch that have to change, including tensors.
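A minimal sketch of that kind of edit, using a toy stand-in module rather than the actual Llama model (the real changes go wherever model.py / generation.py call .cuda() or init_process_group):

import torch
import torch.distributed as dist
import torch.nn as nn

# use gloo instead of nccl, since NCCL needs NVIDIA GPUs; torchrun normally
# supplies the rendezvous env vars, the explicit init_method is only needed
# to run this snippet standalone
if not dist.is_initialized():
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1)

# register the MPS device once, falling back to CPU if it is unavailable
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# replace .cuda() calls with .to(device); nn.Linear is just a placeholder
model = nn.Linear(8, 8).to(device)
tokens = torch.randn(1, 8).to(device)
print(model(tokens).shape)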
I have the same error when running torchrun --nproc_per_node 1 example.py --ckpt_dir download/model_size --tokenizer_path download/tokenizer.model in my Windows 11 conda environment, any solution?
I was able to run Llama 2 7B on Mac M2 with (https://github.com/aggiee/llama-v2-mps)
@aggiee your code returns an error message indicating that torch.polar() is not implemented for Metal Performance Shaders (MPS). I'm also running on an M2 Mac.
I have the same problem ... I use an M1 Pro
If you are referring to the following message, it is expected. It's due to M1/M2/MPS not supporting polar.out operator. It falls back to CPU for that specific operation and the warning is to inform the user about it: "UserWarning: The operator 'aten::polar.out' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_1aidzjezue/croot/pytorch_1687856425340/work/aten/src/ATen/mps/MPSFallback.mm:11.) freqs_cis = torch.polar(torch.ones_like(freqs), freqs) # complex64"
The solution is to set PYTORCH_ENABLE_MPS_FALLBACK=1 env variable to run this code. That should make it work (you will still see the user warning about polar.out, but the code should run past that)
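For example, prefix whatever launch command you are already using (script name and paths below are just placeholders):
PYTORCH_ENABLE_MPS_FALLBACK=1 torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir llama-2-7b/ --tokenizer_path tokenizer.model --max_seq_len 128 --max_batch_size 4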
This works, but performance is much slower.
In case you run Windows 10 like me, I had the same RuntimeError: Distributed package doesn't have NCCL built in error. To fix it I checked the code of the Llama class https://github.com/facebookresearch/llama/blob/6c7fe276574e78057f917549435a2554000a876d/llama/generation.py#L61-L62 and saw how torch.distributed is initialized. One can check all possible backends at https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group. I changed the code to initialize it with the gloo backend, dist.init_process_group(backend="gloo")
git diff:
Index: example_text_completion.py
===================================================================
diff --git a/example_text_completion.py b/example_text_completion.py
--- a/example_text_completion.py (revision 6c7fe276574e78057f917549435a2554000a876d)
+++ b/example_text_completion.py (date 1690453793087)
@@ -5,6 +5,9 @@
from llama import Llama
+import torch
+import torch.distributed as dist
+
def main(
ckpt_dir: str,
@@ -15,6 +18,8 @@
max_gen_len: int = 64,
max_batch_size: int = 4,
):
+ dist.init_process_group(backend="gloo")
+
generator = Llama.build(
ckpt_dir=ckpt_dir,
tokenizer_path=tokenizer_path,
@@ -52,4 +57,5 @@
if __name__ == "__main__":
+ print("Cuda support:", torch.cuda.is_available(),":", torch.cuda.device_count(), "devices")
fire.Fire(main)
After that change I was able to run
(base) H:\github\facebook\llama>torchrun --standalone --nnodes=1 example_text_completion.py --ckpt_dir llama-2-7b/ --tokenizer_path tokenizer.model --max_seq_len 128 --max_batch_size 4
NOTE: Redirects are currently not supported in Windows or MacOs.
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
Cuda support: True : 1 devices
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Loaded in 13.40 seconds
I believe the meaning of life is
> to be happy. I believe we are all born with the potential to be happy. The meaning of life is to be happy, but the way to get there is not always easy.
The meaning of life is to be happy. It is not always easy to be happy, but it is possible. I believe that
==================================
Simply put, the theory of relativity states that
> 1) time, space, and mass are relative, and 2) the speed of light is constant, regardless of the relative motion of the observer.
Let’s look at the first point first.
Relative Time and Space
The theory of relativity is built on the idea that time and space are relative
==================================
A brief message congratulating the team on the launch:
Hi everyone,
I just
> wanted to say a big congratulations to the team on the launch of the new website.
I think it looks fantastic and I'm sure it will be a huge success.
I look forward to working with you all on the next project.
Best wishes
==================================
Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
plush girafe => girafe peluche
cheese =>
> fromage
fish => poisson
giraffe => girafe
elephant => éléphant
cat => chat
giraffe => girafe
elephant => éléphant
cat => chat
giraffe => gira
==================================
Make sure you have enough RAM and GPU RAM. My GPU RAM consumption when the model is loaded:
Thu Jul 27 18:48:46 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 536.67 Driver Version: 536.67 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 WDDM | 00000000:0C:00.0 On | Off |
| 30% 40C P2 151W / 450W | 15160MiB / 24564MiB | 53% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
If you get an OOM error like the one below even though you have enough GPU RAM:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 23.99 GiB total capacity; 7.55 GiB already allocated; 14.84 GiB free; 7.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 21068) of binary: c:\Users\User\miniconda3\python.exe
make sure that you actually have enough system RAM. You can modify the page file to use disk as memory, see https://gist.github.com/REASY/567c48e021288df505140cad7e4562ab?permalink_comment_id=4650490#gistcomment-4650490
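If the OOM is caused by fragmentation rather than total memory, the max_split_size_mb hint mentioned in the error text can also be tried; a minimal sketch (the 128 MiB value is just an example to tune):

import os

# must be set before the first CUDA allocation, e.g. at the very top of the example script
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"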
Note: I fixed torchrun as well; one can modify torchrun-script.py to make it work. In my case I use miniconda, the full path is c:\Users\User\miniconda3\Scripts\torchrun-script.py, and I had to fix its first line to point to the full path of the Python shipped with miniconda:
#!c:\Users\User\miniconda3\python.exe
My env gathered via python -m torch.utils.collect_env
(base) H:\github\facebook\llama>python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 2.0.1
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A
OS: Microsoft Windows 10 Pro N
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A
Python version: 3.11.4 | packaged by Anaconda, Inc. | (main, Jul 5 2023, 13:47:18) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19045-SP0
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4090
Nvidia driver version: 536.67
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture=9
CurrentClockSpeed=3493
DeviceID=CPU0
Family=107
L2CacheSize=8192
L2CacheSpeed=
Manufacturer=AuthenticAMD
MaxClockSpeed=3493
Name=AMD Ryzen 9 3950X 16-Core Processor
ProcessorType=3
Revision=28928
Versions of relevant libraries:
[pip3] numpy==1.25.0
[pip3] torch==2.0.1
[pip3] torchaudio==2.0.2+cu117
[pip3] torchvision==0.15.2+cu117
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.8.0 hd77b12b_0
[conda] mkl 2023.1.0 h8bd8f75_46356
[conda] mkl-service 2.4.0 py311h2bbff1b_1
[conda] mkl_fft 1.3.6 py311hf62ec03_1
[conda] mkl_random 1.2.2 py311hf62ec03_1
[conda] numpy 1.25.1 pypi_0 pypi
[conda] numpy-base 1.25.0 py311hd01c5d8_0
[conda] pytorch 2.0.1 py3.11_cuda11.8_cudnn8_0 pytorch
[conda] pytorch-cuda 11.8 h24eeafa_5 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torch 2.0.1 pypi_0 pypi
[conda] torchaudio 2.0.2+cu117 pypi_0 pypi
[conda] torchvision 0.15.2 pypi_0 pypi
Just initialize with torch.distributed.init_process_group("gloo"): go to the generation.py file and find the lines that initialize the process group with "nccl", e.g.

if not torch.distributed.is_initialized():
    torch.distributed.init_process_group("nccl")

and change them so that gloo is used when CUDA is not available:

if not torch.distributed.is_initialized():
    if device == "cuda":
        torch.distributed.init_process_group("nccl")
    else:
        torch.distributed.init_process_group("gloo")
Seems like the issue was resolved with suggestions above. Feel free to re-open as needed. Closing
Why do we still not have a solution to this error?
I've been able to start execution after applying changes similar to https://github.com/facebookresearch/codellama/pull/18/files
https://github.com/pianistprogrammer/llama3/tree/main - get this one and clone the repo; I have made changes to some files to make it work. You can find them in the commit tree.
Hey @pianistprogrammer 👋🏻
I tried your fork but got an error:
RuntimeError: Placeholder storage has not been allocated on MPS device!
It's an M1 Pro. Any clue what the issue is?
Full logs:
(base) ➜ llama3-pianist git:(main) ✗ PYTORCH_ENABLE_MPS_FALLBACK=1 torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir Meta-Llama-3-8B-Instruct/ --tokenizer_path Meta-Llama-3-8B-Instruct/tokenizer.model --max_seq_len 128 --max_batch_size 4
W0513 11:17:12.135000 8470690496 torch/distributed/elastic/multiprocessing/redirects.py:27] NOTE: Redirects are currently not supported in Windows or MacOs.
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
/opt/miniconda3/lib/python3.12/site-packages/torch/__init__.py:747: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:433.)
_C._set_default_tensor_type(t)
Loaded in 37.49 seconds
[rank0]: Traceback (most recent call last):
[rank0]: File "/llama3-pianist/example_text_completion.py", line 64, in <module>
[rank0]: fire.Fire(main)
[rank0]: File "/opt/miniconda3/lib/python3.12/site-packages/fire/core.py", line 143, in Fire
[rank0]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/miniconda3/lib/python3.12/site-packages/fire/core.py", line 477, in _Fire
[rank0]: component, remaining_args = _CallAndUpdateTrace(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/miniconda3/lib/python3.12/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank0]: component = fn(*varargs, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/llama3-pianist/example_text_completion.py", line 51, in main
[rank0]: results = generator.text_completion(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/llama3-pianist/llama/generation.py", line 282, in text_completion
[rank0]: generation_tokens, generation_logprobs = self.generate(
[rank0]: ^^^^^^^^^^^^^^
[rank0]: File "/opt/miniconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/llama3-pianist/llama/generation.py", line 201, in generate
[rank0]: logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/miniconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/llama3-pianist/llama/model.py", line 291, in forward
[rank0]: h = self.tok_embeddings(tokens)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/miniconda3/lib/python3.12/site-packages/fairscale/nn/model_parallel/layers.py", line 136, in forward
[rank0]: output_parallel = F.embedding(
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/opt/miniconda3/lib/python3.12/site-packages/torch/nn/functional.py", line 2264, in embedding
[rank0]: return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: Placeholder storage has not been allocated on MPS device!
E0513 11:17:57.237000 8470690496 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 5741) of binary: /opt/miniconda3/bin/python
Traceback (most recent call last):
File "/opt/miniconda3/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/opt/miniconda3/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/lib/python3.12/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/opt/miniconda3/lib/python3.12/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/opt/miniconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_text_completion.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-05-13_11:17:57
host : 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 5741)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
I'm sorry about that. I have made a blog post on how to get it running locally: https://questionbump.com/question/how-can-i-run-chatgpt-using-llms-locally/