RuntimeError: Distributed package doesn't have NCCL built in
I was able to download the 7B weights on macOS Monterey. I get the following errors when I try to run the example from the README in my terminal:
torchrun --nproc_per_node 1 example.py --ckpt_dir download/model_size --tokenizer_path download/tokenizer.model
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 51512) of binary: /Users/username/opt/anaconda3/envs/pytorch/bin/python
.
.
.
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-04_14:30:38
host : COMPUTER.tld
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 51512)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
You will have to manually add NCCL. Make sure you have full privileges before choosing your install from NVIDIA. HPC-SDK is easiest, but downloading the tar and extracting it to /usr/local works just as well. https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html
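Before going down that path, it can help to check which distributed backends your PyTorch build actually ships with; a minimal sketch:

import torch.distributed as dist

# NCCL is only built into Linux/CUDA builds of PyTorch; macOS and CPU-only
# builds typically report False here, which is what triggers the RuntimeError.
print("nccl:", dist.is_nccl_available())
print("gloo:", dist.is_gloo_available())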
I am on a Mac Pro with M1 Max using 20 GPU cores - any idea how to resolve the NCCL issue given that no NVIDIA cards are installed?
I run print(torch.backends.mps.is_built()) and it returns True, but when I set torch.distributed.init_process_group("mps") in example.py and run it, it complains that mps cannot be found.
The error: ValueError: Invalid backend: 'mps'
Any ideas for getting the backend to run on an M1?
You can't resolve NCCL issues without NVIDIA hardware. There are other backends that can be used instead of NCCL, and there are other libraries that allow parallel workarounds, but I haven't tried them yet. As for torchrun, it looks like you didn't pass your MP value. If that still doesn't work, try python -m torch.distributed.run --nproc_per_node MP example.py --ckpt_dir $TARGET_FOLDER/model_size --tokenizer_path $TARGET_FOLDER/tokenizer.model (substituting your MP value, of course).
Hello guys, I am also interested in seeing how to run LLaMA (e.g. the 7B model) on a Mac M1 or M2, any solution?
same issue
me too!
I'm on a Macbook Pro M1 2022 and have the same problem.
Did anyone find out how to solve this error? I am having the same issue here.
For macOS, we have to use the C++ implementation: https://github.com/ggerganov/llama.cpp
Works like a charm on my side, with the 3 models that fit in my RAM ✌️
is it utilizing MPS acceleration from the M1 / M2 chip?
I also have the NCCL error: raise RuntimeError("Distributed package doesn't have NCCL " "built in") RuntimeError: Distributed package doesn't have NCCL built in
I have the same problem ... I use an M1 Pro
Same issue here on a MacBook Pro M1, 16 GB
raise RuntimeError("Distributed package doesn't have NCCL " "built in") RuntimeError: Distributed package doesn't have NCCL built in
For macOS, we have to use the C++ implementation: https://github.com/ggerganov/llama.cpp Works like a charm on my side, with the 3 models that fit in my RAM ✌️
is it utilizing MPS acceleration from the M1 / M2 chip?
It utilizes my iGPU to its fullest, and not much CPU, if that is your question.
There is a bit of customisation required to the newer model.py and generation.py files at minimum.
You need to register the MPS device with device = torch.device('mps') and then reference that in a few places, as well as changing .cuda() to .to(device)
Another change is switching torch.distributed.init_process_group from "nccl" to "gloo"
There are also a number of other CUDA references in torch that have to change, including tensors.
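A minimal sketch of that kind of edit, using a toy stand-in module rather than the actual Llama model (the real changes go wherever model.py / generation.py call .cuda() or init_process_group):

import torch
import torch.distributed as dist
import torch.nn as nn

# use gloo instead of nccl, since NCCL needs NVIDIA GPUs; torchrun normally
# supplies the rendezvous env vars, the explicit init_method is only needed
# to run this snippet standalone
if not dist.is_initialized():
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1)

# register the MPS device once, falling back to CPU if it is unavailable
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# replace .cuda() calls with .to(device); nn.Linear is just a placeholder
model = nn.Linear(8, 8).to(device)
tokens = torch.randn(1, 8).to(device)
print(model(tokens).shape)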
I have the same error when running torchrun --nproc_per_node 1 example.py --ckpt_dir download/model_size --tokenizer_path download/tokenizer.model in my Windows 11 conda environment, any solution?
I was able to run Llama 2 7B on Mac M2 with (https://github.com/aggiee/llama-v2-mps)
@aggiee your code returns an error message indicating that torch.polar() is not implemented for Metal Performance Shaders (MPS). I'm also running on an M2 Mac.
I have the same problem ... I use an M1 Pro
If you are referring to the following message, it is expected. It's due to M1/M2/MPS not supporting polar.out operator. It falls back to CPU for that specific operation and the warning is to inform the user about it: "UserWarning: The operator 'aten::polar.out' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_1aidzjezue/croot/pytorch_1687856425340/work/aten/src/ATen/mps/MPSFallback.mm:11.) freqs_cis = torch.polar(torch.ones_like(freqs), freqs) # complex64"
The solution is to set PYTORCH_ENABLE_MPS_FALLBACK=1 env variable to run this code. That should make it work (you will still see the user warning about polar.out, but the code should run past that)
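For example, prefix whatever launch command you are already using (script name and paths below are just placeholders):
PYTORCH_ENABLE_MPS_FALLBACK=1 torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir llama-2-7b/ --tokenizer_path tokenizer.model --max_seq_len 128 --max_batch_size 4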
This works, but performance is much slower.
In case you run Windows 10 like me, I had the same RuntimeError: Distributed package doesn't have NCCL built in error. To fix it I checked the code of the Llama class https://github.com/facebookresearch/llama/blob/6c7fe276574e78057f917549435a2554000a876d/llama/generation.py#L61-L62 and saw how torch.distributed is initialized. One can check all possible backends at https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group. I changed the code to initialize it with the gloo backend, dist.init_process_group(backend="gloo")
git diff:
Index: example_text_completion.py
===================================================================
diff --git a/example_text_completion.py b/example_text_completion.py
--- a/example_text_completion.py (revision 6c7fe276574e78057f917549435a2554000a876d)
+++ b/example_text_completion.py (date 1690453793087)
@@ -5,6 +5,9 @@
from llama import Llama
+import torch
+import torch.distributed as dist
+
def main(
ckpt_dir: str,
@@ -15,6 +18,8 @@
max_gen_len: int = 64,
max_batch_size: int = 4,
):
+ dist.init_process_group(backend="gloo")
+
generator = Llama.build(
ckpt_dir=ckpt_dir,
tokenizer_path=tokenizer_path,
@@ -52,4 +57,5 @@
if __name__ == "__main__":
+ print("Cuda support:", torch.cuda.is_available(),":", torch.cuda.device_count(), "devices")
fire.Fire(main)
After that change I was able to run
(base) H:\github\facebook\llama>torchrun --standalone --nnodes=1 example_text_completion.py --ckpt_dir llama-2-7b/ --tokenizer_path tokenizer.model --max_seq_len 128 --max_batch_size 4
NOTE: Redirects are currently not supported in Windows or MacOs.
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
Cuda support: True : 1 devices
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Loaded in 13.40 seconds
I believe the meaning of life is
> to be happy. I believe we are all born with the potential to be happy. The meaning of life is to be happy, but the way to get there is not always easy.
The meaning of life is to be happy. It is not always easy to be happy, but it is possible. I believe that
==================================
Simply put, the theory of relativity states that
> 1) time, space, and mass are relative, and 2) the speed of light is constant, regardless of the relative motion of the observer.
Let’s look at the first point first.
Relative Time and Space
The theory of relativity is built on the idea that time and space are relative
==================================
A brief message congratulating the team on the launch:
Hi everyone,
I just
> wanted to say a big congratulations to the team on the launch of the new website.
I think it looks fantastic and I'm sure it will be a huge success.
I look forward to working with you all on the next project.
Best wishes
==================================
Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
plush girafe => girafe peluche
cheese =>
> fromage
fish => poisson
giraffe => girafe
elephant => éléphant
cat => chat
giraffe => girafe
elephant => éléphant
cat => chat
giraffe => gira
==================================
Make sure you have enough RAM and GPU RAM. My GPU RAM consumption when the model is loaded:
Thu Jul 27 18:48:46 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 536.67 Driver Version: 536.67 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 WDDM | 00000000:0C:00.0 On | Off |
| 30% 40C P2 151W / 450W | 15160MiB / 24564MiB | 53% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
If you get an OOM error like the one below even though you have enough GPU RAM:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 23.99 GiB total capacity; 7.55 GiB already allocated; 14.84 GiB free; 7.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 21068) of binary: c:\Users\User\miniconda3\python.exe
make sure that you actually have enough system RAM. You can modify the page file to use disk as memory, see https://gist.github.com/REASY/567c48e021288df505140cad7e4562ab?permalink_comment_id=4650490#gistcomment-4650490
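If the OOM is caused by fragmentation rather than total memory, the max_split_size_mb hint mentioned in the error text can also be tried; a minimal sketch (the 128 MiB value is just an example to tune):

import os

# must be set before the first CUDA allocation, e.g. at the very top of the example script
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"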
Note: I fixed torchrun as well; one can modify torchrun-script.py to make it work. In my case I use miniconda, the full path is c:\Users\User\miniconda3\Scripts\torchrun-script.py, and I had to fix its first line to point to the full path of the Python shipped with miniconda:
#!c:\Users\User\miniconda3\python.exe
My env gathered via python -m torch.utils.collect_env
(base) H:\github\facebook\llama>python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 2.0.1
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A
OS: Microsoft Windows 10 Pro N
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A
Python version: 3.11.4 | packaged by Anaconda, Inc. | (main, Jul 5 2023, 13:47:18) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19045-SP0
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4090
Nvidia driver version: 536.67
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture=9
CurrentClockSpeed=3493
DeviceID=CPU0
Family=107
L2CacheSize=8192
L2CacheSpeed=
Manufacturer=AuthenticAMD
MaxClockSpeed=3493
Name=AMD Ryzen 9 3950X 16-Core Processor
ProcessorType=3
Revision=28928
Versions of relevant libraries:
[pip3] numpy==1.25.0
[pip3] torch==2.0.1
[pip3] torchaudio==2.0.2+cu117
[pip3] torchvision==0.15.2+cu117
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.8.0 hd77b12b_0
[conda] mkl 2023.1.0 h8bd8f75_46356
[conda] mkl-service 2.4.0 py311h2bbff1b_1
[conda] mkl_fft 1.3.6 py311hf62ec03_1
[conda] mkl_random 1.2.2 py311hf62ec03_1
[conda] numpy 1.25.1 pypi_0 pypi
[conda] numpy-base 1.25.0 py311hd01c5d8_0
[conda] pytorch 2.0.1 py3.11_cuda11.8_cudnn8_0 pytorch
[conda] pytorch-cuda 11.8 h24eeafa_5 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torch 2.0.1 pypi_0 pypi
[conda] torchaudio 2.0.2+cu117 pypi_0 pypi
[conda] torchvision 0.15.2 pypi_0 pypi
Just initialize with torch.distributed.init_process_group("gloo"): go to the generation.py file and find the lines that initialize the process group with "nccl", e.g.

if not torch.distributed.is_initialized():
    torch.distributed.init_process_group("nccl")

and change them so that gloo is used when CUDA is not available:

if not torch.distributed.is_initialized():
    if device == "cuda":
        torch.distributed.init_process_group("nccl")
    else:
        torch.distributed.init_process_group("gloo")
Seems like the issue was resolved with suggestions above. Feel free to re-open as needed. Closing
Why do we still not have a solution to this error?
I've been able to start execution after applying changes similar to https://github.com/facebookresearch/codellama/pull/18/files
https://github.com/pianistprogrammer/llama3/tree/main - get this one and clone the repo; I have made changes to some files to make it work. You can find them in the commit tree.
Hey @pianistprogrammer 👋🏻
I tried your fork but got an error:
RuntimeError: Placeholder storage has not been allocated on MPS device!
It's an M1 Pro. Any clue what the issue is?
Full logs:
(base) ➜ llama3-pianist git:(main) ✗ PYTORCH_ENABLE_MPS_FALLBACK=1 torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir Meta-Llama-3-8B-Instruct/ --tokenizer_path Meta-Llama-3-8B-Instruct/tokenizer.model --max_seq_len 128 --max_batch_size 4
W0513 11:17:12.135000 8470690496 torch/distributed/elastic/multiprocessing/redirects.py:27] NOTE: Redirects are currently not supported in Windows or MacOs.
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
/opt/miniconda3/lib/python3.12/site-packages/torch/__init__.py:747: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:433.)
_C._set_default_tensor_type(t)
Loaded in 37.49 seconds
[rank0]: Traceback (most recent call last):
[rank0]: File "/llama3-pianist/example_text_completion.py", line 64, in <module>
[rank0]: fire.Fire(main)
[rank0]: File "/opt/miniconda3/lib/python3.12/site-packages/fire/core.py", line 143, in Fire
[rank0]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/miniconda3/lib/python3.12/site-packages/fire/core.py", line 477, in _Fire
[rank0]: component, remaining_args = _CallAndUpdateTrace(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/miniconda3/lib/python3.12/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank0]: component = fn(*varargs, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/llama3-pianist/example_text_completion.py", line 51, in main
[rank0]: results = generator.text_completion(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/llama3-pianist/llama/generation.py", line 282, in text_completion
[rank0]: generation_tokens, generation_logprobs = self.generate(
[rank0]: ^^^^^^^^^^^^^^
[rank0]: File "/opt/miniconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/llama3-pianist/llama/generation.py", line 201, in generate
[rank0]: logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/miniconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/llama3-pianist/llama/model.py", line 291, in forward
[rank0]: h = self.tok_embeddings(tokens)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/miniconda3/lib/python3.12/site-packages/fairscale/nn/model_parallel/layers.py", line 136, in forward
[rank0]: output_parallel = F.embedding(
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/opt/miniconda3/lib/python3.12/site-packages/torch/nn/functional.py", line 2264, in embedding
[rank0]: return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: Placeholder storage has not been allocated on MPS device!
E0513 11:17:57.237000 8470690496 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 5741) of binary: /opt/miniconda3/bin/python
Traceback (most recent call last):
File "/opt/miniconda3/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/opt/miniconda3/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/lib/python3.12/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/opt/miniconda3/lib/python3.12/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/opt/miniconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_text_completion.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-05-13_11:17:57
host : 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 5741)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
I'm sorry about that. I have made a blog post on how to get it running locally: https://questionbump.com/question/how-can-i-run-chatgpt-using-llms-locally/