
CUDA version incompatible, GPU not detected issue

Open JeongHanJun opened this issue 2 years ago • 10 comments

Hello, thank you very much for developing and sharing a great model, Llama 3. I'd like to run inference with it, and I was following the steps in the README.md file. However, an error occurred during execution, so I'm asking about the issue here.

There was no problem doing the following.

  1. In a conda env with PyTorch / CUDA available clone and download this repository.
  2. In the top-level directory run: $ pip install -e .
  3. Visit the Meta Llama website and register to download the model/s.
  4. Once registered, you will get an email with a URL to download the models. You will need this URL when you run the download.sh script.
  5. Once you get the email, navigate to your downloaded llama repository and run the download.sh script.

Make sure to grant execution permissions to the download.sh script. During this process, you will be prompted to enter the URL from the email. Do not use the "Copy Link" option; instead, manually copy the link from the email.

However, an error occurs when I run the step below. 6. Once the model/s you want have been downloaded, you can run the model locally using the command below:

torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir Meta-Llama-3-8B-Instruct/ \
    --tokenizer_path Meta-Llama-3-8B-Instruct/tokenizer.model \
    --max_seq_len 512 --max_batch_size 6

If I execute the above command, the following error is output.

[W CUDAFunctions.cpp:108] Warning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11070). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (function operator())
Traceback (most recent call last):
  File "example_chat_completion.py", line 84, in <module>
    fire.Fire(main)
  File "/opt/anaconda3/envs/llama3/lib/python3.8/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/anaconda3/envs/llama3/lib/python3.8/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/anaconda3/envs/llama3/lib/python3.8/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "example_chat_completion.py", line 31, in main
    generator = Llama.build(
  File "/database/hanjun/llama3/llama3/llama/generation.py", line 68, in build
    torch.distributed.init_process_group("nccl")
  File "/opt/anaconda3/envs/llama3/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 86, in wrapper
    func_return = func(*args, **kwargs)
  File "/opt/anaconda3/envs/llama3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1184, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "/opt/anaconda3/envs/llama3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1339, in _new_process_group_helper
    backend_class = ProcessGroupNCCL(
ValueError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
[2024-04-19 04:48:59,043] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 3342602) of binary: /opt/anaconda3/envs/llama3/bin/python
Traceback (most recent call last):
  File "/opt/anaconda3/envs/llama3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/anaconda3/envs/llama3/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/opt/anaconda3/envs/llama3/lib/python3.8/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/opt/anaconda3/envs/llama3/lib/python3.8/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/opt/anaconda3/envs/llama3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/anaconda3/envs/llama3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
example_chat_completion.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-19_04:48:59
  host      : tmaxrg
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3342602)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

In my opinion, the cause of the problem is that the NVIDIA driver is too old for the CUDA version PyTorch expects. It also seems the NCCL backend is unavailable because no GPU is detected. I would greatly appreciate it if you could let me know the recommended NVIDIA driver and the matching CUDA version. My Python version is 3.8.0, my CUDA version is 11.7, and I'm using two RTX 2080 Ti GPUs.
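One quick way to narrow this down (a sketch, not from the repository) is to check which CUDA version the installed PyTorch wheel was built against; if the wheel targets a newer CUDA than the installed driver supports, PyTorch reports exactly this "driver too old" warning:

```python
import torch

# The CUDA version this PyTorch wheel was compiled against (e.g. "12.1").
# If it is newer than the "CUDA Version" shown by nvidia-smi, the driver
# is too old for this wheel.
print("torch version :", torch.__version__)
print("built for CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
```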

When I execute the $ nvcc -V command, the output is as follows.

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0

When I execute the $ nvidia-smi command, the output is as follows.

Fri Apr 19 05:19:35 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 26%   37C    P8    12W / 257W |     14MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:03:00.0 Off |                  N/A |
| 25%   34C    P8    16W / 257W |      5MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1437      G   /usr/lib/xorg/Xorg                  8MiB |
|    0   N/A  N/A      1547      G   /usr/bin/gnome-shell                4MiB |

I would greatly appreciate your help in solving this problem.

JeongHanJun avatar Apr 19 '24 05:04 JeongHanJun

Hi @JeongHanJun, can you try to run the code without torchrun --nproc_per_node 1, like the following:

python3 example_chat_completion.py \
    --ckpt_dir Meta-Llama-3-8B-Instruct/ \
    --tokenizer_path Meta-Llama-3-8B-Instruct/tokenizer.model \
    --max_seq_len 512 --max_batch_size 6

bkhanal-11 avatar Apr 19 '24 05:04 bkhanal-11

Hello @bkhanal-11! If I run the command you suggested, the output is as follows.

Traceback (most recent call last):
  File "example_chat_completion.py", line 84, in <module>
    fire.Fire(main)
  File "/opt/anaconda3/envs/llama3/lib/python3.8/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/anaconda3/envs/llama3/lib/python3.8/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/anaconda3/envs/llama3/lib/python3.8/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "example_chat_completion.py", line 31, in main
    generator = Llama.build(
  File "/database/hanjun/llama3/llama3/llama/generation.py", line 68, in build
    torch.distributed.init_process_group("nccl")
  File "/opt/anaconda3/envs/llama3/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 86, in wrapper
    func_return = func(*args, **kwargs)
  File "/opt/anaconda3/envs/llama3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1177, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/anaconda3/envs/llama3/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 234, in _env_rendezvous_handler
    rank = int(_get_env_or_raise("RANK"))
  File "/opt/anaconda3/envs/llama3/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 219, in _get_env_or_raise
    raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
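For context, running the script with plain python3 still reaches torch.distributed.init_process_group, whose env:// rendezvous reads its settings from environment variables. For a single-process run they can be set manually before the script starts (a sketch; the values below are the usual single-node defaults, assumed here):

```python
import os

# torch.distributed's env:// rendezvous expects these four variables.
# torchrun normally sets them; for a one-process "world" we can set
# them ourselves before init_process_group is called.
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

print(os.environ["RANK"], os.environ["WORLD_SIZE"])
```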

I'm currently training another model on this machine, and I'm not sure whether that has anything to do with this error. When I run $ nvidia-smi, the output is as follows.

Fri Apr 19 06:25:04 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|100%   85C    P2   205W / 257W |  10144MiB / 11264MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:03:00.0 Off |                  N/A |
| 60%   75C    P2   213W / 257W |  10495MiB / 11264MiB |     96%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

When training finishes, I'll run your command again. Any additional advice on fixing the above error would be greatly appreciated!

JeongHanJun avatar Apr 19 '24 06:04 JeongHanJun

Hello @bkhanal-11! I stopped training the model and executed the command you gave me. The following is printed:

Traceback (most recent call last):
  File "example_chat_completion.py", line 84, in <module>
    fire.Fire(main)
  File "/opt/anaconda3/envs/llama3/lib/python3.8/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/anaconda3/envs/llama3/lib/python3.8/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/anaconda3/envs/llama3/lib/python3.8/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "example_chat_completion.py", line 31, in main
    generator = Llama.build(
  File "/database/hanjun/llama3/llama3/llama/generation.py", line 68, in build
    torch.distributed.init_process_group("nccl")
  File "/opt/anaconda3/envs/llama3/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 86, in wrapper
    func_return = func(*args, **kwargs)
  File "/opt/anaconda3/envs/llama3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1177, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/anaconda3/envs/llama3/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 234, in _env_rendezvous_handler
    rank = int(_get_env_or_raise("RANK"))
  File "/opt/anaconda3/envs/llama3/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 219, in _get_env_or_raise
    raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set

Please tell me what causes this error and how I can solve it.

JeongHanJun avatar Apr 19 '24 08:04 JeongHanJun

@JeongHanJun Can you make sure that PyTorch is correctly configured with CUDA? Maybe print the following:

import torch
print(torch.cuda.is_available())

This should print True. Since the original error says ProcessGroupNCCL is only supported with GPUs, no GPUs found!, I suspect the PyTorch installation is not correctly configured for CUDA.
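A slightly fuller check along the same lines (a sketch) also reports the CUDA version the wheel was compiled for and the devices PyTorch can actually see:

```python
import torch

print("CUDA available   :", torch.cuda.is_available())
print("compiled for CUDA:", torch.version.cuda)  # None on a CPU-only build
if torch.cuda.is_available():
    # Enumerate the devices visible to PyTorch.
    for i in range(torch.cuda.device_count()):
        print(f"device {i}:", torch.cuda.get_device_name(i))
```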

bkhanal-11 avatar Apr 19 '24 08:04 bkhanal-11

Same problem here!

I have 3 T4 GPUs. With --nproc_per_node 1 I get "torch.cuda.OutOfMemoryError: CUDA out of memory.", and without --nproc_per_node 1 I get "ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set".

>>> import torch
>>> pring(torch.cuda.is_available())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'pring' is not defined. Did you mean: 'print'?
>>> print(torch.cuda.is_available())
True
>>> quit()

(Llama3) xty@xty-server4:~/ChatLLM/llama3$ pip list
Package                  Version    Editable project location
------------------------ ---------- -------------------------
blobfile                 2.1.1
certifi                  2024.2.2
charset-normalizer       3.3.2
fairscale                0.4.13
filelock                 3.13.4
fire                     0.6.0
fsspec                   2024.3.1
idna                     3.7
Jinja2                   3.1.3
llama3                   0.0.1      /home/xty/ChatLLM/llama3
lxml                     4.9.4
MarkupSafe               2.1.5
mpmath                   1.3.0
networkx                 3.3
numpy                    1.26.4
nvidia-cublas-cu12       12.1.3.1
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12        8.9.2.26
nvidia-cufft-cu12        11.0.2.54
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu12     11.4.5.107
nvidia-cusparse-cu12     12.1.0.106
nvidia-nccl-cu12         2.19.3
nvidia-nvjitlink-cu12    12.4.127
nvidia-nvtx-cu12         12.1.105
pip                      24.0
pycryptodomex            3.20.0
regex                    2024.4.16
requests                 2.31.0
setuptools               69.5.1
six                      1.16.0
sympy                    1.12
termcolor                2.4.0
tiktoken                 0.4.0
torch                    2.2.2
triton                   2.2.0
typing_extensions        4.11.0
urllib3                  2.2.1
wheel                    0.43.0

(Llama3) xty@xty-server4:~/ChatLLM/llama3$ nvidia-smi
Fri Apr 19 20:03:37 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:31:00.0 Off |                    0 |
| N/A   42C    P8              12W /  70W |      6MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       Off | 00000000:98:00.0 Off |                    0 |
| N/A   45C    P8              12W /  70W |      6MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla T4                       Off | 00000000:B1:00.0 Off |                    0 |
| N/A   44C    P8              11W /  70W |      6MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|=======================================================================================|
|    0   N/A  N/A      2457      G   /usr/lib/xorg/Xorg                             4MiB |
|    1   N/A  N/A      2457      G   /usr/lib/xorg/Xorg                             4MiB |
|    2   N/A  N/A      2457      G   /usr/lib/xorg/Xorg                             4MiB |
+---------------------------------------------------------------------------------------+

can you give me some help?

Williamway7384 avatar Apr 19 '24 12:04 Williamway7384

Hi @Williamway7384, unfortunately my situation is slightly different from an out-of-memory problem, because in my case execution doesn't get that far at all. Looking at the pip list for this repository, several packages (torch 2.2.2 and the nvidia-*-cu12 libraries) appear to target CUDA 12.x, while I'm on CUDA 11.7, so I suspect that mismatch is the cause. If anyone is running this successfully, I'd greatly appreciate knowing your CUDA/cuDNN/torch versions and the versions of the CUDA-related libraries. Unfortunately, the GPUs I'm using are shared with others, so I can't freely upgrade CUDA; I'm planning to try llama3 in a different GPU environment instead.

If anyone has experienced a similar problem, or has solved it, please share how you fixed it!
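If upgrading the driver isn't possible, the other direction is to install a PyTorch build compiled for the CUDA version the driver already supports. Assuming CUDA 11.7 (as here), something like the following should work; torch 2.0.1 is, to my knowledge, the last release published with cu117 wheels, so treat the exact version as an assumption to verify on pytorch.org:

```shell
# Install PyTorch wheels built against CUDA 11.7 instead of the default
# cu12x build (the pinned version is an assumption; check pytorch.org's
# previous-versions page for the latest release shipping cu117 wheels).
pip install torch==2.0.1 --index-url https://download.pytorch.org/whl/cu117
```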

JeongHanJun avatar Apr 23 '24 01:04 JeongHanJun

Got the same error- any solutions?

sid0913 avatar May 01 '24 21:05 sid0913

@sid0913 You need to upgrade your CUDA version. I ran it in Google Colab, and it works well there.

JeongHanJun avatar May 02 '24 09:05 JeongHanJun


@Williamway7384 Hey, have you fixed this bug?

CharlesHehe avatar May 04 '24 02:05 CharlesHehe

@sid0913 You need to upgrade your CUDA version. I ran it in Google Colab, and it works well there.

I'm using CUDA 12.3, and 12.4 is the latest; it's not as old as the 11.x releases.

I'm hesitant to upgrade CUDA right now. Are you sure the version is the cause?

sid0913 avatar May 06 '24 23:05 sid0913