
RuntimeError: CUDA Setup failed despite GPU being available.

Open klyuhang9 opened this issue 2 years ago • 5 comments

System Info

Hi, friends,

My environment is:

  1. OS: CentOS Linux release 7.6.1810 (Core)

  2. CUDA: release 11.7, V11.7.99 (nvcc: NVIDIA (R) Cuda compiler driver, Copyright (c) 2005-2022 NVIDIA Corporation, Built on Wed_Jun__8_16:49:14_PDT_2022, Build cuda_11.7.r11.7/compiler.31442593_0)

  3. GPU: nvidia-smi from Mon Oct 9 17:33:48 2023 (NVIDIA-SMI 515.105.01, Driver Version 515.105.01, CUDA Version 11.7) shows eight Tesla V100-SXM2 32 GB cards, all with Persistence-M On, Disp.A Off, ECC Off, MIG N/A, Default compute mode:

       GPU  Name             Bus-Id             Temp  Perf  Pwr:Usage/Cap  Memory-Usage          GPU-Util
       0    Tesla V100-SXM2  00000000:00:09.0   40C   P0     57W / 300W    13052MiB / 32768MiB     0%
       1    Tesla V100-SXM2  00000000:00:0A.0   41C   P0     56W / 300W    18468MiB / 32768MiB     0%
       2    Tesla V100-SXM2  00000000:00:0B.0   34C   P0     40W / 300W        5MiB / 32768MiB     0%
       3    Tesla V100-SXM2  00000000:00:0C.0   40C   P0     58W / 300W    15992MiB / 32768MiB     0%
       4    Tesla V100-SXM2  00000000:00:0D.0   73C   P0    283W / 300W    24014MiB / 32768MiB   100%
       5    Tesla V100-SXM2  00000000:00:0E.0   36C   P0     41W / 300W        5MiB / 32768MiB     0%
       6    Tesla V100-SXM2  00000000:00:0F.0   40C   P0     54W / 300W     2216MiB / 32768MiB     0%
       7    Tesla V100-SXM2  00000000:00:10.0   37C   P0     40W / 300W        3MiB / 32768MiB     0%

  4. PyTorch: 2.0.1+cu117

  5. Model: Llama-2-7b-chat-hf (downloaded directly from Hugging Face: https://huggingface.co/meta-llama/Llama-2-7b-hf)

  6. Other libraries (pip list):

       accelerate 0.23.0  aiohttp 3.8.6  aiosignal 1.3.1  appdirs 1.4.4
       asttokens 2.4.0  async-timeout 4.0.3  attrs 23.1.0  backcall 0.2.0
       bitsandbytes 0.41.1  black 23.9.1  Brotli 1.1.0  certifi 2023.7.22
       charset-normalizer 3.3.0  click 8.1.7  cmake 3.25.0  coloredlogs 15.0.1
       datasets 2.14.5  decorator 5.1.1  dill 0.3.7  exceptiongroup 1.1.3
       executing 2.0.0  fairscale 0.4.13  filelock 3.12.4  fire 0.5.0
       frozenlist 1.4.0  fsspec 2023.6.0  huggingface-hub 0.17.3  humanfriendly 10.0
       idna 3.4  inflate64 0.3.1  ipython 8.16.1  jedi 0.19.1
       Jinja2 3.1.2  lit 15.0.7  llama 0.0.1 (/data/homework/zyb_wyh/llama-main)  llama-recipes 0.0.1
       loralib 0.1.2  MarkupSafe 2.1.3  matplotlib-inline 0.1.6  mpmath 1.3.0
       multidict 6.0.4  multiprocess 0.70.15  multivolumefile 0.2.3  mypy-extensions 1.0.0
       networkx 3.1  numpy 1.26.0  optimum 1.13.2  packaging 23.2
       pandas 2.1.1  parso 0.8.3  pathspec 0.11.2  peft 0.5.0
       pexpect 4.8.0  pickleshare 0.7.5  Pillow 9.3.0  pip 23.2.1
       platformdirs 3.11.0  prompt-toolkit 3.0.39  protobuf 4.24.4  psutil 5.9.5
       ptyprocess 0.7.0  pure-eval 0.2.2  py7zr 0.20.6  pyarrow 13.0.0
       pybcj 1.0.1  pycryptodomex 3.19.0  Pygments 2.16.1  pyppmd 1.0.0
       python-dateutil 2.8.2  pytz 2023.3.post1  PyYAML 6.0.1  pyzstd 0.15.9
       regex 2023.10.3  requests 2.31.0  safetensors 0.4.0  scipy 1.11.3
       sentencepiece 0.1.99  setuptools 68.2.2  six 1.16.0  stack-data 0.6.3
       sympy 1.12  termcolor 2.3.0  texttable 1.7.0  tokenize-rt 5.2.0
       tokenizers 0.14.1  tomli 2.0.1  torch 2.0.1+cu117  torchaudio 2.0.2+cu117
       torchvision 0.15.2+cu117  tqdm 4.66.1  traitlets 5.11.2  transformers 4.34.0
       triton 2.0.0  typing_extensions 4.8.0  tzdata 2023.3  urllib3 2.0.6
       wcwidth 0.2.8  wheel 0.41.2  xxhash 3.4.1  yarl 1.9.2

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

🐛 Describe the bug

I want to fine-tune the model Llama-2-7b-chat-hf, so I run:

    python -m llama_recipes.finetuning --use_peft --peft_method lora --model_name /data/homework/zyb_wyh/llama-main/model-llama/Llama-2-7b-chat-hf --output_dir model/

Error logs

However, I get the following output:

    /data/homework/env/anaconda3/envs/llama-recipes/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:166: UserWarning: /data/homework/env/anaconda3/envs/llama-recipes did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
      warn(msg)
    /data/homework/env/anaconda3/envs/llama-recipes/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:166: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}.. We select the PyTorch default libcudart.so, which is {torch.version.cuda},
    but this might missmatch with the CUDA version that is needed for bitsandbytes.
    To override this behavior set the BNB_CUDA_VERSION=<version string, e.g. 122> environmental variable
    For example, if you want to use the CUDA version 122
    BNB_CUDA_VERSION=122 python ...
    OR set the environmental variable in your .bashrc: export BNB_CUDA_VERSION=122
    In the case of a manual override, make sure you set the LD_LIBRARY_PATH, e.g.
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.2
      warn(msg)
    /data/homework/env/anaconda3/envs/llama-recipes/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:166: UserWarning: /usr/local/cuda/lib64 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
      warn(msg)
    The following directories listed in your path were found to be non-existent: {PosixPath('FILE')}
    CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
    DEBUG: Possible options found for libcudart.so: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}
    CUDA SETUP: PyTorch settings found: CUDA_VERSION=117, Highest Compute Capability: 7.0.
    CUDA SETUP: To manually override the PyTorch CUDA version please see: https://github.com/TimDettmers/bitsandbytes/blob/main/how_to_use_nonpytorch_cuda.md
    /data/homework/env/anaconda3/envs/llama-recipes/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:166: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU! If you run into issues with 8-bit matmul, you can try 4-bit quantization: https://huggingface.co/blog/4bit-transformers-bitsandbytes
      warn(msg)
    CUDA SETUP: Loading binary /data/homework/env/anaconda3/envs/llama-recipes/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so...
    /lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by /data/homework/env/anaconda3/envs/llama-recipes/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so)
    CUDA SETUP: Something unexpected happened. Please compile from source:
    git clone https://github.com/TimDettmers/bitsandbytes.git
    cd bitsandbytes
    CUDA_VERSION=117 make cuda11x_nomatmul
    python setup.py install
    Traceback (most recent call last):
      File "/data/homework/env/anaconda3/envs/llama-recipes/lib/python3.10/runpy.py", line 187, in _run_module_as_main
        mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
      File "/data/homework/env/anaconda3/envs/llama-recipes/lib/python3.10/runpy.py", line 146, in _get_module_details
        return _get_module_details(pkg_main_name, error)
      File "/data/homework/env/anaconda3/envs/llama-recipes/lib/python3.10/runpy.py", line 110, in _get_module_details
        __import__(pkg_name)
      File "/data/homework/env/anaconda3/envs/llama-recipes/lib/python3.10/site-packages/bitsandbytes/__init__.py", line 6, in <module>
        from . import cuda_setup, utils, research
      File "/data/homework/env/anaconda3/envs/llama-recipes/lib/python3.10/site-packages/bitsandbytes/research/__init__.py", line 1, in <module>
        from . import nn
      File "/data/homework/env/anaconda3/envs/llama-recipes/lib/python3.10/site-packages/bitsandbytes/research/nn/__init__.py", line 1, in <module>
        from .modules import LinearFP8Mixed, LinearFP8Global
      File "/data/homework/env/anaconda3/envs/llama-recipes/lib/python3.10/site-packages/bitsandbytes/research/nn/modules.py", line 8, in <module>
        from bitsandbytes.optim import GlobalOptimManager
      File "/data/homework/env/anaconda3/envs/llama-recipes/lib/python3.10/site-packages/bitsandbytes/optim/__init__.py", line 6, in <module>
        from bitsandbytes.cextension import COMPILED_WITH_CUDA
      File "/data/homework/env/anaconda3/envs/llama-recipes/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 20, in <module>
        raise RuntimeError('''
    RuntimeError: CUDA Setup failed despite GPU being available. Please run the following command to get more information:

    python -m bitsandbytes

    Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
    to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
    and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues
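For what it's worth, the proximate failure in the log above is the line `/lib64/libstdc++.so.6: version CXXABI_1.3.9' not found`: CentOS 7's system libstdc++ (GCC 4.8 era) does not export the C++ ABI version tag that the prebuilt bitsandbytes binary was linked against. A minimal sketch for listing which CXXABI tags a given libstdc++ exports (the helper name and default path are illustrative; this is just a crude stand-in for `strings /lib64/libstdc++.so.6 | grep CXXABI`):

```python
import os
import re


def list_cxxabi_versions(path="/lib64/libstdc++.so.6"):
    """Scan a shared library's raw bytes for CXXABI_1.3.x version tags.

    Rough equivalent of `strings <lib> | grep CXXABI`; enough to tell
    whether CXXABI_1.3.9 is present in the library at `path`."""
    with open(path, "rb") as f:
        data = f.read()
    tags = set(re.findall(rb"CXXABI_1\.3\.\d+", data))
    return sorted(tag.decode() for tag in tags)


if __name__ == "__main__":
    if os.path.exists("/lib64/libstdc++.so.6"):
        print(list_cxxabi_versions())
```

If CXXABI_1.3.9 is absent, common workarounds are to put a newer libstdc++ (for example the one shipped inside the conda environment, typically under `$CONDA_PREFIX/lib`) on LD_LIBRARY_PATH, or to compile bitsandbytes from source as the error message itself suggests.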

Expected behavior

That is all of my environment details. I don't know why this happens or how to solve it. I would sincerely appreciate any help. Thanks!

klyuhang9 avatar Oct 09 '23 09:10 klyuhang9

I ran echo $LD_LIBRARY_PATH and the result is /usr/local/cuda/lib64

I also ran torch.cuda.is_available() and the result is True

klyuhang9 avatar Oct 09 '23 09:10 klyuhang9

While I wait, many thanks in advance for any help.

klyuhang9 avatar Oct 09 '23 09:10 klyuhang9

@klyuhang9 this seems more of a path issue; I wonder if

adding the following to your ~/.bashrc helps. Make sure to re-initialize your bashrc afterwards with source ~/.bashrc.

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export CUDA_NVCC_EXECUTABLE=/usr/local/cuda/bin/nvcc
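After sourcing, it may be worth confirming that the new values actually reached the Python process. A small illustrative check using only the standard library (the function name is hypothetical):

```python
import os
import shutil


def check_cuda_env(cuda_home="/usr/local/cuda"):
    """Report whether the CUDA directories exported in ~/.bashrc are
    visible to this Python process."""
    return {
        # shutil.which returns the nvcc path if cuda_home/bin is on PATH, else None
        "nvcc_on_path": shutil.which("nvcc"),
        # exact-match check of the lib64 dir against LD_LIBRARY_PATH entries
        "lib64_in_ld_library_path": cuda_home + "/lib64"
        in os.environ.get("LD_LIBRARY_PATH", "").split(":"),
    }


if __name__ == "__main__":
    print(check_cuda_env())
```

If `lib64_in_ld_library_path` is False when run from the training environment, the exports did not propagate (e.g. the shell was not restarted, or the job runner uses a different login profile).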

HamidShojanazeri avatar Oct 13 '23 15:10 HamidShojanazeri

@HamidShojanazeri I tried, but it didn't help; I get the same error again.

klyuhang9 avatar Oct 14 '23 09:10 klyuhang9

I encountered the same problem, did you solve it?

chuqidecha avatar Nov 20 '23 10:11 chuqidecha

Please feel free to re-open if a fresh install/updating the path doesn't solve it!

init27 avatar Aug 19 '24 17:08 init27