llama-cookbook
RuntimeError: CUDA Setup failed despite GPU being available.
System Info
Hi friends,
My environment is:
1. CentOS Linux release 7.6.1810 (Core)
2. CUDA:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0
3. GPU:
Mon Oct 9 17:33:48 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01 Driver Version: 515.105.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:09.0 Off | Off |
| N/A 40C P0 57W / 300W | 13052MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:00:0A.0 Off | Off |
| N/A 41C P0 56W / 300W | 18468MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:00:0B.0 Off | Off |
| N/A 34C P0 40W / 300W | 5MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:00:0C.0 Off | Off |
| N/A 40C P0 58W / 300W | 15992MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... On | 00000000:00:0D.0 Off | Off |
| N/A 73C P0 283W / 300W | 24014MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... On | 00000000:00:0E.0 Off | Off |
| N/A 36C P0 41W / 300W | 5MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... On | 00000000:00:0F.0 Off | Off |
| N/A 40C P0 54W / 300W | 2216MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... On | 00000000:00:10.0 Off | Off |
| N/A 37C P0 40W / 300W | 3MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
4. PyTorch: 2.0.1+cu117
5. Model: Llama-2-7b-chat-hf (downloaded directly from Hugging Face: https://huggingface.co/meta-llama/Llama-2-7b-hf)
6. Other libraries: accelerate 0.23.0 aiohttp 3.8.6 aiosignal 1.3.1 appdirs 1.4.4 asttokens 2.4.0 async-timeout 4.0.3 attrs 23.1.0 backcall 0.2.0 bitsandbytes 0.41.1 black 23.9.1 Brotli 1.1.0 certifi 2023.7.22 charset-normalizer 3.3.0 click 8.1.7 cmake 3.25.0 coloredlogs 15.0.1 datasets 2.14.5 decorator 5.1.1 dill 0.3.7 exceptiongroup 1.1.3 executing 2.0.0 fairscale 0.4.13 filelock 3.12.4 fire 0.5.0 frozenlist 1.4.0 fsspec 2023.6.0 huggingface-hub 0.17.3 humanfriendly 10.0 idna 3.4 inflate64 0.3.1 ipython 8.16.1 jedi 0.19.1 Jinja2 3.1.2 lit 15.0.7 llama 0.0.1 /data/homework/zyb_wyh/llama-main llama-recipes 0.0.1 loralib 0.1.2 MarkupSafe 2.1.3 matplotlib-inline 0.1.6 mpmath 1.3.0 multidict 6.0.4 multiprocess 0.70.15 multivolumefile 0.2.3 mypy-extensions 1.0.0 networkx 3.1 numpy 1.26.0 optimum 1.13.2 packaging 23.2 pandas 2.1.1 parso 0.8.3 pathspec 0.11.2 peft 0.5.0 pexpect 4.8.0 pickleshare 0.7.5 Pillow 9.3.0 pip 23.2.1 platformdirs 3.11.0 prompt-toolkit 3.0.39 protobuf 4.24.4 psutil 5.9.5 ptyprocess 0.7.0 pure-eval 0.2.2 py7zr 0.20.6 pyarrow 13.0.0 pybcj 1.0.1 pycryptodomex 3.19.0 Pygments 2.16.1 pyppmd 1.0.0 python-dateutil 2.8.2 pytz 2023.3.post1 PyYAML 6.0.1 pyzstd 0.15.9 regex 2023.10.3 requests 2.31.0 safetensors 0.4.0 scipy 1.11.3 sentencepiece 0.1.99 setuptools 68.2.2 six 1.16.0 stack-data 0.6.3 sympy 1.12 termcolor 2.3.0 texttable 1.7.0 tokenize-rt 5.2.0 tokenizers 0.14.1 tomli 2.0.1 torch 2.0.1+cu117 torchaudio 2.0.2+cu117 torchvision 0.15.2+cu117 tqdm 4.66.1 traitlets 5.11.2 transformers 4.34.0 triton 2.0.0 typing_extensions 4.8.0 tzdata 2023.3 urllib3 2.0.6 wcwidth 0.2.8 wheel 0.41.2 xxhash 3.4.1 yarl 1.9.2
Information
- [X] The official example scripts
- [ ] My own modified scripts
🐛 Describe the bug
I want to fine-tune the model Llama-2-7b-chat-hf, so I run:
`python -m llama_recipes.finetuning --use_peft --peft_method lora --model_name /data/homework/zyb_wyh/llama-main/model-llama/Llama-2-7b-chat-hf --output_dir model/`
Error logs
However, I get the following output:
/data/homework/env/anaconda3/envs/llama-recipes/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:166: UserWarning: /data/homework/env/anaconda3/envs/llama-recipes did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
/data/homework/env/anaconda3/envs/llama-recipes/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:166: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}.
We select the PyTorch default libcudart.so, which is {torch.version.cuda}, but this might missmatch with the CUDA version that is needed for bitsandbytes.
To override this behavior set the BNB_CUDA_VERSION=<version string, e.g. 122> environmental variable.
For example, if you want to use the CUDA version 122: BNB_CUDA_VERSION=122 python ...
OR set the environmental variable in your .bashrc: export BNB_CUDA_VERSION=122
In the case of a manual override, make sure you set the LD_LIBRARY_PATH, e.g.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.2
warn(msg)
/data/homework/env/anaconda3/envs/llama-recipes/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:166: UserWarning: /usr/local/cuda/lib64 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
The following directories listed in your path were found to be non-existent: {PosixPath('FILE')}
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
DEBUG: Possible options found for libcudart.so: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}
CUDA SETUP: PyTorch settings found: CUDA_VERSION=117, Highest Compute Capability: 7.0.
CUDA SETUP: To manually override the PyTorch CUDA version please see: https://github.com/TimDettmers/bitsandbytes/blob/main/how_to_use_nonpytorch_cuda.md
/data/homework/env/anaconda3/envs/llama-recipes/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:166: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU! If you run into issues with 8-bit matmul, you can try 4-bit quantization: https://huggingface.co/blog/4bit-transformers-bitsandbytes
warn(msg)
CUDA SETUP: Loading binary /data/homework/env/anaconda3/envs/llama-recipes/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so...
/lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by /data/homework/env/anaconda3/envs/llama-recipes/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so)
CUDA SETUP: Something unexpected happened. Please compile from source:
git clone https://github.com/TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=117 make cuda11x_nomatmul
python setup.py install
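Before compiling from source, note that the decisive line in the log above is the loader error `CXXABI_1.3.9' not found from /lib64/libstdc++.so.6: CXXABI_1.3.9 was introduced with GCC 5, while CentOS 7's system toolchain is GCC 4.8, so the system libstdc++ is older than what the prebuilt bitsandbytes binary expects. A hedged diagnostic sketch (the conda-path workaround is an assumption about a typical Anaconda layout, not something the log confirms):

```shell
# The key failure in the log is:
#   /lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found
# CXXABI_1.3.9 corresponds to GCC >= 5; CentOS 7 ships GCC 4.8.
# List which ABI tags the system libstdc++ actually exports:
strings /lib64/libstdc++.so.6 2>/dev/null | grep CXXABI \
  || echo "no CXXABI tags found (or /lib64/libstdc++.so.6 missing)"

# Conda environments usually bundle a newer libstdc++. If one exists,
# preloading it is a common workaround (path is illustrative):
if [ -n "$CONDA_PREFIX" ] && [ -f "$CONDA_PREFIX/lib/libstdc++.so.6" ]; then
    export LD_PRELOAD="$CONDA_PREFIX/lib/libstdc++.so.6"
fi
```

If `CXXABI_1.3.9` is absent from the output, the prebuilt `.so` cannot load against the system library, and either a newer libstdc++ (e.g. from conda) or a from-source build against the old toolchain is needed.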
Traceback (most recent call last):
File "/data/homework/env/anaconda3/envs/llama-recipes/lib/python3.10/runpy.py", line 187, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "/data/homework/env/anaconda3/envs/llama-recipes/lib/python3.10/runpy.py", line 146, in _get_module_details
return _get_module_details(pkg_main_name, error)
File "/data/homework/env/anaconda3/envs/llama-recipes/lib/python3.10/runpy.py", line 110, in _get_module_details
__import__(pkg_name)
File "/data/homework/env/anaconda3/envs/llama-recipes/lib/python3.10/site-packages/bitsandbytes/__init__.py", line 6, in <module>
RuntimeError: CUDA Setup failed despite GPU being available. Please run the following command to get more information:
python -m bitsandbytes
Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues
Expected behavior
That is my whole environment. I don't know why this happens or how to solve it, and I would sincerely appreciate any help. Thanks!
When I run `echo $LD_LIBRARY_PATH`, the result is `/usr/local/cuda/lib64`.
And when I run `torch.cuda.is_available()`, the result is `True`.
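Note that `torch.cuda.is_available()` only exercises the CUDA runtime bundled inside the PyTorch wheel; bitsandbytes performs its own search for `libcudart.so` across `LD_LIBRARY_PATH` and fallback directories, which is why the two can disagree. A minimal stdlib-only sketch of that kind of search (an illustration, not the actual bitsandbytes code):

```python
# Rough sketch of the search bitsandbytes' cuda_setup performs: walk
# LD_LIBRARY_PATH entries (plus a common fallback) looking for the
# libcudart.so variants named in the warnings above.
import os
from pathlib import Path

CANDIDATES = ("libcudart.so", "libcudart.so.11.0", "libcudart.so.12.0")

def find_libcudart(extra_dirs=("/usr/local/cuda/lib64",)):
    dirs = os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep)
    hits = []
    for d in list(dirs) + list(extra_dirs):
        if not d or not Path(d).is_dir():
            continue  # skip empty entries and non-existent directories
        for name in CANDIDATES:
            candidate = Path(d) / name
            if candidate.exists():
                hits.append(candidate)
    return hits

if __name__ == "__main__":
    print(find_libcudart() or "libcudart.so not found on LD_LIBRARY_PATH")
```

If this finds nothing, bitsandbytes will also fail regardless of what PyTorch reports, which points at the path setup rather than the GPU itself.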
Many thanks in advance for any help.
@klyuhang9 This seems more like a path issue. I wonder if adding the following to your ~/.bashrc helps; make sure to re-initialize your bashrc afterwards with `source ~/.bashrc`.
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export CUDA_NVCC_EXECUTABLE=/usr/local/cuda/bin/nvcc
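After re-sourcing, it's worth confirming that the compiler driver and the dynamic loader both resolve the same CUDA install. A quick check, assuming the `/usr/local/cuda` layout used in the exports above:

```shell
# Reload the shell config so the new exports take effect
[ -f ~/.bashrc ] && source ~/.bashrc

# The nvcc on PATH should now be /usr/local/cuda/bin/nvcc
command -v nvcc || echo "nvcc not on PATH"

# The dynamic loader's view of libcudart (should list /usr/local/cuda/lib64)
ldconfig -p 2>/dev/null | grep libcudart || echo "libcudart not in ldconfig cache"

# And LD_LIBRARY_PATH itself, one entry per line
echo "$LD_LIBRARY_PATH" | tr ':' '\n'
```

If `nvcc` and `ldconfig` point at different CUDA trees, libraries like bitsandbytes that probe the system (rather than using PyTorch's bundled runtime) can pick up the wrong one.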
@HamidShojanazeri I tried, but it didn't help; I get the same error again.
I encountered the same problem, did you solve it?
Please feel free to re-open if a fresh install or updating the path doesn't solve it!