llama-cpp-python
Failed to load shared library \venv\Lib\site-packages\llama_cpp\llama.dll
Hi,
I am running llama-cpp-python on a Surface Book 2 with an i7 and an NVIDIA GeForce GTX 1060. I installed VC++ and the CUDA 12.4 drivers, and I am running Python 3.11.3. I compiled llama-cpp-python with the command below in a MinGW bash console:
CUDACXX="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin\nvcc.exe" CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCMAKE_CUDA_ARCHITECTURES=all-major" FORCE_CMAKE=1
pip install llama-cpp-python --no-cache-dir --force-reinstall --upgrade --verbose
It ran successfully and produced llama.dll. However, when I try to load it, it throws this error:
File "D:\My_Tech\GenAIPlay\venv\Lib\site-packages\llama_cpp\llama_cpp.py", line 72, in _load_shared_library
raise RuntimeError(f"Failed to load shared library '{_lib_path}': {e}")
RuntimeError: Failed to load shared library 'D:\My_Tech\GenAIPlay\venv\Lib\site-packages\llama_cpp\llama.dll': Could not find module 'D:\My_Tech\GenAIPlay\venv\Lib\site-packages\llama_cpp\llama.dll' (or one of its dependencies). Try using the full path with constructor syntax.
I tried fixing it, as suggested elsewhere, by modifying llama_cpp.py where the error is raised, as below, but it didn't work:
return ctypes.CDLL(str(_lib_path), winmode=0)
I also set the environment variables, but it still doesn't work. Can you please help me figure out how to fix it?
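A diagnostic sketch that may help narrow this down (the paths and DLL names below are examples for CUDA 12.x on this machine, adjust to your install): preload the CUDA runtime libraries by full path, then try llama.dll itself, so the error points at the specific dependency that fails.

# Diagnostic sketch: the paths and DLL names are examples, not part of llama-cpp-python.
import ctypes
import os

cuda_bin = r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin"
llama_dll = r"D:\My_Tech\GenAIPlay\venv\Lib\site-packages\llama_cpp\llama.dll"

# Preload the CUDA DLLs by full path first; if one of them fails, that is the
# missing dependency. If they all load but llama.dll still fails, the problem
# is the DLL search path rather than a missing CUDA install.
candidates = [
    os.path.join(cuda_bin, "cudart64_12.dll"),
    os.path.join(cuda_bin, "cublas64_12.dll"),
    os.path.join(cuda_bin, "cublasLt64_12.dll"),
    llama_dll,
]

for path in candidates:
    try:
        ctypes.CDLL(path)
        print("loaded :", path)
    except OSError as exc:
        print("FAILED :", path, "->", exc)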
$ echo $CUDA_PATH
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4
(venv)
umapa@UMA-SB2 MINGW64 /d/My_Tech/GenAIPlay
$ echo $PATH
D:\My_Tech\GenAIPlay\venv/Scripts:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\libnvvp;C:\Python311\Scripts\;C:\Python311\;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\WINDOWS\System32\OpenSSH\;C:\Users\umapa\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\Scripts;C:\Program Files\Git\cmd;C:\Program Files\Docker\Docker\resources\bin;C:\Program Files\nodejs\;C:\ProgramData\chocolatey\bin;C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\;C:\Program Files\NVIDIA Corporation\Nsight Compute 2024.1.0\;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files (x86)\Incredibuild;C:\Users\umapa\AppData\Local\Microsoft\WindowsApps;C:\Users\umapa\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\Scripts;;C:\Users\umapa\AppData\Local\Programs\Microsoft VS Code\bin;C:\Users\umapa\AppData\Roaming\npm;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\lib;D:\My_Tech\GenAIPlay\venv\Lib\site-packages\llama_cpp
*** Installing project into wheel...
-- Installing: C:\Users\umapa\AppData\Local\Temp\tmpffhzeu6k\wheel\platlib/lib/ggml_shared.lib
-- Installing: C:\Users\umapa\AppData\Local\Temp\tmpffhzeu6k\wheel\platlib/bin/ggml_shared.dll
-- Installing: C:\Users\umapa\AppData\Local\Temp\tmpffhzeu6k\wheel\platlib/lib/cmake/Llama/LlamaConfig.cmake
-- Installing: C:\Users\umapa\AppData\Local\Temp\tmpffhzeu6k\wheel\platlib/lib/cmake/Llama/LlamaConfigVersion.cmake
-- Installing: C:\Users\umapa\AppData\Local\Temp\tmpffhzeu6k\wheel\platlib/include/ggml.h
-- Installing: C:\Users\umapa\AppData\Local\Temp\tmpffhzeu6k\wheel\platlib/include/ggml-alloc.h
-- Installing: C:\Users\umapa\AppData\Local\Temp\tmpffhzeu6k\wheel\platlib/include/ggml-backend.h
-- Installing: C:\Users\umapa\AppData\Local\Temp\tmpffhzeu6k\wheel\platlib/include/ggml-cuda.h
-- Installing: C:\Users\umapa\AppData\Local\Temp\tmpffhzeu6k\wheel\platlib/lib/llama.lib
-- Installing: C:\Users\umapa\AppData\Local\Temp\tmpffhzeu6k\wheel\platlib/bin/llama.dll
-- Installing: C:\Users\umapa\AppData\Local\Temp\tmpffhzeu6k\wheel\platlib/include/llama.h
-- Installing: C:\Users\umapa\AppData\Local\Temp\tmpffhzeu6k\wheel\platlib/bin/convert.py
-- Installing: C:\Users\umapa\AppData\Local\Temp\tmpffhzeu6k\wheel\platlib/bin/convert-lora-to-ggml.py
-- Installing: C:/Users/umapa/AppData/Local/Temp/tmpffhzeu6k/wheel/platlib/llama_cpp/llama.lib
-- Installing: C:/Users/umapa/AppData/Local/Temp/tmpffhzeu6k/wheel/platlib/llama_cpp/llama.dll
-- Installing: C:/Users/umapa/AppData/Local/Temp/pip-install-cn0mqw5n/llama-cpp-python_bd51aa929f42429a9180b8d6bd519841/llama_cpp/llama.lib
-- Installing: C:/Users/umapa/AppData/Local/Temp/pip-install-cn0mqw5n/llama-cpp-python_bd51aa929f42429a9180b8d6bd519841/llama_cpp/llama.dll
-- Installing: C:\Users\umapa\AppData\Local\Temp\tmpffhzeu6k\wheel\platlib/lib/llava.lib
-- Installing: C:\Users\umapa\AppData\Local\Temp\tmpffhzeu6k\wheel\platlib/bin/llava.dll
-- Installing: C:\Users\umapa\AppData\Local\Temp\tmpffhzeu6k\wheel\platlib/bin/llava-cli.exe
-- Installing: C:/Users/umapa/AppData/Local/Temp/tmpffhzeu6k/wheel/platlib/llama_cpp/llava.lib
-- Installing: C:/Users/umapa/AppData/Local/Temp/tmpffhzeu6k/wheel/platlib/llama_cpp/llava.dll
-- Installing: C:/Users/umapa/AppData/Local/Temp/pip-install-cn0mqw5n/llama-cpp-python_bd51aa929f42429a9180b8d6bd519841/llama_cpp/llava.lib
-- Installing: C:/Users/umapa/AppData/Local/Temp/pip-install-cn0mqw5n/llama-cpp-python_bd51aa929f42429a9180b8d6bd519841/llama_cpp/llava.dll
*** Making wheel...
*** Created llama_cpp_python-0.2.56-cp311-cp311-win_amd64.whl...
Building wheel for llama-cpp-python (pyproject.toml) ... done
Created wheel for llama-cpp-python: filename=llama_cpp_python-0.2.56-cp311-cp311-win_amd64.whl size=22345276 sha256=ffffb1a35fc1e2b8a49a1c90d0e4f4a490ed71c298a29cd468816bbb6251aad7
Stored in directory: C:\Users\umapa\AppData\Local\Temp\pip-ephem-wheel-cache-px49kwop\wheels\f5\48\62\014b1a3c38f77df21219f81ed63ca4c09531d52a205b15d8e4
Successfully built llama-cpp-python
Installing collected packages: typing-extensions, numpy, MarkupSafe, diskcache, jinja2, llama-cpp-python
It worked after removing the cdll_args from the return statement in llama_cpp.py:
if _lib_path.exists():
    try:
        return ctypes.CDLL(str(_lib_path))
        # return ctypes.CDLL(str(_lib_path), **cdll_args)
However, I still see that the CPU is being used for compute and not the GTX 1060 graphics card. I am invoking the model as below:
llama_model = Llama(model_path=model_path, n_gpu_layers=50)
output = llama_model(question,max_tokens=5000)
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 2.51 B
llm_load_print_meta: model size = 1.39 GiB (4.75 BPW)
llm_load_print_meta: general.name = gemma-2b-it
llm_load_print_meta: BOS token = 2 '<bos>'
llm_load_print_meta: EOS token = 1 '<eos>'
llm_load_print_meta: UNK token = 3 '<unk>'
llm_load_print_meta: PAD token = 0 '<pad>'
llm_load_print_meta: LF token = 227 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.06 MiB
llm_load_tensors: offloading 18 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 19/19 layers to GPU
llm_load_tensors: CPU buffer size = 1420.21 MiB
............................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
WARNING: failed to allocate 9.00 MB of pinned memory: CUDA driver version is insufficient for CUDA runtime version
llama_kv_cache_init: CPU KV buffer size = 9.00 MiB
llama_new_context_with_model: KV self size = 9.00 MiB, K (f16): 4.50 MiB, V (f16): 4.50 MiB
WARNING: failed to allocate 6.01 MB of pinned memory: CUDA driver version is insufficient for CUDA runtime version
llama_new_context_with_model: CPU input buffer size = 6.01 MiB
WARNING: failed to allocate 504.25 MB of pinned memory: CUDA driver version is insufficient for CUDA runtime version
llama_new_context_with_model: CUDA_Host compute buffer size = 504.25 MiB
llama_new_context_with_model: graph splits (measure): 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
Model metadata: {'general.name': 'gemma-2b-it', 'general.architecture': 'gemma', 'gemma.context_length': '8192', 'gemma.block_count': '18', 'gemma.attention.head_count_kv': '1', 'gemma.embedding_length': '2048', 'gemma.feed_forward_length': '16384', 'gemma.attention.head_count': '8', 'gemma.attention.key_length': '256', 'gemma.attention.value_length': '256', 'gemma.attention.layer_norm_rms_epsilon': '0.000001', 'tokenizer.ggml.model': 'llama', 'general.quantization_version': '2', 'tokenizer.ggml.bos_token_id': '2', 'general.file_type': '15', 'tokenizer.ggml.eos_token_id': '1', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.unknown_token_id': '3'}
Using fallback chat format: None
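The "CUDA driver version is insufficient for CUDA runtime version" warnings suggest the installed NVIDIA driver is older than the CUDA 12.4 runtime the wheel was built against, so updating the driver may be needed. A minimal check of whether the installed build can offload to the GPU at all, assuming your version of the low-level bindings exposes llama_supports_gpu_offload():

# Minimal sketch, assuming llama_supports_gpu_offload() is available in your
# llama-cpp-python version; it returns False for a CPU-only build.
import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)
print("GPU offload supported:", llama_cpp.llama_supports_gpu_offload())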
I have the same issue; the solution of removing the arguments as suggested by @mahesh557 did not help me.
I had the same issue, but it was not a bug in my case; I simply had not yet installed the CUDA toolkit.
First, uninstall llama-cpp-python and install the CUDA toolkit from https://developer.nvidia.com/cuda-toolkit.
After restarting the command prompt, you should find CUDA_PATH set:
echo %CUDA_PATH%
If the directory in CUDA_PATH isn't registered via os.add_dll_directory(), CDLL() may refuse to load the dependencies of llama.dll.
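A minimal workaround sketch (the default path below is an example for CUDA 12.4, adjust to your install): register the CUDA bin directory on the DLL search path yourself before importing llama_cpp.

# Workaround sketch: put the CUDA bin directory on the DLL search path before
# importing llama_cpp (the fallback path is an example for CUDA 12.4).
import os

cuda_path = os.environ.get("CUDA_PATH", r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4")
os.add_dll_directory(os.path.join(cuda_path, "bin"))

import llama_cpp  # llama.dll and its CUDA dependencies should now resolve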
Then, install the binary that supports your cuda and python version in the releases section. https://github.com/abetlen/llama-cpp-python/releases
For example, install v0.2.69 in a Python 3.11 / CUDA 12.2 environment:
pip install https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.69-cu122/llama_cpp_python-0.2.69-cp311-cp311-win_amd64.whl
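After that, a quick way to confirm the GPU build is actually used (a sketch; the model path is a placeholder): load a model with all layers offloaded and check the verbose log for CUDA buffers rather than only CPU ones.

# Sketch: model_path is a placeholder; n_gpu_layers=-1 offloads all layers.
from llama_cpp import Llama

llm = Llama(model_path="gemma-2b-it.Q4_K_M.gguf", n_gpu_layers=-1, verbose=True)
# With a working CUDA build, the load log should report CUDA buffer sizes
# (e.g. "CUDA0 buffer size") instead of only "CPU buffer size".
print(llm("Hello", max_tokens=16)["choices"][0]["text"])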
For some reason only b3259 (llama.cpp) was working for me, so I simply checked that out and merged b3259, which introduced the gemma2 architecture that I needed. The .dll files that were generated ended up working in llama-cpp-python, although I had to add os.add_dll_directory(dll_directory) to the llama_cpp.py file to make it work.
Will try it out later today by compiling the new files with Visual Studio instead, as I suspect the problem could lie there. (Although the compiled binaries in the releases also don't work, so maybe not.)
Edit: I think the problem was actually not copying all the necessary DLL files to the correct folder, that is llama_cpp. Edit 2: The above statement is true. I also had to remove the cdll_args["winmode"] line to fix the FileNotFound error. Not sure why.
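For reference, a sketch of that copy step (both paths are placeholders for your own build tree and venv): copy the freshly built DLLs next to the installed llama_cpp package so _load_shared_library finds them.

# Sketch: copy locally built DLLs into the installed llama_cpp package folder.
# The source and destination paths are placeholders, adjust to your layout.
import shutil
from pathlib import Path

build_bin = Path(r"C:\src\llama.cpp\build\bin\Release")  # where the DLLs were built
site_llama_cpp = Path(r"D:\My_Tech\GenAIPlay\venv\Lib\site-packages\llama_cpp")

for name in ("llama.dll", "ggml_shared.dll", "llava.dll"):
    src = build_bin / name
    if src.exists():
        shutil.copy2(src, site_llama_cpp / name)
        print("copied", name)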
I have the same issue, but only for 0.2.77 and above; version 0.2.76 works perfectly fine. I install llama-cpp-python using a pre-built wheel for CPU. Setting any environment variables is not possible, since these are admin-controlled at my company. Did anything change between 0.2.76 and 0.2.77 that could cause this behaviour?
Same problem. Python 3.12.4, CUDA 12.4/12.5. The CUDA_PATH environment variable is set to the CUDA installation. Commenting out cdll_args["winmode"] = ctypes.RTLD_GLOBAL in llama_cpp.py fixed it.