llama-cpp-python
Anyone built with vulkan yet?
I updated llama.cpp and compiled with `vulkan=1`, but I get an error about compiling with `-fPIC` enabled.
@Ph0rk0z give it a shot again with v0.2.36; the cmake build was fixed in https://github.com/ggerganov/llama.cpp/pull/5182.
llama-cpp-python worked fine with Vulkan last night (on Linux) when I built it with my PR https://github.com/ggerganov/llama.cpp/pull/5182.
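Roughly how I rebuilt the Python package against it, as a sketch: the `-DLLAMA_VULKAN` cmake option name is taken from llama.cpp's build flags of that version, and the equivalent shell one-liner is `CMAKE_ARGS="-DLLAMA_VULKAN=on" pip install --force-reinstall llama-cpp-python`.

```python
# Reinstall llama-cpp-python with llama.cpp's Vulkan backend enabled.
# Assumes the -DLLAMA_VULKAN cmake option from that llama.cpp version.
import os
import subprocess
import sys

env = dict(os.environ, CMAKE_ARGS="-DLLAMA_VULKAN=on")
subprocess.check_call(
    [sys.executable, "-m", "pip", "install",
     "--force-reinstall", "--no-cache-dir", "llama-cpp-python"],
    env=env,
)
```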
It built, but I have no way to select which GPU to use.
I got it to compile and install, and I can generate text much faster with it than with CLBlast, but sadly the save state doesn't work. I will probably open a new issue about it tomorrow.
I'm not really sure what I'm doing wrong with it personally, but I did manage to get it to compile and run in Docker. I am assuming it is selecting the wrong GPU, but I am not sure how to change it. I tried changing `main_gpu` from 0 to 1 when initializing `Llama`, but got the same result. My laptop has an AMD GPU and an Nvidia GPU, and it needs to run on the Nvidia one.
Any ideas?
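Roughly what I tried, for reference (the model path is a placeholder):

```python
from llama_cpp import Llama

# Attempt to force the second GPU; main_gpu=1 made no difference for me.
llm = Llama(
    model_path="/path/to/model.gguf",  # placeholder
    main_gpu=1,       # device index for the main computation
    n_gpu_layers=-1,  # offload all layers
)
```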
@Josh-XT Have you tried with lower values of `n_ctx`? 32768 is a lot. Try something more reasonable like 1024.
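For example (the model path is a placeholder):

```python
from llama_cpp import Llama

# n_ctx=0 requests the model's full training context, which can be huge;
# a fixed, modest context keeps the KV-cache allocations small.
llm = Llama(model_path="/path/to/model.gguf", n_ctx=1024)
```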
I'll give that a try. I actually just had it set to 0.
@Josh-XT Also, probably related to your GPU selection issue: https://github.com/ggerganov/llama.cpp/issues/5259#issuecomment-1921709786
You can choose the GPU with the `GGML_VULKAN_DEVICE` environment variable; `0` is the first GPU or the CPU, depending on your setup.
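For example, from Python (a sketch; the model path is a placeholder):

```python
import os

# Select the second Vulkan device. llama.cpp reads this when the Vulkan
# backend initializes, so setting it before importing llama_cpp is the
# safe order.
os.environ["GGML_VULKAN_DEVICE"] = "1"

from llama_cpp import Llama  # import after setting the variable

llm = Llama(model_path="/path/to/model.gguf", n_gpu_layers=-1)
```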
I tried it out on two machines, one with accelerated Nvidia drivers, one without.
Vulkan recognizes the proprietary Nvidia driver. Performance is right in between Nvidia's `cuBLAS` (using all GPU layers) and `openBLAS`/`cuBLAS` without GPU support.

Proprietary Nvidia drivers:

- `cuBLAS` with all graphics layers (`-ngl 99`): 33 tokens/sec.
- Proprietary Nvidia `Vulkan` with GPU: 22 tokens/sec.
- Proprietary Nvidia `cuBLAS` without `-ngl 99`: 14 tokens/sec.

Open-source `nouveau` mesa drivers:

- `openBLAS` (CPU-only): 13-14 tokens/sec.
- Proprietary Nvidia `Vulkan` (CPU-only): 12 tokens/sec.
- Plain compile (no options): 7 tokens/sec.
I'll keep using `cuBLAS`, but I'll keep an eye on it. Performance will no doubt improve, since the debugging code and safety checks add significant CPU overhead at this stage of development.
Update: I'm aware there is a third, official "open-source" Nvidia driver on the Nvidia website that is supposed to work with Vulkan. But I'm on mobile, so the download keeps timing out.
I have successfully built llama.cpp with Vulkan, but I can't yet build llama-cpp-python with Vulkan. I always get a core dump when loading the model. With llama.cpp it works fine.
@userbox020 are you using cmake to build llama.cpp?
I'm using the Vulkan compile of llama.cpp.