llama-cpp-python
Huge difference in performance between llama.cpp and llama-cpp-python
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.
I'm running a bot on Libera IRC, and the difference between llama.cpp's response time and llama-cpp-python's is pretty huge when maxing out the context length.
This is how I run llama.cpp, which with the latest update results in a response time of 3 seconds for my bot:
./server -t 8 -a llama-3-8b-instruct -m ./Meta-Llama-3-8B-Instruct-Q6_K.gguf -c 8192 -ngl 100 --timeout 10
This is how I run llama-cpp-python, which results in a response time of 18 seconds for my bot:
python3 -m llama_cpp.server --model ./Meta-Llama-3-8B-Instruct-Q6_K.gguf --n_threads 8 --n_gpu_layers -1 --n_ctx 8192
Am I doing something wrong, or is this normal?
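For anyone trying to reproduce this, here is a rough timing sketch I'd use to hit both servers with the same prompt. The ports (8080 for llama.cpp's ./server, 8000 for llama_cpp.server), the model alias, and the use of the requests library are assumptions based on the defaults, so adjust to your setup.

```python
# Rough timing comparison against both servers with the same prompt.
# Assumed defaults: llama.cpp ./server on port 8080, llama_cpp.server on port 8000;
# both expose an OpenAI-compatible /v1/chat/completions endpoint.
import time
import requests

PROMPT = "Write a short greeting for an IRC channel."

def time_chat(base_url: str, model: str) -> float:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": PROMPT}],
        "max_tokens": 256,
    }
    start = time.perf_counter()
    r = requests.post(f"{base_url}/v1/chat/completions", json=payload, timeout=120)
    r.raise_for_status()
    return time.perf_counter() - start

print("llama.cpp server :", time_chat("http://localhost:8080", "llama-3-8b-instruct"))
print("llama-cpp-python :", time_chat("http://localhost:8000", "llama-3-8b-instruct"))
```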
Environment and Context
I experienced that behaviour on Linux and Windows, whether self-compiled or using the pre-compiled wheels.
- Physical (or virtual) hardware: CPU: 13th Gen Intel(R) Core(TM) i5-13600K; GPU: VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090]
- Operating System (the Linux box I'm on right now): Linux b6.8.8-300.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Sat Apr 27 17:53:31 UTC 2024 x86_64 GNU/Linux
- SDK version, e.g. for Linux:
$ python3 --version = Python 3.11.9
$ make --version = GNU Make 4.4.1
$ g++ --version = g++ (GCC) 14.0.1 20240411 (Red Hat 14.0.1-0)
nvcc uses gcc 13: g++-13 (Homebrew GCC 13.2.0) 13.2.0
export NVCC_PREPEND_FLAGS='-ccbin /home/linuxbrew/.linuxbrew/bin/g++-13'
I can also now confirm. I have been using this repo extensively since its inception. Really awesome, I appreciate @abetlen and all the others for making this software. I was looking at all the tickets mentioning this speed inconsistency between native llama.cpp and llama-cpp-python. I tried loading the Meta Llama 3 8B variant in both programs with the same init settings. Unfortunately, llama.cpp's speed advantage is incredibly noticeable.
I can help debug in any way possible, just let me know what would be good information to relay to the repo contributors. I am using a 3060 GPU on Windows 10, and both variants (llama.cpp, llama-cpp-python) were GPU-enabled with maximum GPU offloading.
I can also supply a database with test data for better reproduction if there is any need for it. The slowdown grows with the context length; that's my observation so far.
I can confirm this. Even without maxing out the context length, the performance difference is noticeable.
Hi, I've probably been struggling with this for the last day too.
I did find that setting the logits_all parameter to false (it's true by default) appeared to increase the tokens/second from about 8 to about 23 on a machine I have that is stuffed with old NVIDIA gaming cards. 23 tokens/second is what I was getting running llama.cpp inference directly.
I have no idea what logits are as I am a bit new to this. But, at least it's something to try out.
The logits_all parameter is a model setting in my OpenAI-like server configuration file. No doubt there is also a command line option for it too.
If these mysterious logits do turn out to be necessary for something, then I guess I will add another almost identical model in my configuration file with them turned on.
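If it helps anyone else, this is roughly what that setting looks like in a multi-model config file for the llama-cpp-python server (started with --config_file). The paths and the alias are placeholders, and the field names follow the ModelSettings documented for the server, so double-check against your version.

```python
# Sketch of a config file for `python3 -m llama_cpp.server --config_file config.json`.
# Paths and the alias are placeholders; field names follow the server's ModelSettings.
import json

config = {
    "models": [
        {
            "model": "./Meta-Llama-3-8B-Instruct-Q6_K.gguf",
            "model_alias": "llama-3-8b-instruct",
            "n_ctx": 8192,
            "n_gpu_layers": -1,
            "n_threads": 8,
            "logits_all": False,  # the setting that made the difference for me
        }
    ]
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```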
> I have no idea what logits are as I am a bit new to this. But, at least it's something to try out.
Is this option turned on by default? It shouldn't be, because for inference we only need the logits of the last token.
It is on by default according to this page:
https://llama-cpp-python.readthedocs.io/en/latest/server/#llama_cpp.server.settings.ModelSettings
And that's what my experiment confirmed.
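For reference, the same knob exists when using the Python API directly instead of the server; a minimal sketch, with the model path as a placeholder:

```python
from llama_cpp import Llama

# Only the last token's logits are needed for plain inference, so keep
# logits_all disabled unless you actually need per-token logprobs.
llm = Llama(
    model_path="./Meta-Llama-3-8B-Instruct-Q6_K.gguf",
    n_ctx=8192,
    n_gpu_layers=-1,
    logits_all=False,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=32,
)
print(out["choices"][0]["message"]["content"])
```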
Thank you.
Thank you! Now it all makes sense.
Well, I just want to report that I returned after some time away to play around with my bot again, and it now responds in 4-5 seconds with a completely filled context. I actually can't tell how or why it got fixed, but it seems fixed for me using the same config as before. I'm closing this as fixed now.