llama-cpp-python
Python bindings for llama.cpp
llava-phi-3-mini uses the Phi-3-instruct chat template. I think it is similar to the current llava-1-5 handler, but with the Phi-3 instruct template instead of the llama 2 one. format: `\nQuestion \n` stop word is for system...
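As a rough illustration of what a Phi-3-instruct-style chat template looks like, here is a minimal sketch. The `format_phi3` helper and the exact role/stop tokens are assumptions for illustration; the template actually used by llava-phi-3-mini may differ in detail.

```python
# Hypothetical sketch of a Phi-3-instruct-style chat template.
# The <|role|> / <|end|> tokens are assumed here; verify against the
# model's own chat template before relying on them.
def format_phi3(messages):
    """Render a list of {"role", "content"} dicts as a Phi-3-style prompt."""
    parts = []
    for m in messages:
        parts.append(f"<|{m['role']}|>\n{m['content']}<|end|>\n")
    parts.append("<|assistant|>\n")  # cue the model to produce its reply
    return "".join(parts)

prompt = format_phi3([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Describe the image."},
])
```

With a template like this, `<|end|>` would serve as the natural stop word when sampling the assistant's turn.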
When I start the llava13b model with the llama-cpp-python server, I notice that GPU memory usage increases a little after each inference, which suggests that the GPU memory is...
# Prerequisites Please answer the following questions for yourself before submitting an issue. - [Yes] I am running the latest code. Development is very rapid so there are no tagged...
`llama-cpp-python` exhibits a severe bottleneck on the main Python thread that is not present in `llama.cpp` itself. Running a server with `llama.cpp` directly using ```sh ./server -ngl 999 -m models/Meta-Llama-3-8B-Instruct.Q8_0.gguf --port 12345...
When running two simultaneous requests, it crashes with a core dump. ``` GGML_ASSERT: /tmp/pip-install-uaaiunx2/llama-cpp-python_d6e61d67fc93418ab936c848aabd7f64/vendor/llama.cpp/ggml.c:4997: ggml_nelements(a) == ne0*ne1*ne2 ``` I'm running version 0.2.63 on Docker with an Nvidia Tesla P40. Here...
Many people use this library on Nvidia Jetson/Orin devices, but there are no prebuilt wheels available for CUDA on ARM architectures. Could support for automated builds of these wheels be added?...
It's very frustrating that many messages, such as model parameters, are written to stderr, where they are very difficult to distinguish from actual errors. I tried to capture stderr but then...
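One reason capturing stderr is tricky here: the log output comes from the native llama.cpp code, which writes to file descriptor 2 directly, so Python-level tools like `contextlib.redirect_stderr` never see it. A minimal sketch of capturing at the descriptor level (the `capture_c_stderr` helper is hypothetical, not part of the library):

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def capture_c_stderr():
    """Temporarily redirect file descriptor 2 (used by native code such as
    llama.cpp) into a temp file. contextlib.redirect_stderr only swaps the
    Python-level sys.stderr object, so it misses C-level writes."""
    saved_fd = os.dup(2)                      # keep the real stderr
    tmp = tempfile.TemporaryFile(mode="w+b")
    os.dup2(tmp.fileno(), 2)                  # point fd 2 at the temp file
    result = {}
    try:
        yield result
    finally:
        os.dup2(saved_fd, 2)                  # restore the real stderr
        os.close(saved_fd)
        tmp.seek(0)
        result["data"] = tmp.read()
        tmp.close()

# Usage: anything written to fd 2 inside the block is captured.
with capture_c_stderr() as captured:
    os.write(2, b"native log line\n")  # stand-in for llama.cpp chatter
```

Note that buffered Python output to `sys.stderr` should be flushed before the block exits, or it may land after the descriptor is restored.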
Hi @abetlen, I checked the parameters of both the `__call__` and `create_completion` methods but did not see a `penalty_alpha` param, which would enable **contrastive search** decoding. Can you update the decoding strategy soon...
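For context on what `penalty_alpha` controls (the name follows Hugging Face transformers' contrastive search option): each candidate token is ranked by model confidence minus a degeneration penalty, the maximum cosine similarity between the candidate's hidden state and those of previous tokens. A toy sketch of that scoring rule, not llama-cpp-python API:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (plain lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def contrastive_score(prob, cand_hidden, prev_hiddens, penalty_alpha=0.6):
    """Contrastive search ranking: (1 - alpha) * confidence minus
    alpha * max similarity to any previous token's hidden state."""
    penalty = max(cosine(cand_hidden, h) for h in prev_hiddens)
    return (1 - penalty_alpha) * prob - penalty_alpha * penalty

# A candidate whose hidden state repeats the context is penalized even
# when the model is confident in it, discouraging degenerate repetition.
repeat_score = contrastive_score(0.9, [1.0, 0.0], [[1.0, 0.0]])
novel_score = contrastive_score(0.5, [0.0, 1.0], [[1.0, 0.0]])
```

At each step the decoder would apply this score to the top-k candidates and pick the argmax, which is what makes the strategy different from plain greedy or top-k sampling.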