text-generation-webui
Draft: Add support for llama.cpp
My proof of concept for adding support for llama.cpp. It requires my experimental Python bindings (v0.1.8 and up). This has no dependencies (with some caveats mentioned below).
This is what is in `models/llamacpp-7B` right now:
```
> ls -al models/llamacpp-7B/
total 70746800
drwxr-xr-x@ 9 thomas staff         288 Mar 19 17:59 .
drwxr-xr-x@ 9 thomas staff         288 Mar 10 22:04 ..
-rw-r--r--@ 1 thomas staff         100 Mar 10 22:04 checklist.chk
-rw-r--r--  1 thomas staff         118 Mar 19 18:00 config.json
-rw-r--r--@ 1 thomas staff 13476939516 Mar 10 22:35 consolidated.00.pth
-rw-r--r--  1 thomas staff 13477682665 Mar 11 13:40 ggml-model-f16.bin
-rw-r--r--  1 thomas staff  4212727273 Mar 12 18:39 ggml-model-q4_0.bin
-rw-r--r--  1 thomas staff  5054995945 Mar 12 19:22 ggml-model-q4_1.bin
-rw-r--r--@ 1 thomas staff         101 Mar 10 22:03 params.json
```
- Only the `ggml-model-q4_0.bin` file is required right now (and it is hardcoded). The bigger models should also work as long as the folder names start with `llamacpp-`. I am working on adding support for the `alpaca` model right now. It already works in the C++ version and in my Python scripts, so it is only a matter of adapting it for textui.
- The model files can be created from the PyTorch model using the `llamacpp-convert` and `llamacpp-quantize` commands that are installed along with the `llamacpp` package. Using these commands requires that `torch` and `sentencepiece` be installed as well.
- There is currently no option to update parameters like top_p, top_k, etc. other than hardcoding them in `llamacpp_model.py`; a rough sketch of what that looks like follows this list. This is on my todo list of things to fix.
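For readers unfamiliar with the setup, here is a minimal sketch of the kind of wrapper the last bullet describes. Only `llamacpp.LlamaInference(params)` is confirmed elsewhere in this thread; the `InferenceParams` constructor and its field names are assumptions for illustration, not the verified contents of `llamacpp_model.py` or the v0.1.x bindings.

```python
# Hypothetical sketch of a llamacpp_model.py-style wrapper.
# llamacpp.LlamaInference(params) appears later in this thread; the
# InferenceParams name and its fields are assumptions, not verified API.
import llamacpp

def load_model(model_path: str):
    params = llamacpp.InferenceParams()   # assumed constructor
    params.path_model = model_path        # assumed field name
    # Sampling parameters are fixed at load time in this proof of
    # concept; changing them means editing these lines and reloading.
    params.top_k = 40
    params.top_p = 0.95
    params.temp = 0.8
    params.repeat_penalty = 1.3
    return llamacpp.LlamaInference(params)

model = load_model("models/llamacpp-7B/ggml-model-q4_0.bin")
```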
This is really promising.
I see that llama.cpp has added a C-style API, exciting stuff!
Yea. My bindings were based on my own C++ API (#77, which is now closed). Georgi decided that it was too much C++ and wanted a C-style API. I might migrate my Python bindings to the new API once it is merged in.
I hope that isn't a setback to your work. I appreciate all the time you are putting into this project!
There's also this project that might be useful: https://github.com/PotatoSpudowski/fastLLaMa
Will it support https://github.com/AlpinDale/pygmalion.cpp?
> I hope that isn't a setback to your work. I appreciate all the time you are putting into this project!
His new API is quite a bit cleaner than my previous work, which was sort of put together quickly. However, the new one is a bit more minimal, so I need some additional work around it to get it to where it was before. I am currently blocked on a segfault that I can hopefully get to later today or tomorrow.
@thomasantony Will llama.cpp be placed in the "repositories" folder, similar to "GPTQ-for-LLaMa"? If so, that's great, as updating the web UI will also update the llama.cpp repository.
I like the code so far and appreciate that it adheres to the style/structure of the project.
@thomasantony I have made some changes that made this functional for me. The main parameters are all used: temperature, top_k, top_p, and repetition_penalty.
These were the steps to get it working:
- Install version 0.1.10 of `llamacpp`: `pip install llamacpp==0.1.10`
- Create the folder `models/llamacpp-7b`
- Put this file in it: ggml-model-q4_0.bin
- Start the web UI with `python server.py --model llamacpp-7b`
After that it worked.
Thanks for the changes. I just released v0.1.11 - this includes the new memory-mapped I/O feature and requires updating the weight files. But it makes loading the models a whole lot faster (and may allow running models bigger than your RAM, but I have not tried it yet and may be wrong about this).

The API should be consistent and work with textui without any changes.
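For context on why memory-mapped I/O makes loading faster: the OS maps the file into the process's address space and pages data in on demand, instead of copying the whole file into RAM up front. A self-contained Python illustration of the mechanism (not llama.cpp code):

```python
# Minimal illustration of memory-mapped file access; this is not
# llama.cpp code, just the general mechanism. Opening a multi-gigabyte
# file this way is nearly instant because no data is copied up front,
# and only the pages actually touched consume physical RAM.
import mmap

with open("models/llamacpp-7B/ggml-model-q4_0.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print(mm[:4])  # touching the first bytes faults in only one page
    mm.close()
```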
Is it possible to find the new weights on Hugging Face somewhere?
You can use the updated `llamacpp-convert` script with the original LLaMA weights (PyTorch format) to generate the new ggml weights. Another option is to use the "migrate" script from https://github.com/ggerganov/llama.cpp, which can convert existing ggml weights into the new format.
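If you are unsure whether a given weight file has already been migrated, one way to tell is to inspect the magic bytes at the start of the file. The magic values below are the ones commonly documented for the ggml-era formats; treat this as a heuristic sketch rather than an authoritative format check.

```python
# Heuristic check of a weight file's format via its leading magic bytes.
# The magic values are the commonly documented ggml-era ones and may
# not cover every revision of the format.
import struct
import sys

MAGICS = {
    0x67676D6C: "ggml (unversioned, oldest format)",
    0x67676D66: "ggmf (versioned, pre-mmap)",
    0x67676A74: "ggjt (mmap-capable, new format)",
}

with open(sys.argv[1], "rb") as f:
    magic, = struct.unpack("<I", f.read(4))
print(MAGICS.get(magic, f"unknown magic 0x{magic:08x}"))
```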
Would this support using and interacting with Alpaca and LLaMA models of all sizes?
@thomasantony I did the conversion from the base LLaMA files and that worked.
This was the performance of llama-7b int4 on my i5-12400F:
Output generated in 44.10 seconds (4.53 tokens/s, 200 tokens)
Well, feel free to merge it! I am glad that I was able to contribute. :)
Thank you so much for this brilliant PR, @thomasantony!
The new documentation is here: https://github.com/oobabooga/text-generation-webui/wiki/llama.cpp-models
@thomasantony I have just noticed that the parameters are not really being used. Assigning to the `params` variable here doesn't change the parameters inside the model: https://github.com/oobabooga/text-generation-webui/blob/main/modules/llamacpp_model.py#L43

Is the only way to change the model parameters to reload it from scratch like this?

```python
_model = llamacpp.LlamaInference(params)
```
@oobabooga That is a side effect of how the underlying Python bindings work right now. Adding support for changing those parameters when sampling from the logits is on my to-do list. Right now, it is only possible if you use the LlamaContext class (which is more low-level) instead of the higher-level LlamaInference, which currently does not allow changing the parameters post-initialization. This is probably the next thing I will update in the library. I will post back here once that is done, or make a separate PR with the changes.
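To make the two levels concrete, the sketch below is plain Python that stands in for the design being described; it is not the llamacpp API. The point is that when evaluation and sampling are separate calls, the caller can pass different sampling parameters on every step, which a high-level object that fixes them at construction cannot do.

```python
# Plain-Python stand-in for the low-level design described above; this
# is NOT the llamacpp API. It only shows why separating eval() from
# sample() allows per-step parameter changes.
import random

class ContextSketch:
    def eval(self, tokens):
        # A real context would run the transformer forward pass here.
        self.logits = [random.random() for _ in range(32000)]

    def sample(self, top_k=40, top_p=0.95, temperature=0.8):
        # Parameters arrive per call, so they can change every step.
        # (For brevity this toy only honors top_k.)
        best = sorted(range(len(self.logits)),
                      key=lambda i: self.logits[i], reverse=True)
        return random.choice(best[:top_k])

ctx = ContextSketch()
tokens = [1]  # pretend BOS token
for step in range(4):
    ctx.eval(tokens)
    # e.g. raise the temperature halfway through generation
    tokens = [ctx.sample(temperature=0.8 if step < 2 else 1.2)]
```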
Is it possible to use VRAM and RAM like this? https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model
@niizam 4-bit quantized models are already supported. You just need to use the appropriate weight files.
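As a rough sanity check on what an appropriate 4-bit weight file looks like size-wise, the arithmetic below estimates the q4_0 file size from the f16 file in the directory listing near the top of this thread. The q4_0 block layout used here (32 weights per block, each block stored as one fp32 scale plus 16 bytes of packed 4-bit values) is an assumption about the format of that era.

```python
# Back-of-the-envelope size check, assuming the original q4_0 layout:
# blocks of 32 weights, each stored as a 4-byte fp32 scale plus
# 16 bytes of packed 4-bit values (20 bytes per 32 weights).
f16_bytes = 13_477_682_665      # ggml-model-f16.bin from the listing
n_params = f16_bytes / 2        # fp16 = 2 bytes/weight (ignoring headers)
q4_bytes = n_params / 32 * (4 + 16)
print(f"estimated q4_0 size: {q4_bytes / 1e9:.2f} GB")
# Prints ~4.21 GB, closely matching the 4,212,727,273-byte
# ggml-model-q4_0.bin in the listing.
```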