text-generation-webui
Draft: Add support for llama.cpp
My proof of concept for adding support for llama.cpp. It requires my experimental Python bindings (v0.1.8 and up). This has no dependencies (with some caveats mentioned below).
This is what is in `models/llamacpp-7B` right now:
```
> ls -al models/llamacpp-7B/
total 70746800
drwxr-xr-x@ 9 thomas staff         288 Mar 19 17:59 .
drwxr-xr-x@ 9 thomas staff         288 Mar 10 22:04 ..
-rw-r--r--@ 1 thomas staff         100 Mar 10 22:04 checklist.chk
-rw-r--r--  1 thomas staff         118 Mar 19 18:00 config.json
-rw-r--r--@ 1 thomas staff 13476939516 Mar 10 22:35 consolidated.00.pth
-rw-r--r--  1 thomas staff 13477682665 Mar 11 13:40 ggml-model-f16.bin
-rw-r--r--  1 thomas staff  4212727273 Mar 12 18:39 ggml-model-q4_0.bin
-rw-r--r--  1 thomas staff  5054995945 Mar 12 19:22 ggml-model-q4_1.bin
-rw-r--r--@ 1 thomas staff         101 Mar 10 22:03 params.json
```
- Only the `ggml-model-q4_0.bin` file is required right now (and it is hardcoded). The bigger models should also work as long as the folder names start with `llamacpp-`. I am working on adding support for the `alpaca` model right now. It already works in the C++ version and in my Python scripts, so it is only a matter of adapting it for textui.
- The model files can be created from the PyTorch model using the `llamacpp-convert` and `llamacpp-quantize` commands that are installed along with the `llamacpp` package. Using these commands requires that `torch` and `sentencepiece` be installed as well.
- There is currently no option to update parameters like top_p, top_k, etc. other than hardcoding them in `llamacpp_model.py`; a rough sketch of what that looks like follows this list. This is on my todo list of things to fix.
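For readers unfamiliar with the setup, here is a minimal sketch of the kind of wrapper the last bullet describes. Only `llamacpp.LlamaInference(params)` is confirmed elsewhere in this thread; the `InferenceParams` constructor and its field names are assumptions for illustration, not the verified contents of `llamacpp_model.py` or the v0.1.x bindings.

```python
# Hypothetical sketch of a llamacpp_model.py-style wrapper.
# llamacpp.LlamaInference(params) appears later in this thread; the
# InferenceParams name and its fields are assumptions, not verified API.
import llamacpp

def load_model(model_path: str):
    params = llamacpp.InferenceParams()   # assumed constructor
    params.path_model = model_path        # assumed field name
    # Sampling parameters are fixed at load time in this proof of
    # concept; changing them means editing these lines and reloading.
    params.top_k = 40
    params.top_p = 0.95
    params.temp = 0.8
    params.repeat_penalty = 1.3
    return llamacpp.LlamaInference(params)

model = load_model("models/llamacpp-7B/ggml-model-q4_0.bin")
```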
This is really promising.
I see that llama.cpp has added a C-style API, exciting stuff!
Yea. My bindings were based on my own C++ API (#77, which is now closed). Georgi decided that it was too much C++ and wanted a C-style API. I might migrate my Python bindings to the new API once it is merged in.
I hope that isn't a setback to your work. I appreciate all the time you are putting into this project!
There's also this project that might be useful: https://github.com/PotatoSpudowski/fastLLaMa
Will it support https://github.com/AlpinDale/pygmalion.cpp?
> I hope that isn't a setback to your work. I appreciate all the time you are putting into this project!
His new API is quite a bit cleaner than my previous work, which was sort of put together quickly. However, the new one is a bit more minimal, so I need some additional work around it to get it to where it was before. I am currently blocked on a segfault that I can hopefully get to later today or tomorrow.
@thomasantony Will llama.cpp be placed in the "repositories" folder, similar to "GPTQ-for-LLaMa"? If so, that's great, as updating the web UI will also update the llama.cpp repository.
I like the code so far and appreciate that it adheres to the style/structure of the project.
@thomasantony I have made some changes that made this functional for me. The main parameters are all used: temperature, top_k, top_p, and repetition_penalty.
These were the steps to get it working:
- Install version 0.1.10 of `llamacpp`: `pip install llamacpp==0.1.10`
- Create the folder `models/llamacpp-7b`
- Put this file in it: ggml-model-q4_0.bin
- Start the web UI with `python server.py --model llamacpp-7b`
After that it worked.
Thanks for the changes. I just released v0.1.11 - this includes the new memory-mapped I/O feature and requires updating the weight files. But it makes loading the models a whole lot faster (and may allow running models bigger than your RAM, but I have not tried it yet and may be wrong about this).

The API should be consistent and work with textui without any changes.
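For context on why memory-mapped I/O makes loading faster: the OS maps the file into the process's address space and pages data in on demand, instead of copying the whole file into RAM up front. A self-contained Python illustration of the mechanism (not llama.cpp code):

```python
# Minimal illustration of memory-mapped file access; this is not
# llama.cpp code, just the general mechanism. Opening a multi-gigabyte
# file this way is nearly instant because no data is copied up front,
# and only the pages actually touched consume physical RAM.
import mmap

with open("models/llamacpp-7B/ggml-model-q4_0.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print(mm[:4])  # touching the first bytes faults in only one page
    mm.close()
```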
Is it possible to find the new weights on Hugging Face somewhere?
You can use the updated `llamacpp-convert` script with the original LLaMA weights (PyTorch format) to generate the new ggml weights. Another option is to use the "migrate" script from https://github.com/ggerganov/llama.cpp, which can convert existing ggml weights into the new format.
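If you are unsure whether a given weight file has already been migrated, one way to tell is to inspect the magic bytes at the start of the file. The magic values below are the ones commonly documented for the ggml-era formats; treat this as a heuristic sketch rather than an authoritative format check.

```python
# Heuristic check of a weight file's format via its leading magic bytes.
# The magic values are the commonly documented ggml-era ones and may
# not cover every revision of the format.
import struct
import sys

MAGICS = {
    0x67676D6C: "ggml (unversioned, oldest format)",
    0x67676D66: "ggmf (versioned, pre-mmap)",
    0x67676A74: "ggjt (mmap-capable, new format)",
}

with open(sys.argv[1], "rb") as f:
    magic, = struct.unpack("<I", f.read(4))
print(MAGICS.get(magic, f"unknown magic 0x{magic:08x}"))
```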
Would this support using and interacting with Alpaca and LLaMA models of all sizes?
@thomasantony I did the conversion from the base LLaMA files and that worked.
This was the performance of llama-7b int4 on my i5-12400F:
Output generated in 44.10 seconds (4.53 tokens/s, 200 tokens)
Well, feel free to merge it! I am glad that I was able to contribute. :)
Thank you so much for this brilliant PR, @thomasantony!
The new documentation is here: https://github.com/oobabooga/text-generation-webui/wiki/llama.cpp-models
@thomasantony I have just noticed that the parameters are not really being used. Assigning to the `params` variable here doesn't change the parameters inside the model: https://github.com/oobabooga/text-generation-webui/blob/main/modules/llamacpp_model.py#L43

Is the only way to change the model parameters to reload it from scratch like this?

```python
_model = llamacpp.LlamaInference(params)
```
@oobabooga That is a side effect of how the underlying Python bindings work right now. Adding support for changing those parameters when sampling from the logits is on my to-do list. Right now, it is only possible if you use the LlamaContext class (which is more low-level) instead of the higher-level LlamaInference, which currently does not allow changing the parameters post-initialization. This is probably the next thing I will update in the library. I will post back here once that is done, or make a separate PR with the changes.
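To make the two levels concrete, the sketch below is plain Python that stands in for the design being described; it is not the llamacpp API. The point is that when evaluation and sampling are separate calls, the caller can pass different sampling parameters on every step, which a high-level object that fixes them at construction cannot do.

```python
# Plain-Python stand-in for the low-level design described above; this
# is NOT the llamacpp API. It only shows why separating eval() from
# sample() allows per-step parameter changes.
import random

class ContextSketch:
    def eval(self, tokens):
        # A real context would run the transformer forward pass here.
        self.logits = [random.random() for _ in range(32000)]

    def sample(self, top_k=40, top_p=0.95, temperature=0.8):
        # Parameters arrive per call, so they can change every step.
        # (For brevity this toy only honors top_k.)
        best = sorted(range(len(self.logits)),
                      key=lambda i: self.logits[i], reverse=True)
        return random.choice(best[:top_k])

ctx = ContextSketch()
tokens = [1]  # pretend BOS token
for step in range(4):
    ctx.eval(tokens)
    # e.g. raise the temperature halfway through generation
    tokens = [ctx.sample(temperature=0.8 if step < 2 else 1.2)]
```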
Is it possible to use VRAM and RAM like this? https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model
@niizam 4-bit quantized models are already supported. You just need to use the appropriate weight files.
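As a rough sanity check on what an appropriate 4-bit weight file looks like size-wise, the arithmetic below estimates the q4_0 file size from the f16 file in the directory listing near the top of this thread. The q4_0 block layout used here (32 weights per block, each block stored as one fp32 scale plus 16 bytes of packed 4-bit values) is an assumption about the format of that era.

```python
# Back-of-the-envelope size check, assuming the original q4_0 layout:
# blocks of 32 weights, each stored as a 4-byte fp32 scale plus
# 16 bytes of packed 4-bit values (20 bytes per 32 weights).
f16_bytes = 13_477_682_665      # ggml-model-f16.bin from the listing
n_params = f16_bytes / 2        # fp16 = 2 bytes/weight (ignoring headers)
q4_bytes = n_params / 32 * (4 + 16)
print(f"estimated q4_0 size: {q4_bytes / 1e9:.2f} GB")
# Prints ~4.21 GB, closely matching the 4,212,727,273-byte
# ggml-model-q4_0.bin in the listing.
```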