Add Support for IBM Granite
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [✅] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [✅] I carefully followed the README.md.
- [✅] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [✅] I reviewed the Discussions, and have a new bug or useful enhancement to share.
Feature Description
IBM recently released their Granite models: a series of code models ranging from 3B to 34B parameters, with base and instruct finetunes.
https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330 https://github.com/ibm-granite
Many thanks to the llama.cpp community for their awesome work! It would be great to see this feature added. GGUFs can already be created, but when you try to load them you get a tokenizer error.
The PR that adds Granite support to transformers (adding MLP bias to the gate, up and down projections) can be found here: https://github.com/huggingface/transformers/pull/30031/files
Based on the discussion in the transformers mlp_bias PR, it's similar to Llama with just the mlp_bias added.
I tried to do this here: https://github.com/sroecker/llama.cpp/tree/add_mlp_bias, just adding bias tensors to FFN_GATE, FFN_DOWN and FFN_UP. The tensor shapes seem to be correct, but the model outputs gibberish.
./main -m ~/Downloads/granite-3b-code-base.Q8_0.gguf -p "Question: Python code to calculate the Fibonacci series\n\nAnswer:\n"
with the GGUF from https://huggingface.co/NikolayKozloff/granite-3b-code-instruct-Q8_0-GGUF
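To check whether the bias tensors actually made it into a converted GGUF, the file can be inspected with the gguf Python package. This is only a sketch: the file path is hypothetical and the tensor names assume the usual blk.N.ffn_{gate,up,down}.bias naming convention.

```python
# Sketch: list any FFN bias tensors in a converted GGUF (pip install gguf).
# The path is hypothetical; tensor names assume the standard GGUF scheme.
from gguf import GGUFReader

reader = GGUFReader("granite-3b-code-base.Q8_0.gguf")
for t in reader.tensors:
    if t.name.endswith((".ffn_gate.bias", ".ffn_up.bias", ".ffn_down.bias")):
        print(t.name, list(t.shape))
```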
@sroecker are you tying the word embeddings? Unlike Llama, the input word embeddings and the output projection matrix are tied for Granite models.
Ah, not yet. Thanks! I guess then we need to define an additional ARCH (or save the mlp_bias boolean in the GGUF) and implement it like with MPT https://github.com/ggerganov/llama.cpp/blob/7e0b6a7b3ba94ff624dc27c1e0e735fded8819b8/llama.cpp#L5287
(or save the mlp_bias boolean in the GGUF)
Is there a way to add mlp_bias to an already-made GGUF? I ask because you mentioned my Q8 GGUF in one of your previous messages.
You could hack something with gguf writer https://pypi.org/project/gguf/
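For reference, a rough, untested sketch of what that hack could look like with the gguf package. GGUF files are not edited in place, so the idea is to read the old file and emit a new one with the extra key; the key name llama.mlp_bias, the file paths, and the omitted copy step are all assumptions here, not an established convention.

```python
# Rough sketch (untested): emit a new GGUF that carries an extra boolean KV.
# Copying the original metadata and tensor data is left out here; the helper
# scripts under llama.cpp's gguf-py directory are a safer starting point for that part.
from gguf import GGUFReader, GGUFWriter

reader = GGUFReader("granite-3b-code-base.Q8_0.gguf")            # hypothetical input
print(f"source file: {len(reader.tensors)} tensors, {len(reader.fields)} KV pairs")

writer = GGUFWriter("granite-3b-code-base.patched.Q8_0.gguf", "llama")
writer.add_bool("llama.mlp_bias", True)   # hypothetical flag a patched loader could check
# ... copy reader.fields and reader.tensors into the writer here ...

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```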
So I've adapted build_llama to include the MLP biases as well. I've added a few FIXMEs to my branch to indicate places that might need to be adapted for the different Granite models.
Now the output is valid text but unfortunately repeats itself:
<|endoftext|>Question: Fibonacci series in Python? \n\nAnswer: Python? \n\n series in Python? \n\n series in Python? \n\n series in Python? \
I'm here to write words of support; I am interested in exploring what IBM + Ollama can do.
+1 to this
Is there any progress on support for the Granite models?
AFAIK, we have been stuck on the issue of repeating text output. It appears that the tokenizer is the culprit, but it does seem to be in order (correct token ids, etc.). I don't know if @sroecker has made any strides since.
Yes, unfortunately. The lab version of Granite works well with llama.cpp: https://huggingface.co/instructlab/granite-7b-lab-GGUF. It doesn't have the MLP bias nodes and uses a different tokenizer, though.
I've tried a few things regarding tokenization. I checked that the tokenizer creates the same input tokens with ./main -m granite-3b-code-base.gguf -p "def generate():" -ngl 0 --override-kv tokenizer.ggml.add_bos_token=bool:false:
0 -> '<|endoftext|>'  589 -> 'def'  4450 -> ' generate'  2262 -> '():'
I recreated the f16 GGUF forcing the pre-tokenizer to be llama-bpe instead of refact. No luck so far. There's a lot of ARCH-specific code all over llama.cpp which might change important parameters, so I'm thinking about creating a simple debugging example based on examples/simple/simple.cpp.
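One way to cross-check the token ids above is to compare them against the reference tokenizer from the HF checkpoint. A small sketch; the repo id ibm-granite/granite-3b-code-base is an assumption here.

```python
# Sketch: print the token ids the HF tokenizer produces for the same prompt,
# to compare with the 589 / 4450 / 2262 shown by ./main above.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("ibm-granite/granite-3b-code-base")
ids = tok("def generate():")["input_ids"]
print(ids)
print(tok.convert_ids_to_tokens(ids))
```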
The lab version is a different model, not to be confused with this one.
I'm aware of that; it did work out of the box with the LLM_ARCH_LLAMA settings though, so I'm trying to find out why exactly. But you're right to point this out, a few people have mixed these up.
I will check the convert-hf-to-gguf-update.py script again to rule out the tokenizer before I start digging deeper.
Hmm, a quick question: are we tying the word embeddings and the output logits matrix? Llama doesn't do that and Granite has tied embeddings; maybe that's the issue? I don't think the tokenizer should be the issue, since all Granite models use the StarCoder tokenizer.
If no output layer is found the word embeddings are used instead: https://github.com/ggerganov/llama.cpp/blob/541600201e6480f54ae09e58d16b154d4b4b331d/llama.cpp#L4926-L4932
Hmm, OK, so there are these differences between Llama and Granite:
1) attention has bias (Llama doesn't)
2) MLP has bias (Llama doesn't)
3) tied word embeddings (Llama doesn't tie them)
4) StarCoder tokenizer
Do all Granite code models use the StarCoder tokenizer? Based on your HF repo comment I tried to get the 20b and 34b to run. They are recognized as Starcoder arch by the convert-hf-to-gguf script, and all I had to modify was to tie the embedding weights. 20b instruct works quite well, even with the BOS token. The Q3_K_L quant comes down to 11 GB. Please have a try with these changes: https://github.com/sroecker/llama.cpp/commit/6f201480de46aba0d5f718a2a8bdf424bd8e8274
For the 3b and 8b models, 1) and 4) remain. We have to check if the attention bias is set up correctly in llm_build_kv; build_refact should be good for comparison.
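As a side note on point 3), the tied-embedding property can be confirmed directly from the HF checkpoint before converting; this is why no separate output tensor needs to land in the GGUF (llama.cpp then falls back to reusing token_embd, per the loader snippet linked earlier). A small sketch; the repo id is an assumption, and the second check downloads the full model weights.

```python
# Sketch: confirm the HF checkpoint ties input embeddings and the output projection.
from transformers import AutoConfig, AutoModelForCausalLM

repo = "ibm-granite/granite-3b-code-base"  # assumed repo id
print("tie_word_embeddings:", AutoConfig.from_pretrained(repo).tie_word_embeddings)

model = AutoModelForCausalLM.from_pretrained(repo)
tied = model.get_input_embeddings().weight.data_ptr() == model.get_output_embeddings().weight.data_ptr()
print("embeddings and lm_head share storage:", tied)
```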
Yeah, all of them use the StarCoder tokenizer.
In case it helps with the 8b-instruct model: after converting with
python3 llama.cpp/convert.py granite-8b-ins --outfile granite-8b-ins/granite-8b-instruct.bin --outtype q8_0 --vocab-type bpe --pad-vocab
I got this error when starting ./llama.cpp/main -m ./granite-8b-ins/granite-8b-instruct.bin:
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 578, got 470
The same numbers show up when using convert-hf-to-gguf.py. During the conversion it shows 578; the last two lines are:
INFO:convert:[577/578] Writing tensor blk.35.attn_v.weight | size 1024 x 4096 | type Q8_0 | T+ 84
INFO:convert:[578/578] Writing tensor output_norm.weight | size 4096 | type F32 | T+ 84
I tried q8_0, f16 and f32, same error.
Thank you for this great work!
I don't think the 3b and 8b are working yet, @DigitLib. The 34b and 20b PR is merged and it's working: https://github.com/ggerganov/llama.cpp/pull/7324
The 20b-base GGUF is available now: https://huggingface.co/ibm-granite/granite-20b-code-base-GGUF. I will add the instruct and 34b tomorrow.
@mayank31398 I know, I just wanted to help with the 8b-instruct. Thank you!
@DigitLib you need https://github.com/sroecker/llama.cpp/commit/36dc5bbffe083545045ec2441ddc7f5c085d3caf to load the smaller models
If that commit is working, can we open a PR, @sroecker?
That doesn't seem to be enough. The model loads, but it doesn't produce any good results: https://github.com/ggerganov/llama.cpp/issues/7116#issuecomment-2100061526
https://huggingface.co/coder543/granite-20b-code-instruct-GGUF/tree/main
I've uploaded the q8_0, q6_K, and q4_0 gguf files for the 20B Instruct model here. I've only lightly tested them, and this is my first time quantizing any LLMs, but it seemed like they were working okay?
If anyone wants to test them, I'm curious if they work for you.
The chat template seems to be something like this:
Question:
Write a React TypeScript component
Answer:
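If that template is right, the prompt string passed to ./main could be assembled like this. This is only a guess based on the formatting above, not an official chat template, and the exact blank lines may differ.

```python
# Sketch: build a prompt in the Question/Answer format shown above.
def granite_prompt(question: str) -> str:
    return f"Question:\n{question}\n\nAnswer:\n"

print(granite_prompt("Write a React TypeScript component"))
```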
I've managed to get some output that makes some sense with the 3b model, and I've opened a PR:
- https://github.com/ggerganov/llama.cpp/pull/7481
IMHO it makes sense to define a new architecture for Granite, as there are substantial differences from the base Llama model. To convert the HF model using the code in my PR, I modified the config.json file in the Granite model and used:
"architectures": [
"GraniteForCausalLM"
],
@mayank31398 what do you think?
@giuseppe did you get the 3b GGUF working with #7481? If you teach me how, I can get it locally on my M1 and run some tests too :)
To reproduce locally you can run the following:
- Clone down Giuseppe's branch, pip install the necessary packages (e.g. torch, transformers, numpy, sentencepiece), and build llama.cpp (i.e. run make)
- Download the 3B or 8B model from HF
- Modify the config.json per @giuseppe's comment above (i.e. LlamaForCausalLM -> GraniteForCausalLM; a small scripted version follows after this list)
- Convert to GGUF (e.g. ./convert-hf-to-gguf.py path-to-granite-model --outtype q8_0 --outfile path-to-converted-model/converted-model-name.gguf)
- Run inference against the GGUF model (e.g. ./main -m path-to-converted-model/converted-model-name.gguf -p "Write a simple hello world script in python.")
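As mentioned above, the config.json edit from the third step can be scripted; a small sketch, with the model directory used only as a placeholder.

```python
# Sketch: switch the reported architecture so the converter picks the Granite
# code path from the PR. The model directory below is a placeholder.
import json
from pathlib import Path

cfg_path = Path("path-to-granite-model/config.json")
cfg = json.loads(cfg_path.read_text())
cfg["architectures"] = ["GraniteForCausalLM"]
cfg_path.write_text(json.dumps(cfg, indent=2) + "\n")
```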
Inference output should be something like the following (ignoring logging output for brevity):
print("Hello World")