
Add Support for IBM Granite

Open YorkieDev opened this issue 9 months ago • 18 comments

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [✅] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [✅] I carefully followed the README.md.
  • [✅] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [✅] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

IBM recently released their Granite models, a series of code models ranging from 3B to 34B parameters with base and instruct finetunes.

https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330 https://github.com/ibm-granite

Many thanks to the llama.cpp community for their awesome work! It would be great to see this feature added. GGUFs can already be made, but loading them currently fails with a tokenizer error.

YorkieDev avatar May 07 '24 05:05 YorkieDev

The PR adding Granite support to transformers (an MLP bias on the gate, up and down projections) can be found here: https://github.com/huggingface/transformers/pull/30031/files
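
For illustration, the change boils down to giving the gate, up and down projections a bias term. A rough PyTorch sketch of the resulting MLP (not the exact transformers code; the sizes are placeholders):

    import torch.nn as nn

    class GraniteStyleMLP(nn.Module):
        """Sketch of a LLaMA-style SwiGLU MLP with bias enabled on all three
        projections, as the transformers PR above does for Granite."""

        def __init__(self, hidden_size: int = 2560, intermediate_size: int = 10240):
            super().__init__()
            self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=True)
            self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=True)
            self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=True)
            self.act = nn.SiLU()

        def forward(self, x):
            return self.down_proj(self.act(self.gate_proj(x)) * self.up_proj(x))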

sroecker avatar May 07 '24 14:05 sroecker

Based on the discussion in the transformers mlp_bias PR, it's similar to Llama with just the mlp_bias added.

psyv282j9d avatar May 07 '24 20:05 psyv282j9d

I tried to do this here: https://github.com/sroecker/llama.cpp/tree/add_mlp_bias, just adding bias tensors to FFN_GATE, FFN_DOWN and FFN_UP. The tensor shapes seem to be correct, but the model outputs gibberish.

Tested with ./main -m ~/Downloads/granite-3b-code-base.Q8_0.gguf -p "Question: Python code to calculate the Fibonacci series\n\nAnswer:\n" using the GGUF from https://huggingface.co/NikolayKozloff/granite-3b-code-instruct-Q8_0-GGUF.
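
A quick way to confirm the bias tensors actually made it into the converted file is to list them with the gguf Python package (a sketch; the path and exact tensor names depend on the conversion code in the branch above):

    from gguf import GGUFReader

    # Path is illustrative; point it at the converted file.
    reader = GGUFReader("granite-3b-code-base.Q8_0.gguf")
    for t in reader.tensors:
        if "ffn_" in t.name:  # expect e.g. blk.N.ffn_up.weight and blk.N.ffn_up.bias
            print(t.name, list(t.shape), t.tensor_type.name)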

sroecker avatar May 08 '24 08:05 sroecker

@sroecker are you tying the word embeddings? Unlike Llama, the input word embeddings and the output projection matrix are tied for the Granite models.
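
For clarity, "tied" here means the output projection reuses the input embedding matrix instead of having its own weights. A minimal PyTorch illustration (sizes are made up):

    import torch.nn as nn

    vocab_size, hidden_size = 49152, 2560  # made-up sizes for illustration
    embed_tokens = nn.Embedding(vocab_size, hidden_size)
    lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
    lm_head.weight = embed_tokens.weight  # one parameter serves both ends
    assert lm_head.weight.data_ptr() == embed_tokens.weight.data_ptr()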

mayank31398 avatar May 08 '24 08:05 mayank31398

Ah, not yet. Thanks! I guess then we need to define an additional ARCH (or save the mlp_bias boolean in the GGUF) and implement it like with MPT https://github.com/ggerganov/llama.cpp/blob/7e0b6a7b3ba94ff624dc27c1e0e735fded8819b8/llama.cpp#L5287

sroecker avatar May 08 '24 10:05 sroecker

(or save the mlp_bias boolean in the GGUF

Is there a way to add mlp_bias to an already-made GGUF? I ask because you mentioned my Q8 GGUF in one of your previous messages.

JohnClaw avatar May 08 '24 10:05 JohnClaw

(or save the mlp_bias boolean in the GGUF

Is there a way to add mlp_bias to an already-made GGUF? I ask because you mentioned my Q8 GGUF in one of your previous messages.

You could hack something together with the gguf writer: https://pypi.org/project/gguf/
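
A minimal sketch of what that could look like, assuming the gguf package's GGUFWriter API. The key name llama.mlp_bias is made up here, and a real patch would also have to copy every existing KV field and tensor over from the original file (e.g. via GGUFReader):

    from gguf import GGUFWriter

    # Sketch only: writes a tiny GGUF containing a single boolean KV pair.
    # "llama.mlp_bias" is a hypothetical key name, not an established one.
    w = GGUFWriter("mlp-bias-demo.gguf", "llama")
    w.add_bool("llama.mlp_bias", True)
    w.write_header_to_file()
    w.write_kv_data_to_file()
    w.write_tensors_to_file()
    w.close()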

sroecker avatar May 08 '24 11:05 sroecker

So I've adapted build_llama to include the MLP biases as well. I've added a few FIXMEs to my branch to indicate places that might need to be adapted for the different Granite models. Now the output is valid text but unfortunately repeats itself:

<|endoftext|>Question: Fibonacci series in Python? \n\nAnswer: Python? \n\n series in Python? \n\n series in Python? \n\n series in Python? \

sroecker avatar May 08 '24 14:05 sroecker

I'm here to write words of support; I am interested in exploring what IBM + Ollama can do.

dataf3l avatar May 08 '24 22:05 dataf3l

I'm here to write words of support; I am interested in exploring what IBM + Ollama can do.

+1 to this

adrianpuiu avatar May 08 '24 23:05 adrianpuiu

Is there any progress on support for the Granite models?

davideuler avatar May 12 '24 14:05 davideuler

AFAIK, we have been stuck on the issue of repeating text output. The tokenizer appeared to be the likely culprit, but it does seem to be in order (correct token ids, etc.). I don't know if @sroecker has made any strides since.

jpodivin avatar May 12 '24 19:05 jpodivin

AFAIK, we have been stuck on the issue of repeating text output. The tokenizer appeared to be the likely culprit, but it does seem to be in order (correct token ids, etc.). I don't know if @sroecker has made any strides since.

Yes, unfortunately. The lab version of Granite works well with llama.cpp: https://huggingface.co/instructlab/granite-7b-lab-GGUF It doesn't have the MLP bias nodes and uses a different tokenizer, though.

I've tried a few things regarding tokenization. I checked that the tokenizer creates the same input tokens with ./main -m granite-3b-code-base.gguf -p "def generate():" -ngl 0 --override-kv tokenizer.ggml.add_bos_token=bool:false:

  0 -> '<|endoftext|>'
  589 -> 'def'
  4450 -> ' generate'
  2262 -> '():'

I recreated the f16 GGUF forcing the pre-tokenizer to be llama-bpe instead of refact. No luck so far. There's a lot of ARCH-specific code all over llama.cpp which might change important parameters, so I'm thinking about creating a simple debugging example based on examples/simple/simple.cpp.
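
For comparison on the HF side, the reference token ids can be printed with transformers (the model id is assumed to be the one from the collection linked above):

    from transformers import AutoTokenizer

    # Compare these ids/tokens against what ./main prints for the same prompt.
    tok = AutoTokenizer.from_pretrained("ibm-granite/granite-3b-code-base")
    ids = tok("def generate():").input_ids
    print(ids)
    print(tok.convert_ids_to_tokens(ids))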

sroecker avatar May 13 '24 07:05 sroecker

The lab version is a different model, not to be confused with this one.

mayank31398 avatar May 13 '24 11:05 mayank31398

The lab version is a different model, not to be confused with this one.

I'm aware of that; it did work out of the box with the LLM_ARCH_LLAMA settings, though, so I'm trying to find out why exactly. But you're right to point this out, as a few people have mixed these up.

I will check the convert-hf-to-gguf-update.py script again to rule out the tokenizer before I start digging deeper.

sroecker avatar May 13 '24 14:05 sroecker

Hmm, a quick question: are we tying the word embeddings and the output logits matrix? Llama doesn't do that and Granite has tied embeddings; maybe that's the issue? I don't think the tokenizer should be the issue, since all Granite models use the StarCoder tokenizer.

mayank31398 avatar May 13 '24 20:05 mayank31398

Hmm, a quick question: are we tying the word embeddings and the output logits matrix? Llama doesn't do that and Granite has tied embeddings; maybe that's the issue? I don't think the tokenizer should be the issue, since all Granite models use the StarCoder tokenizer.

If no output layer is found, the word embeddings are used instead: https://github.com/ggerganov/llama.cpp/blob/541600201e6480f54ae09e58d16b154d4b4b331d/llama.cpp#L4926-L4932

sroecker avatar May 14 '24 08:05 sroecker

Hmm, ok so there are these differences between llama and granite:

  1. attention has bias (llama doesn't)
  2. mlp has bias (llama doesn't)
  3. tied word embeddings (llama doesn't)
  4. starcoder tokenizer
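
All of these are visible in the HF config, so they can be double-checked directly. A quick sketch, assuming the model id from the collection above; cfg.mlp_bias requires a transformers build that includes the PR linked earlier:

    from transformers import AutoConfig

    # Points 1-3 above map to attention_bias, mlp_bias and tie_word_embeddings.
    cfg = AutoConfig.from_pretrained("ibm-granite/granite-3b-code-base")
    print(cfg.attention_bias, cfg.mlp_bias, cfg.tie_word_embeddings)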

mayank31398 avatar May 14 '24 15:05 mayank31398

Hmm, ok so there are these differences between llama and granite:

  1. attention has bias (llama doesn't)
  2. mlp has bias (llama doesn't)
  3. tied word embeddings (llama doesn't)
  4. starcoder tokenizer

Do all Granite code models use the StarCoder tokenizer? Based on your HF repo comment I tried to get the 20b and 34b to run. They are recognized as StarCoder arch by the convert-hf-to-gguf script, and all I had to modify was to tie the embedding weights. 20b instruct works quite well, even with the BOS token; the Q3_K_L quant comes down to 11GB.

Please have a try with these changes: https://github.com/sroecker/llama.cpp/commit/6f201480de46aba0d5f718a2a8bdf424bd8e8274

For the 3b and 8b models, 1) and 4) remain. We have to check if the attention bias is set up correctly in llm_build_kv; build_refact should be good for comparison.

sroecker avatar May 15 '24 21:05 sroecker

Yeah, all of them use the StarCoder tokenizer.

mayank31398 avatar May 15 '24 21:05 mayank31398

In case it helps, for the 8b-instruct model: after converting with

python3 llama.cpp/convert.py granite-8b-ins --outfile granite-8b-ins/granite-8b-instruct.bin --outtype q8_0 --vocab-type bpe --pad-vocab

I got this error when starting ./llama.cpp/main -m ./granite-8b-ins/granite-8b-instruct.bin:

llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 578, got 470

The same numbers show up when using convert-hf-to-gguf.py.

During the conversion it shows 578. The last two lines:

INFO:convert:[577/578] Writing tensor blk.35.attn_v.weight | size 1024 x 4096 | type Q8_0 | T+ 84
INFO:convert:[578/578] Writing tensor output_norm.weight | size 4096 | type F32 | T+ 84

Tried with q8_0, f16 and f32; same error.

Thank you for this great work!

DigitLib avatar May 19 '24 19:05 DigitLib

I don't think the 3b and 8b are working yet, @DigitLib. The 34b and 20b PR is merged and is working: https://github.com/ggerganov/llama.cpp/pull/7324

The 20b-base GGUF is available now: https://huggingface.co/ibm-granite/granite-20b-code-base-GGUF. I will add the instruct and 34b tomorrow.

mayank31398 avatar May 20 '24 00:05 mayank31398

@mayank31398 I know, I just wanted to help with the 8b-instruct. Thank you!

DigitLib avatar May 20 '24 10:05 DigitLib

@DigitLib you need https://github.com/sroecker/llama.cpp/commit/36dc5bbffe083545045ec2441ddc7f5c085d3caf to load the smaller models

giuseppe avatar May 20 '24 12:05 giuseppe

If that commit is working, can we open a PR, @sroecker?

mayank31398 avatar May 20 '24 13:05 mayank31398

That doesn't seem to be enough. The model loads, but it doesn't produce any good results: https://github.com/ggerganov/llama.cpp/issues/7116#issuecomment-2100061526

giuseppe avatar May 20 '24 13:05 giuseppe

https://huggingface.co/coder543/granite-20b-code-instruct-GGUF/tree/main

I've uploaded the q8_0, q6_K, and q4_0 gguf files for the 20B Instruct model here. I've only lightly tested them, and this is my first time quantizing any LLMs, but it seemed like they were working okay?

If anyone wants to test them, I'm curious if they work for you.

The chat template seems to be something like this:

Question:
Write a React TypeScript component

Answer:
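
In code that would be roughly the following (a guess based on the template above, not an official prompt format):

    def granite_20b_prompt(question: str) -> str:
        # Prompt format inferred from the template above; not an official spec.
        return f"Question:\n{question}\n\nAnswer:\n"

    print(granite_20b_prompt("Write a React TypeScript component"))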

coder543 avatar May 20 '24 14:05 coder543

I've managed to get some output that makes sense with the 3b model, and I've opened a PR:

  • https://github.com/ggerganov/llama.cpp/pull/7481

IMHO it makes sense to define a new architecture for Granite, as there are substantial differences from the base Llama model. To convert the HF model using the code in my PR, I modified the config.json file in the Granite model and used:

  "architectures": [
    "GraniteForCausalLM"
  ],
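
The same edit can be scripted, for example (the path is a placeholder for the local model directory):

    import json

    # Placeholder path: point this at the local copy of the Granite checkpoint.
    config_path = "path-to-granite-model/config.json"
    with open(config_path) as f:
        cfg = json.load(f)
    cfg["architectures"] = ["GraniteForCausalLM"]
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=2)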

@mayank31398 what do you think?

giuseppe avatar May 22 '24 23:05 giuseppe

@giuseppe did you get the 3b GGUF working with #7481? If you teach me how to get it running locally on my M1, I can run some tests too :)

celek avatar May 23 '24 00:05 celek

To reproduce locally you can run the following:

  1. Clone down Giuseppe's branch, pip install necessary packages (e.g. torch, transformers, numpy, sentencepiece), and build Llama.cpp (i.e. run make)
  2. Download the 3B or 8B model from HF
  3. Modify the config.json per @giuseppe's comment above (i.e. LlamaForCausalLM -> GraniteForCausalLM)
  4. Convert to GGUF (e.g. ./convert-hf-to-gguf.py path-to-granite-model --outtype q8_0 --outfile path-to-converted-model/converted-model-name.gguf)
  5. Run inference against the GGUF model (e.g. ./main -m path-to-converted-model/converted-model-name.gguf -p "Write a simple hello world script in python.")

Inference output should be something like the following (ignoring logging output for brevity):

print("Hello World")
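
As an alternative to step 5, once these llama.cpp changes are picked up by the llama-cpp-python bindings, the same smoke test can be run from Python (paths are placeholders, as in the steps above):

    from llama_cpp import Llama

    # Placeholder path: the GGUF produced in step 4.
    llm = Llama(model_path="path-to-converted-model/converted-model-name.gguf")
    out = llm("Write a simple hello world script in python.", max_tokens=64)
    print(out["choices"][0]["text"])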

HunterGerlach avatar May 23 '24 04:05 HunterGerlach