
Support Mistral-Nemo-Instruct-2407 128K

mirek190 opened this issue 1 year ago • 48 comments

Prerequisites

  • [X] I am running the latest code. Mention the version if possible as well.
  • [X] I carefully followed the README.md.
  • [X] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [X] I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Any plans to support Mistral-Nemo-Instruct-2407 128K?

https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407

Motivation

enhancement

Possible Implementation

No response

mirek190 avatar Jul 18 '24 19:07 mirek190

yes, please. this one is going to be good, and finetunes will soon start to pop up...

0wwafa avatar Jul 18 '24 20:07 0wwafa

And this: https://huggingface.co/nvidia/Mistral-NeMo-12B-Instruct

0wwafa avatar Jul 18 '24 20:07 0wwafa

I second the request. This model is likely to become the reference for the 7-12B segment, and finetuned versions will indeed appear rapidly. Thanks in advance.

delphijb avatar Jul 18 '24 20:07 delphijb

They claim it can be a drop-in replacement for Mistral 7B, so it shouldn't be too much work to get it working with ggml, since Mistral 7B already works.

stduhpf avatar Jul 18 '24 21:07 stduhpf

They claim it can be a drop-in replacement for Mistral 7B, so it shouldn't be too much work to get it working with ggml, since Mistral 7B already works.

The issue is that it uses a custom tokenizer named Tekken. That's not an issue for any program that uses Transformers, since its tokenizer system supports the custom tokenizer, which is why they call it a drop-in replacement.

For llama.cpp, however, the custom tokenizer has to be implemented manually, and implementing new tokenizers correctly is usually not easy. Gemma-2's and Llama-3's tokenizers, for instance, took quite a while to implement properly, and it took multiple attempts as bugs were found over time.

EliEron avatar Jul 18 '24 21:07 EliEron
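For context, the "drop-in" part on the Transformers side just means the Tekken tokenizer loads through the usual AutoTokenizer machinery. A minimal sketch, assuming a recent transformers install and access to the HF repo:

from transformers import AutoTokenizer

# Loads the Tekken tokenizer through the standard Transformers API;
# nothing llama.cpp-specific is involved here.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407")
ids = tok.encode("What is the capital of South Korea?")
print(ids)
print(tok.decode(ids))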

I actually think the tokenizer might not be too different from others. It's listed as GPT2Tokenizer in the tokenizer_config.json and it has a pre-tokenizer of the usual form. I was able to add it in the standard fashion with a pre-tokenizer entry and the update script.

The other issue is that the tensor shapes relating to attention are not the sizes expected by the current Mistral implementation (see my other comment here https://github.com/ggerganov/llama.cpp/discussions/8576#discussioncomment-10088864). I was able to brute-force hack it into at least running, and I'm getting sensible output, which makes me think the tokenizer is doing okay. For example:

PROMPT: What is the capital of South Korea in Hangul?
RESPONSE: The capital of South Korea in Hangul is 서울 (Seoul).

iamlemec avatar Jul 18 '24 22:07 iamlemec
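The attention-shape issue comes down to head_dim. A rough sketch of the arithmetic, assuming the values in the model's config.json (hidden_size 5120, 32 attention heads, an explicit head_dim of 128):

hidden_size = 5120
num_attention_heads = 32
head_dim = 128  # explicit field in config.json, not hidden_size // num_attention_heads

# What the existing Mistral code path expects for blk.0.attn_q.weight
expected_rows = hidden_size                    # assumes head_dim == hidden_size // heads (160)
# What the checkpoint actually contains
actual_rows = num_attention_heads * head_dim   # 32 * 128 = 4096

print(expected_rows, actual_rows)              # 5120 vs 4096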

If this model works well, we should also try to add FP8 support to llama.cpp and make full use of the QAT. Without native FP8 support that will take more compute than Q8_0, but it'll probably end up memory-bound anyway.

netrunnereve avatar Jul 19 '24 01:07 netrunnereve
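For anyone unfamiliar with what Q8_0 does today, here is a toy numpy sketch of blockwise 8-bit integer quantization in that spirit (not llama.cpp's actual implementation); a native FP8 path would store 8-bit floats directly instead of an int8 code plus a per-block scale:

import numpy as np

def quantize_q8_0_like(x, block_size=32):
    # Toy blockwise quantization in the spirit of Q8_0: one fp16 scale per
    # block of 32 values plus int8 codes. Not llama.cpp's actual code.
    x = x.reshape(-1, block_size)
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize(q, scale):
    return q.astype(np.float32) * scale.astype(np.float32)

w = np.random.randn(4, 32).astype(np.float32)
q, s = quantize_q8_0_like(w)
print(np.abs(dequantize(q, s).reshape(w.shape) - w).max())  # small reconstruction error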

For those curious, I have a seemingly working (and now non-hacky) inference implementation here: https://github.com/iamlemec/llama.cpp/tree/mistral-nemo. Will push a PR for it once #8579 is merged and I can rebase off of it.

iamlemec avatar Jul 19 '24 21:07 iamlemec

For those curious, I have a seemingly working (and now non-hacky) inference implementation here: https://github.com/iamlemec/llama.cpp/tree/mistral-nemo. Will push a PR for it once #8579 is merged and I can rebase off of it.

Can't wait to test ;)

mirek190 avatar Jul 19 '24 23:07 mirek190

Also, for those who are interested, chatllm.cpp supports this.

foldl avatar Jul 20 '24 09:07 foldl

For those curious, I have a seemingly working (and now non-hacky) inference implementation here: https://github.com/iamlemec/llama.cpp/tree/mistral-nemo. Will push a PR for it once #8579 is merged and I can rebase off of it.

Seems to work perfectly so far. Nice job.

stduhpf avatar Jul 20 '24 13:07 stduhpf

#8579 is merged

muhammadyusuf-kurbonov avatar Jul 20 '24 16:07 muhammadyusuf-kurbonov

Just quantized Mistral-Nemo-Instruct, and when trying to run it I get the following error:

llm_load_tensors: ggml ctx size =    0.17 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q.weight' has wrong shape; expected  5120,  5120, got  5120,  4096,     1,     1
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './Mistral-Nemo-Instruct-2407.Q8_0.gguf'
main: error: unable to load model

Looks like there's a shape mismatch.

According to the config file, the hidden size should be 5120 https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407/blob/main/config.json#L10

legraphista avatar Jul 20 '24 16:07 legraphista
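A quick way to see which shapes actually landed in a converted file is the gguf Python package that ships with llama.cpp (a sketch; the filename is just an example):

from gguf import GGUFReader  # gguf-py, included in the llama.cpp repo

reader = GGUFReader("Mistral-Nemo-Instruct-2407.Q8_0.gguf")  # example path
for t in reader.tensors:
    if t.name == "blk.0.attn_q.weight":
        # For Nemo this should report 5120 x 4096 (hidden_size x heads*head_dim)
        print(t.name, list(t.shape))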

Where can I find a proper GGUF?

mirek190 avatar Jul 20 '24 18:07 mirek190

Hi @legraphista

I have a new build from the main branch with the new PR merged, and I am also using convert_hf_to_gguf.py. But I am getting this error:

llama.cpp/gguf-py/gguf/metadata.py", line 233, in get_model_id_components
    if at_start and ((len(t) == 0 and part[0].isalpha()) or "version" in t):
IndexError: string index out of range

Could you please let me know if I am missing something here?

maziyarpanahi avatar Jul 20 '24 19:07 maziyarpanahi
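That IndexError looks independent of model support: the traceback points at part[0] being indexed while part is an empty string. A minimal reproduction plus a hypothetical guard (not necessarily the actual upstream fix in gguf-py):

# Minimal reproduction of the IndexError above, with a hypothetical guard;
# the real fix upstream may differ.
part, t, at_start = "", "", True

# Original condition: part[0] raises IndexError when part is empty
# if at_start and ((len(t) == 0 and part[0].isalpha()) or "version" in t): ...

# Guarded version: check the part is non-empty before indexing it
if at_start and len(part) > 0 and ((len(t) == 0 and part[0].isalpha()) or "version" in t):
    print("matched")
else:
    print("skipped empty part")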

Hi @legraphista

I have a new build from the main branch with the new PR merged, and I am also using convert_hf_to_gguf.py. But I am getting this error:

llama.cpp/gguf-py/gguf/metadata.py", line 233, in get_model_id_components
    if at_start and ((len(t) == 0 and part[0].isalpha()) or "version" in t):
IndexError: string index out of range

Could you please let me know if I am missing something here?

Only the tokenizer support has been added by the PR. If you want to run the model, you can use @iamlemec's fork.

stduhpf avatar Jul 20 '24 19:07 stduhpf

@mirek190 I have a Q5_K of Instruct here: https://huggingface.co/CompendiumLabs/mistral-nemo-instruct-2407-gguf. Happy to add more variants if needed.

iamlemec avatar Jul 20 '24 19:07 iamlemec

Hi @legraphista I have a new build from the main branch with the new PR merged, and I am also using convert_hf_to_gguf.py. But I am getting this error:

llama.cpp/gguf-py/gguf/metadata.py", line 233, in get_model_id_components
    if at_start and ((len(t) == 0 and part[0].isalpha()) or "version" in t):
IndexError: string index out of range

Could you please let me know if I am missing something here?

Only the tokenizer support has been added by the PR. If you want to run the model, you can use @iamlemec's fork.

I am actually trying to quantize it at the moment; since I saw it was done successfully here, I was wondering what I'm missing.

maziyarpanahi avatar Jul 20 '24 19:07 maziyarpanahi

@mirek190 I have a Q5_K of Instruct here: https://huggingface.co/CompendiumLabs/mistral-nemo-instruct-2407-gguf. Happy to add more variants if needed.

ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.34 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q.weight' has wrong shape; expected  5120,  5120, got  5120,  4096,     1,     1
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'models/new3/mistral-nemo-instruct-q5_k.gguf'
main: error: unable to load model

Your version also doesn't work.

mirek190 avatar Jul 20 '24 20:07 mirek190

@mirek190 try running a make clean first; the project makefiles don't appear to be 100% reliable.

cwillu avatar Jul 20 '24 21:07 cwillu

I got @iamlemec's commit working with his specific GGUF: CompendiumLabs/mistral-nemo-instruct-2407-gguf

Didn't get it working with QuantFactory/Mistral-Nemo-Instruct-2407-GGUF

Looks like PR is incoming #8604

ubergarm avatar Jul 20 '24 21:07 ubergarm

Also, for those who are interested, chatllm.cpp supports this.

Yeah, cool, but how do you run it if the name of the model is nowhere to be found? python chatllm.py -i -m :?????????

mlsterpr0 avatar Jul 21 '24 19:07 mlsterpr0

#8604 is merged!!

muhammadyusuf-kurbonov avatar Jul 22 '24 09:07 muhammadyusuf-kurbonov

Uploading quants based on the latest master branch: https://huggingface.co/MaziyarPanahi/Mistral-Nemo-Instruct-2407-GGUF/discussions/2

Hope the post tests go well. Thank you all for making this happen!

maziyarpanahi avatar Jul 22 '24 09:07 maziyarpanahi

llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'tekken'
llama_load_model_from_file: failed to load model
Traceback (most recent call last):
  File "/root/Mistral-Nemo-Instruct-2407/chat.py", line 21, in <module>
    generated_text = run_inference(model_path, prompt)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Mistral-Nemo-Instruct-2407/chat.py", line 5, in run_inference
    llm = Llama(model_path=model_path)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/env/lib/python3.11/site-packages/llama_cpp/llama.py", line 358, in __init__
    self._model = self._stack.enter_context(contextlib.closing(_LlamaModel(
                                                               ^^^^^^^^^^^^
  File "/root/env/lib/python3.11/site-packages/llama_cpp/_internals.py", line 54, in __init__
    raise ValueError(f"Failed to load model from file: {path_model}")
ValueError: Failed to load model from file: ggml-model-Q4_K_M.gguf

It converts fine, but I'm still having issues with the 'tekken' pre-tokenizer when using llama-cpp-python.

danilofalcao avatar Jul 22 '24 09:07 danilofalcao
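One likely explanation (an assumption, not confirmed in this thread) is that the installed llama-cpp-python wheel bundles a llama.cpp build that predates the tekken pre-tokenizer, so the GGUF converts fine with current llama.cpp but the older bundled loader rejects it. A quick check:

import llama_cpp

# If this version predates the release that picked up the tekken pre-tokenizer,
# the bundled llama.cpp will reject the GGUF even though conversion succeeded.
print(llama_cpp.__version__)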

Latest quants are available at https://huggingface.co/QuantFactory/Mistral-Nemo-Instruct-2407-GGUF

I got @iamlemec's commit working with his specific GGUF: CompendiumLabs/mistral-nemo-instruct-2407-gguf

Didn't get it working with QuantFactory/Mistral-Nemo-Instruct-2407-GGUF

Looks like PR is incoming #8604

Updating the quants based on the latest release at: Mistral-Nemo-Instruct-2407-GGUF

aashish-1904 avatar Jul 22 '24 09:07 aashish-1904

Just tried with the latest quants at QuantFactory/Mistral-Nemo-Instruct-2407-GGUF and build 3437, and I'm getting the following error (Xeon E-2176G with 64GB RAM, under Debian). Apparently it's trying to allocate a very large CPU buffer:

ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 167772160032
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
llama_init_from_gpt_params: error: failed to create context with model 'Mistral-Nemo-Instruct-2407.Q4_K_S.gguf'
 ERR [              load_model] unable to load model | tid="140094037281280" timestamp=1721650251 model="Mistral-Nemo-Instruct-2407.Q4_K_S.gguf"

EricGrange avatar Jul 22 '24 12:07 EricGrange

Just tried with the latest quants at QuantFactory/Mistral-Nemo-Instruct-2407-GGUF and build 3437, and I'm getting the following error (Xeon E-2176G with 64GB RAM, under Debian). Apparently it's trying to allocate a very large CPU buffer.

I encountered the same problem while testing the new build; the advice that helped me was to use the -c parameter.

sbelenki avatar Jul 22 '24 12:07 sbelenki

Thanks, confirmed: passing the context size explicitly does the trick. It also seems to work correctly when using "-c 131072" (128k, if I'm not mistaken).

EricGrange avatar Jul 22 '24 12:07 EricGrange
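For llama-cpp-python users hitting the same oversized KV-cache allocation, the equivalent of -c is the n_ctx constructor argument (a sketch; the model path is just an example):

from llama_cpp import Llama

# Capping the context keeps the KV cache from being sized for the model's
# full advertised context window. Model path is an example.
llm = Llama(model_path="Mistral-Nemo-Instruct-2407.Q4_K_S.gguf", n_ctx=8192)
out = llm("Q: What is the capital of South Korea?\nA:", max_tokens=32)
print(out["choices"][0]["text"])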

I did an ollama update and am still getting the same error.

jthack avatar Jul 22 '24 13:07 jthack