Support Mistral-Nemo-Instruct-2407 128K
Prerequisites
- [X] I am running the latest code. Mention the version if possible as well.
- [X] I carefully followed the README.md.
- [X] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [X] I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Any plans to support Mistral-Nemo-Instruct-2407 128K?
https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407
Motivation
enhancement
Possible Implementation
No response
Yes, please. This one is going to be good, and finetunes will soon start to pop up...
And this: https://huggingface.co/nvidia/Mistral-NeMo-12B-Instruct
I second the request. This model is likely to become the reference for the 7-12B segment, and finetuned versions will indeed appear rapidly. Thanks in advance.
They claim it can be a drop-in replacement of Mistral 7B, so surely it shouldn't be too much work to make it work with ggml since Mistral 7B works.
> They claim it can be a drop-in replacement of Mistral 7B, so surely it shouldn't be too much work to make it work with ggml since Mistral 7B works.
The issue is that it uses a custom tokenizer named Tekken. That's not an issue for any program that uses Transformers, since its tokenizer system supports the custom tokenizer, which is why they call it a drop-in replacement.
For llama.cpp, however, the custom tokenizer has to be implemented manually, and implementing new tokenizers correctly is usually not easy. Gemma-2's and Llama-3's tokenizers, for instance, took quite a while to implement properly, and it took multiple attempts as bugs were found over time.
I actually think the tokenizer might not be too different from others. It's listed as GPT2Tokenizer in the tokenizer_config.json and it has a pre-tokenizer of the usual form. I was able to add it in the standard fashion with pre-tokenizer and the update script.
The other issue is that the tensor shapes relating to attention are not the sizes expected by the current implementation of Mistral (see my other comment here: https://github.com/ggerganov/llama.cpp/discussions/8576#discussioncomment-10088864). I was able to brute-force hack it into at least running, and I'm getting sensible output, which makes me think the tokenizer is doing okay. For example:
PROMPT: What is the capital of South Korea in Hangul?
RESPONSE: The capital of South Korea in Hangul is 서울 (Seoul).
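For anyone curious what "the standard fashion" looks like, here is a minimal sketch of the fingerprinting idea behind the update script, assuming the HF repo above. The real convert_hf_to_gguf_update.py uses a long fixed probe string, and the resulting hash is what get_vocab_base_pre() in convert_hf_to_gguf.py matches in order to return the new "tekken" pre-tokenizer name; treat the snippet as an illustration, not the exact upstream diff.

```python
# Sketch of how the update script fingerprints a tokenizer: it hashes the token IDs
# produced for a fixed probe string, then keys the pre-tokenizer choice on that hash.
from hashlib import sha256
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407")
chktxt = "Hello, world! \n\t 123"  # placeholder; the real script uses a much longer fixed string
chkhsh = sha256(str(tok.encode(chktxt)).encode()).hexdigest()
print(chkhsh)  # the generated check in get_vocab_base_pre() compares against this and sets res = "tekken"
```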
If this model works well, we should also try to add FP8 support to llama.cpp and make full use of the QAT. Without native FP8 support that will take more compute than Q8_0, but it will probably end up being memory-bound anyway.
For those curious, I have a seemingly working (and now non-hacky) inference implementation here: https://github.com/iamlemec/llama.cpp/tree/mistral-nemo. Will push a PR for it once #8579 is merged and I can rebase off of it.
> For those curious, I have a seemingly working (and now non-hacky) inference implementation here: https://github.com/iamlemec/llama.cpp/tree/mistral-nemo. Will push a PR for it once #8579 is merged and I can rebase off of it.
Can't wait to test ;)
Also, for those who are interested, chatllm.cpp supports this.
> For those curious, I have a seemingly working (and now non-hacky) inference implementation here: https://github.com/iamlemec/llama.cpp/tree/mistral-nemo. Will push a PR for it once #8579 is merged and I can rebase off of it.
Seems to work perfectly so far. Nice job.
#8579 is merged
Just quantized Mistral-Nemo-Instruct and trying to run it I get the following error:
llm_load_tensors: ggml ctx size = 0.17 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q.weight' has wrong shape; expected 5120, 5120, got 5120, 4096, 1, 1
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './Mistral-Nemo-Instruct-2407.Q8_0.gguf'
main: error: unable to load model
Looks like there's a shape mismatch.
According to the config file, the hidden size should be 5120 https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407/blob/main/config.json#L10
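For context, that mismatch is consistent with Mistral-Nemo's config setting an explicit head_dim of 128 with 32 attention heads, so the Q projection is hidden_size x (n_head * head_dim) rather than hidden_size x hidden_size. A quick illustrative check (values taken from the model's config.json, not from llama.cpp):

```python
# Why the shapes disagree: config.json values for Mistral-Nemo-Instruct-2407
hidden_size = 5120  # "hidden_size"
n_head = 32         # "num_attention_heads"
head_dim = 128      # "head_dim" -- set explicitly, NOT hidden_size // n_head (which would be 160)

print(n_head * head_dim)  # 4096 -> the second dimension actually stored in blk.0.attn_q.weight
print(hidden_size)        # 5120 -> what llama.cpp expected when assuming head_dim = hidden_size / n_head
```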
Where can I find a proper GGUF?
Hi @legraphista
I have a new build from the main branch with the new PR merged, and I am also using convert_hf_to_gguf.py, but I am getting this error:
llama.cpp/gguf-py/gguf/metadata.py", line 233, in get_model_id_components
if at_start and ((len(t) == 0 and part[0].isalpha()) or "version" in t):
IndexError: string index out of range
Could you please let me know if I am missing something here?
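Guessing from the quoted line alone, this IndexError can fire when a component of the model name is empty (e.g. a doubled or trailing hyphen in the directory name), because part[0] then indexes an empty string. A hypothetical minimal repro, with the surrounding variable handling guessed rather than taken from the real script:

```python
# Hypothetical repro of the IndexError from gguf-py/gguf/metadata.py:get_model_id_components;
# `part`, `t`, and `at_start` are guesses based on the quoted line, not the real surrounding code.
part = ""        # an empty name component, e.g. from "Some-Model--2407".split("-")
t = part.lower()
at_start = True
if at_start and ((len(t) == 0 and part[0].isalpha()) or "version" in t):  # raises IndexError: part[0] on ""
    pass
```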
> Hi @legraphista
> I have a new build from the main branch with the new PR merged, and I am also using convert_hf_to_gguf.py, but I am getting this error: `IndexError: string index out of range` (gguf-py/gguf/metadata.py, line 233, in get_model_id_components). Could you please let me know if I am missing something here?
Only the tokenizer support has been added by the PR. If you want to run the model, you can use @iamlemec's fork.
@mirek190 I have a Q5_K of Instruct here: https://huggingface.co/CompendiumLabs/mistral-nemo-instruct-2407-gguf. Happy to add more variants if needed.
> Hi @legraphista
> I have a new build from the main branch with the new PR merged, and I am also using convert_hf_to_gguf.py, but I am getting this error: `IndexError: string index out of range` (gguf-py/gguf/metadata.py, line 233, in get_model_id_components). Could you please let me know if I am missing something here?
>
> Only the tokenizer support has been added by the PR. If you want to run the model, you can use @iamlemec's fork.
I am actually trying to quantize it at the moment; since I saw it was done successfully here, I was wondering.
> @mirek190 I have a Q5_K of Instruct here: https://huggingface.co/CompendiumLabs/mistral-nemo-instruct-2407-gguf. Happy to add more variants if needed.
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.34 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q.weight' has wrong shape; expected 5120, 5120, got 5120, 4096, 1, 1
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'models/new3/mistral-nemo-instruct-q5_k.gguf'
main: error: unable to load model
Your version also doesn't work.
@mirek190 try running a make clean first; the project makefiles don't appear to be 100% reliable.
I got @iamlemec's commit working with his specific GGUF from CompendiumLabs/mistral-nemo-instruct-2407-gguf.
I didn't get it working with QuantFactory/Mistral-Nemo-Instruct-2407-GGUF.
Looks like a PR is incoming: #8604
> Also, for those who are interested, chatllm.cpp supports this.
Yeah, cool, but how do you run it if the name of the model is nowhere to be found? `python chatllm.py -i -m :?????????`
#8604 is merged!!
Uploading quants based on the latest master branch: https://huggingface.co/MaziyarPanahi/Mistral-Nemo-Instruct-2407-GGUF/discussions/2
Hope the post-merge tests go well. Thank you all for making this happen!
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'tekken'
llama_load_model_from_file: failed to load model
Traceback (most recent call last):
File "/root/Mistral-Nemo-Instruct-2407/chat.py", line 21, in <module>
generated_text = run_inference(model_path, prompt)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/Mistral-Nemo-Instruct-2407/chat.py", line 5, in run_inference
llm = Llama(model_path=model_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/env/lib/python3.11/site-packages/llama_cpp/llama.py", line 358, in __init__
self._model = self._stack.enter_context(contextlib.closing(_LlamaModel(
^^^^^^^^^^^^
File "/root/env/lib/python3.11/site-packages/llama_cpp/_internals.py", line 54, in __init__
raise ValueError(f"Failed to load model from file: {path_model}")
ValueError: Failed to load model from file: ggml-model-Q4_K_M.gguf
It's able to convert, but I'm still having issues with the 'tekken' pre-tokenizer when using llama-cpp-python.
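For what it's worth, the unknown pre-tokenizer type: 'tekken' error most likely just means the llama.cpp vendored by the installed llama-cpp-python predates the tokenizer PR. After upgrading to a build that includes it, something like the call from the traceback should work; a minimal sketch, assuming the same GGUF path from the traceback and an explicitly chosen context size:

```python
# Sketch assuming an upgraded llama-cpp-python whose vendored llama.cpp knows the "tekken" pre-tokenizer.
# The model path is the one from the traceback above; n_ctx is set explicitly to keep the KV cache small.
from llama_cpp import Llama

llm = Llama(model_path="ggml-model-Q4_K_M.gguf", n_ctx=8192)
out = llm("What is the capital of South Korea in Hangul?", max_tokens=64)
print(out["choices"][0]["text"])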
Latest quants are available at https://huggingface.co/QuantFactory/Mistral-Nemo-Instruct-2407-GGUF
> I got @iamlemec's commit working with his specific GGUF from CompendiumLabs/mistral-nemo-instruct-2407-gguf.
> I didn't get it working with QuantFactory/Mistral-Nemo-Instruct-2407-GGUF.
> Looks like a PR is incoming: #8604
Updating the quants based on the latest release at: Mistral-Nemo-Instruct-2407-GGUF
Just tried the latest quants from QuantFactory/Mistral-Nemo-Instruct-2407-GGUF with build 3437 and got the following error (Xeon E-2176G with 64 GB RAM, under Debian). Apparently it's trying to allocate a very large CPU buffer:
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 167772160032
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
llama_init_from_gpt_params: error: failed to create context with model 'Mistral-Nemo-Instruct-2407.Q4_K_S.gguf'
ERR [ load_model] unable to load model | tid="140094037281280" timestamp=1721650251 model="Mistral-Nemo-Instruct-2407.Q4_K_S.gguf"
> Just tried the latest quants from QuantFactory/Mistral-Nemo-Instruct-2407-GGUF with build 3437 and got the following error (Xeon E-2176G with 64 GB RAM, under Debian). Apparently it's trying to allocate a very large CPU buffer.
I encountered the same problem while testing the new build; the advice that helped me was to use the -c parameter.
Thanks, confirmed: passing the context size explicitly does the trick. It also seems to work correctly when using "-c 131072" (128K, if I'm not mistaken).
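For the curious, a back-of-the-envelope estimate (assuming 40 layers, 8 KV heads, and head_dim 128 from the config, with an f16 KV cache) reproduces the failed allocation above almost exactly once you plug in the model's declared 1,024,000-token maximum context, which appears to be what was used since -c was not passed; it also shows why a smaller -c fixes it:

```python
# Rough f16 KV-cache size estimate (assumes 40 layers, 8 KV heads, head_dim 128 from config.json)
def kv_cache_bytes(n_ctx, n_layer=40, n_head_kv=8, head_dim=128, bytes_per_elem=2):
    return 2 * n_ctx * n_layer * n_head_kv * head_dim * bytes_per_elem  # factor 2 = K and V

print(kv_cache_bytes(1_024_000) / 2**30)  # ~156 GiB -- matches the ~167,772,160,032-byte allocation in the log
print(kv_cache_bytes(131_072) / 2**30)    # ~20 GiB with -c 131072
print(kv_cache_bytes(8_192) / 2**30)      # ~1.25 GiB with -c 8192
```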
I did an ollama update and am still getting the same error.