Add Nemotron/Minitron GGUF Conversion & Inference Support
This PR adds HF->GGUF conversion & inference support for Nemotron models including Nemotron-3, Nemotron-4 and "Minitron" models.
The PR should support any Nemotron/Minitron model, but it has been primarily tested with the Minitron-4B-Base model.
HF support for Nemotron was recently added, and as of Transformers 4.44.0 Nemotron is supported (thank you @Vaibhavs10 for the information!). You may need to install a newer version of the transformers library by running `pip install 'transformers>=4.44.0'` (the quotes keep the shell from interpreting `>=`).
Please see this PR for details.
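If you are unsure whether your environment is recent enough, a quick optional check from Python:

```python
# Quick sanity check: Nemotron support landed in transformers 4.44.0.
import transformers

print(transformers.__version__)  # should print 4.44.0 or later
```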
The Nemotron architecture is similar to the Llama-2 architecture with a few key differences:
- Vocabulary size: Nemotron uses a 256k-vocabulary SentencePiece tokenizer
- FFN layer: Nemotron uses a squared-ReLU activation with only up and down projections, i.e. no gate projection (see the sketch after this list)
- RoPE: Nemotron uses partial RoPE, applying rotary embeddings to only 50% of the head dimensions
- Layer Normalization: Nemotron adds 1 to LayerNorm's weight for better numerical stability
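For illustration, here is a minimal PyTorch sketch of the two non-standard blocks. Class and variable names are mine, not taken from the llama.cpp or HF implementations:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NemotronMLP(nn.Module):
    """Squared-ReLU FFN: no gate projection, unlike Llama's SiLU-gated FFN."""
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # relu(x)^2 applied between the up and down projections
        return self.down_proj(F.relu(self.up_proj(x)) ** 2)

class LayerNorm1p(nn.LayerNorm):
    """LayerNorm variant that adds 1 to the learned weight (zero-centered gamma)."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.layer_norm(
            x, self.normalized_shape, self.weight + 1, self.bias, self.eps)
```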
You can find details about the model architecture in the following papers:
- [x] I have read the contributing guidelines
- Self-reported review complexity:
- [ ] Low
- [x] Medium
- [ ] High
This PR was created in collaboration with @SpaceCowboy850.
Awesome! Thank you for sharing @Vaibhavs10! I've updated the original PR description.
Thank you @compilade for the comments and suggestions! Committed changes accordingly.
Hi @Vaibhavs10, thanks for reviewing! I rebased it onto the latest main branch.
Hi @compilade Can you take a look and see if it looks good to you? Thank you!
Thank you all for your reviews and support @compilade @Vaibhavs10 @ggerganov @slaren !
Could anybody help merge this PR? Thank you!
Sorry for disturbing, but when I try to convert the linked minitron-4b model with transformers 4.44.0 and current llama.cpp, it simply complains about missing tokenizer.model. Any idea why that could be?
Hi @schmorp
I think the repo has been updated and tokenizer.model (in the SentencePiece format) is not hosted there anymore.
You can actually extract tokenizer.model from nemo/minitron-4b-base.nemo
```console
$ cd minitron/nemo
$ tar -xf minitron-4b-base.nemo
$ ls
914829c706e34a92ab89d5213695f4e5_nemotron_2_256k.model
b1bc02bf987043f3884c39152f183238_nemotron_2_256k.model
minitron-4b-base.nemo
model_config.yaml
model_weights
$ cp 914829c706e34a92ab89d5213695f4e5_nemotron_2_256k.model ../tokenizer.model
$ cd ../..
$ python convert_hf_to_gguf.py minitron --outtype f16 --outfile model.gguf
```
There are two tokenizer files, but they are identical and either can be renamed to tokenizer.model:
- 914829c706e34a92ab89d5213695f4e5_nemotron_2_256k.model
- b1bc02bf987043f3884c39152f183238_nemotron_2_256k.model
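If you want to verify that the two files really are byte-identical before picking one (optional; file names as extracted above):

```python
# Optional check that the two extracted tokenizer files are byte-identical.
import hashlib

def sha256(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

print(sha256("914829c706e34a92ab89d5213695f4e5_nemotron_2_256k.model")
      == sha256("b1bc02bf987043f3884c39152f183238_nemotron_2_256k.model"))
# expected: True
```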
@suhara thanks a lot!
Minitron-8B converts, but then can't be used:
```
llm_load_tensors: ggml ctx size = 0.15 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q.weight' has wrong shape; expected 4096, 4096, got 4096, 6144, 1, 1
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/tmp/Minitron-8B-Base.gguf'
main : failed to init
```
Minitron-4B seems to work, so Minitron-8B is apparently not quite supported yet.
I'll look into this but I think I know the root cause.
8B uses head_dim: 128 and that may be the cause.
https://huggingface.co/nvidia/Minitron-8B-Base/blob/main/config.json#L25
Many HF models, including Llama, assert `head_dim == hidden_size // num_attention_heads`.
```
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q.weight' has wrong shape; expected 4096, 4096, got 4096, 6144, 1, 1
```
6144 = 128 * 48, so the conversion seems to be correct; the expected shape (4096) is wrong.
FYI, for 4B, head_dim (128) == hidden_size (3072) // num_attention_heads (24), so this doesn't trigger the issue.
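A tiny script reproducing the arithmetic, with the values quoted above from the two models' config.json files:

```python
# The Q projection is [hidden_size, head_dim * num_attention_heads];
# the loader currently expects [hidden_size, hidden_size].
configs = {
    "Minitron-8B": {"hidden_size": 4096, "num_attention_heads": 48, "head_dim": 128},
    "Minitron-4B": {"hidden_size": 3072, "num_attention_heads": 24, "head_dim": 128},
}
for name, c in configs.items():
    q_dim = c["head_dim"] * c["num_attention_heads"]
    print(f"{name}: q_dim={q_dim}, hidden_size={c['hidden_size']}, "
          f"match={q_dim == c['hidden_size']}")
# Minitron-8B: q_dim=6144, hidden_size=4096, match=False  -> shape check fails
# Minitron-4B: q_dim=3072, hidden_size=3072, match=True   -> loads fine
```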
That's good news, thanks for looking into this. I'll have a try at the 340B.
For the 340B, conversion fails immediately because there isn't a config.json file.
I tried nvidia/Nemotron-4-340B-Instruct as well. It turns out that even if you add a config.json, the conversion results in a metadata-only GGUF, as all Nemotron-3 and Nemotron-4 models lack pytorch_model.bin or any safetensors files.
The only option seems to be using the safetensors conversions provided by @mgoin under https://huggingface.co/collections/mgoin/nemotron-in-vllm-66a151b4240bcd9c28735ec5. Unfortunately, he never shared how he converted the .nemo checkpoints into safetensors.
@nicoboss if the conversion steps and script would be useful, I can document this tomorrow!
This would be absolutely awesome. Thanks a lot! I'm very interested in how the conversion works. Maybe it would even be possible to implement it inside convert_hf_to_gguf.py. I'm currently working with @schmorp to GGUF-quantize all Nemotron-3, Nemotron-4 and "Minitron" models. While your collection is great, it unfortunately misses many Nemotron-3 models, which we could convert on our own if you share your tools and knowledge. Nemotron-4-340B-Instruct is one of my favorite models and I can't thank you enough for converting it into a usable format.
And just to document this here, Llama-3.1-Minitron-4B-Width-Base fails with:
```
cvs/llama.cpp/ggml/src/ggml.c:6399: GGML_ASSERT(c->ne[0] >= n_dims / 2) failed
```