
Add Nemotron/Minitron GGUF Conversion & Inference Support

Open · suhara opened this pull request 1 year ago

This PR adds HF->GGUF conversion and inference support for the Nemotron family, including Nemotron-3, Nemotron-4, and "Minitron" models.

The PR should support any Nemotron/Minitron model, but it has been primarily tested with the following Minitron model:

HF support for Nemotron was recently added: as of Transformers 4.44.0, Nemotron is supported (thank you @Vaibhavs10 for the information!). You may need to install a newer version of the transformers library by running pip install "transformers>=4.44.0" (quote the requirement so the shell doesn't treat >= as an output redirect).
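A quick way to sanity-check the environment before converting (a minimal sketch; only the 4.44.0 minimum comes from the note above):

import transformers
from packaging import version

# Nemotron support landed in Transformers 4.44.0.
assert version.parse(transformers.__version__) >= version.parse("4.44.0"), \
    f"transformers {transformers.__version__} is too old; install >= 4.44.0"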

Please see this PR for details.

The Nemotron architecture is similar to the Llama-2 architecture, with a few key differences (see the sketch below):

  • Vocabulary size: Nemotron uses a 256k SentencePiece tokenizer
  • FFN layer: Nemotron uses a Squared-ReLU activation, with only up and down projections (no gate projection)
  • RoPE: Nemotron uses partial RoPE, rotating only 50% of each head's dimensions
  • Layer Normalization: Nemotron adds 1 to LayerNorm's weight for better numerical stability
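To make two of these differences concrete, here is a minimal PyTorch sketch of the squared-ReLU FFN and the "+1" LayerNorm (class and parameter names are my own illustration, not code from this PR):

import torch
import torch.nn as nn
import torch.nn.functional as F

class NemotronMLP(nn.Module):
    # Squared-ReLU FFN: up and down projections only, no gate projection.
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down_proj(torch.relu(self.up_proj(x)) ** 2)

class LayerNorm1p(nn.LayerNorm):
    # The stored weight is offset by 1 when the norm is applied.
    def forward(self, x):
        return F.layer_norm(x, self.normalized_shape,
                            self.weight + 1, self.bias, self.eps)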

You can find details about the model architecture in the following papers:



This PR was created in collaboration with co-contributor @SpaceCowboy850.

suhara · Aug 08 '24 07:08

Awesome! Thank you for sharing @Vaibhavs10! I've updated the original PR description.

suhara · Aug 08 '24 07:08

Thank you @compilade for the comments and suggestions! Committed changes accordingly.

suhara · Aug 09 '24 09:08

Hi @Vaibhavs10 , thanks for reviewing! I rebased it onto the latest main branch.

suhara · Aug 13 '24 00:08

Hi @compilade, can you take a look and see if it looks good to you? Thank you!

suhara · Aug 14 '24 20:08

Thank you all for your reviews and support @compilade @Vaibhavs10 @ggerganov @slaren!

Could anybody help merge this PR? Thank you!

suhara · Aug 15 '24 18:08

Sorry for disturbing, but when I try to convert the linked minitron-4b model with transformers 4.44.0 and current llama.cpp, it simply complains about a missing tokenizer.model. Any idea why that could be?

schmorp · Aug 16 '24 08:08

Hi @schmorp

I think the repo has been updated and tokenizer.model (in the SentencePiece format) is no longer hosted there.

You can extract tokenizer.model from nemo/minitron-4b-base.nemo (a .nemo file is a tar archive):

$ cd minitron/nemo
$ tar -xf minitron-4b-base.nemo
$ ls
914829c706e34a92ab89d5213695f4e5_nemotron_2_256k.model
b1bc02bf987043f3884c39152f183238_nemotron_2_256k.model
minitron-4b-base.nemo
model_config.yaml
model_weights
$ cp 914829c706e34a92ab89d5213695f4e5_nemotron_2_256k.model ../tokenizer.model

$ cd ../../
$ python convert_hf_to_gguf.py minitron --outtype f16 --outfile model.gguf

There are two tokenizer files, but they are identical and either can be renamed as tokenizer.model (see the quick check below):

  • 914829c706e34a92ab89d5213695f4e5_nemotron_2_256k.model
  • b1bc02bf987043f3884c39152f183238_nemotron_2_256k.model
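If you want to verify that yourself before renaming one of them, a quick byte-for-byte comparison (a sketch; run it from the minitron/nemo directory, with filenames taken from the listing above):

import hashlib

def sha256(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

a = sha256("914829c706e34a92ab89d5213695f4e5_nemotron_2_256k.model")
b = sha256("b1bc02bf987043f3884c39152f183238_nemotron_2_256k.model")
print(a == b)  # True -> either file can serve as tokenizer.model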

suhara · Aug 17 '24 09:08

@suhara thanks a lot!

schmorp · Aug 17 '24 13:08

Minitron-8B converts, but then can't be used:

llm_load_tensors: ggml ctx size = 0.15 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q.weight' has wrong shape; expected 4096, 4096, got 4096, 6144, 1, 1
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/tmp/Minitron-8B-Base.gguf'
main : failed to init

schmorp · Aug 17 '24 14:08

Minitron-4B seems to work, so it looks like Minitron-8B is not quite supported yet.

schmorp · Aug 17 '24 14:08

I'll look into this but I think I know the root cause.

8B uses head_dim: 128 and that may be the cause. https://huggingface.co/nvidia/Minitron-8B-Base/blob/main/config.json#L25

Many HF models, including Llama, assume head_dim == hidden_size // num_attention_heads.

llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q.weight' has wrong shape; expected 4096, 4096, got 4096, 6144, 1, 1

6144 = 128 * 48, so the converted tensor is correct; it's the expected shape (4096) that is wrong.


FYI, for 4B, head_dim (128) == hidden_size (3072) // num_attention_heads (24), so this doesn't trigger the issue there.
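Spelling out the arithmetic (numbers taken from the config.json files referenced above):

# Minitron-8B: head_dim is set explicitly and differs from the derived value.
hidden_size = 4096
num_attention_heads = 48
head_dim = 128

actual_q_out = head_dim * num_attention_heads  # 6144: second dim of the converted attn_q.weight
assumed_q_out = hidden_size                    # 4096: what the loader expects when it
                                               # derives head_dim from hidden_size
print(actual_q_out, assumed_q_out)             # 6144 4096 -> check_tensor_dims fails

# Minitron-4B: 128 == 3072 // 24, so the values coincide and the model loads fine.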

suhara · Aug 17 '24 15:08

That's good news, thanks for looking into this. I'll have a try at the 340B.

schmorp · Aug 18 '24 02:08

For the 340B, conversion fails immediately because there isn't a config.json file.

schmorp · Aug 18 '24 04:08

I tried nvidia/Nemotron-4-340B-Instruct as well. It turns out that even if you add a config.json, the conversion produces a metadata-only GGUF, since all Nemotron-3 and Nemotron-4 models lack pytorch_model.bin or any safetensors files.

The only option seems to be the safetensors conversions provided by @mgoin under https://huggingface.co/collections/mgoin/nemotron-in-vllm-66a151b4240bcd9c28735ec5. Unfortunately, he never shared how he converted the .nemo checkpoints into safetensors.

nicoboss · Aug 18 '24 19:08

@nicoboss if the conversion steps and script would be useful, I can document this tomorrow!

mgoin · Aug 18 '24 19:08

> @nicoboss if the conversion steps and script would be useful, I can document this tomorrow!

This would be absolutely awesome, thanks a lot! I'm very interested in how the conversion works. Maybe it would even be possible to implement it inside convert_hf_to_gguf.py. I'm currently working with @schmorp to GGUF-quantize all Nemotron-3, Nemotron-4 and "Minitron" models. While your collection is great, it unfortunately misses many Nemotron-3 models, which we could convert on our own if you share your tools and knowledge. Nemotron-4-340B-Instruct is one of my favorite models and I can't thank you enough for converting it into a usable format.

nicoboss · Aug 18 '24 20:08

And just to document this here, Llama-3.1-Minitron-4B-Width-Base fails with:

cvs/llama.cpp/ggml/src/ggml.c:6399: GGML_ASSERT(c->ne[0] >= n_dims / 2) failed

schmorp · Aug 22 '24 09:08