
Add Nemotron/Minitron GGUF Conversion & Inference Support

Open · suhara opened this pull request 1 year ago

This PR adds HF->GGUF conversion and inference support for the Nemotron family, including Nemotron-3, Nemotron-4, and "Minitron" models.

The PR should support any Nemotron/Minitron model, but it has been primarily tested with the following Minitron model:

HF support for Nemotron was recently added: as of Transformers 4.44.0, Nemotron is supported (thank you @Vaibhavs10 for the information!). You may need to install a newer version of the transformers library by running pip install "transformers>=4.44.0" (quote the requirement so the shell doesn't treat >= as an output redirect).
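A quick way to sanity-check the environment before converting (a minimal sketch; only the 4.44.0 minimum comes from the note above):

import transformers
from packaging import version

# Nemotron support landed in Transformers 4.44.0.
assert version.parse(transformers.__version__) >= version.parse("4.44.0"), \
    f"transformers {transformers.__version__} is too old; install >= 4.44.0"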

Please see this PR for details.

The Nemotron architecture is similar to the Llama-2 architecture, with a few key differences (see the sketch below):

  • Vocabulary size: Nemotron uses a 256k SentencePiece tokenizer
  • FFN layer: Nemotron uses a Squared-ReLU activation, with only up and down projections (no gate projection)
  • RoPE: Nemotron uses partial RoPE, rotating only 50% of each head's dimensions
  • Layer Normalization: Nemotron adds 1 to LayerNorm's weight for better numerical stability
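To make two of these differences concrete, here is a minimal PyTorch sketch of the squared-ReLU FFN and the "+1" LayerNorm (class and parameter names are my own illustration, not code from this PR):

import torch
import torch.nn as nn
import torch.nn.functional as F

class NemotronMLP(nn.Module):
    # Squared-ReLU FFN: up and down projections only, no gate projection.
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down_proj(torch.relu(self.up_proj(x)) ** 2)

class LayerNorm1p(nn.LayerNorm):
    # The stored weight is offset by 1 when the norm is applied.
    def forward(self, x):
        return F.layer_norm(x, self.normalized_shape,
                            self.weight + 1, self.bias, self.eps)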

You can find details about the model architecture in the following papers:



This PR was created in collaboration with co-contributor @SpaceCowboy850.

suhara · Aug 08 '24 07:08

Awesome! Thank you for sharing @Vaibhavs10! I've updated the original PR description.

suhara · Aug 08 '24 07:08

Thank you @compilade for the comments and suggestions! Committed changes accordingly.

suhara · Aug 09 '24 09:08

Hi @Vaibhavs10 , thanks for reviewing! I rebased it onto the latest main branch.

suhara · Aug 13 '24 00:08

Hi @compilade, can you take a look and see if it looks good to you? Thank you!

suhara · Aug 14 '24 20:08

Thank you all for your reviews and support @compilade @Vaibhavs10 @ggerganov @slaren!

Could anybody help merge this PR? Thank you!

suhara · Aug 15 '24 18:08

Sorry for disturbing, but when I try to convert the linked minitron-4b model with transformers 4.44.0 and current llama.cpp, it simply complains about a missing tokenizer.model. Any idea why that could be?

schmorp · Aug 16 '24 08:08

Hi @schmorp

I think the repo has been updated and tokenizer.model (in the SentencePiece format) is no longer hosted there.

You can extract tokenizer.model from nemo/minitron-4b-base.nemo (a .nemo file is a tar archive):

$ cd minitron/nemo
$ tar -xf minitron-4b-base.nemo
$ ls
914829c706e34a92ab89d5213695f4e5_nemotron_2_256k.model
b1bc02bf987043f3884c39152f183238_nemotron_2_256k.model
minitron-4b-base.nemo
model_config.yaml
model_weights
$ cp 914829c706e34a92ab89d5213695f4e5_nemotron_2_256k.model ../tokenizer.model

$ cd ../../
$ python convert_hf_to_gguf.py minitron --outtype f16 --outfile model.gguf

There are two tokenizer files, but they are identical and either can be renamed as tokenizer.model (see the quick check below):

  • 914829c706e34a92ab89d5213695f4e5_nemotron_2_256k.model
  • b1bc02bf987043f3884c39152f183238_nemotron_2_256k.model
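If you want to verify that yourself before renaming one of them, a quick byte-for-byte comparison (a sketch; run it from the minitron/nemo directory, with filenames taken from the listing above):

import hashlib

def sha256(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

a = sha256("914829c706e34a92ab89d5213695f4e5_nemotron_2_256k.model")
b = sha256("b1bc02bf987043f3884c39152f183238_nemotron_2_256k.model")
print(a == b)  # True -> either file can serve as tokenizer.model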

suhara · Aug 17 '24 09:08

@suhara thanks a lot!

schmorp · Aug 17 '24 13:08

Minitron-8B converts, but then can't be used:

llm_load_tensors: ggml ctx size = 0.15 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q.weight' has wrong shape; expected 4096, 4096, got 4096, 6144, 1, 1
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/tmp/Minitron-8B-Base.gguf'
main : failed to init

schmorp · Aug 17 '24 14:08

Minitron-4B seems to work, so it looks like Minitron-8B is not quite supported yet.

schmorp · Aug 17 '24 14:08

I'll look into this but I think I know the root cause.

8B uses head_dim: 128 and that may be the cause. https://huggingface.co/nvidia/Minitron-8B-Base/blob/main/config.json#L25

Many HF models, including Llama, assume head_dim == hidden_size // num_attention_heads.

llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q.weight' has wrong shape; expected 4096, 4096, got 4096, 6144, 1, 1

6144 = 128 * 48, so the converted tensor is correct; it's the expected shape (4096) that is wrong.


FYI, for 4B, head_dim (128) == hidden_size (3072) // num_attention_heads (24), so this doesn't trigger the issue there.
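Spelling out the arithmetic (numbers taken from the config.json files referenced above):

# Minitron-8B: head_dim is set explicitly and differs from the derived value.
hidden_size = 4096
num_attention_heads = 48
head_dim = 128

actual_q_out = head_dim * num_attention_heads  # 6144: second dim of the converted attn_q.weight
assumed_q_out = hidden_size                    # 4096: what the loader expects when it
                                               # derives head_dim from hidden_size
print(actual_q_out, assumed_q_out)             # 6144 4096 -> check_tensor_dims fails

# Minitron-4B: 128 == 3072 // 24, so the values coincide and the model loads fine.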

suhara · Aug 17 '24 15:08

That's good news, thanks for looking into this. I'll have a try at the 340B.

schmorp · Aug 18 '24 02:08

For the 340B, conversion fails immediately because there isn't a config.json file.

schmorp · Aug 18 '24 04:08

I tried nvidia/Nemotron-4-340B-Instruct as well. It turns out that even if you add a config.json, the conversion produces a metadata-only GGUF, since all Nemotron-3 and Nemotron-4 models lack pytorch_model.bin or any safetensors files.

The only option seems to be the safetensors conversions provided by @mgoin under https://huggingface.co/collections/mgoin/nemotron-in-vllm-66a151b4240bcd9c28735ec5. Unfortunately, he never shared how he converted the .nemo checkpoints into safetensors.

nicoboss · Aug 18 '24 19:08

@nicoboss if the conversion steps and script would be useful, I can document this tomorrow!

mgoin · Aug 18 '24 19:08

> @nicoboss if the conversion steps and script would be useful, I can document this tomorrow!

This would be absolutely awesome, thanks a lot! I'm very interested in how the conversion works. Maybe it would even be possible to implement it inside convert_hf_to_gguf.py. I'm currently working with @schmorp to GGUF-quantize all Nemotron-3, Nemotron-4 and "Minitron" models. While your collection is great, it unfortunately misses many Nemotron-3 models, which we could convert on our own if you share your tools and knowledge. Nemotron-4-340B-Instruct is one of my favorite models and I can't thank you enough for converting it into a usable format.

nicoboss · Aug 18 '24 20:08

And just to document this here, Llama-3.1-Minitron-4B-Width-Base fails with:

cvs/llama.cpp/ggml/src/ggml.c:6399: GGML_ASSERT(c->ne[0] >= n_dims / 2) failed

schmorp · Aug 22 '24 09:08