Mistral Nemo 12B Checkpoints
According to early reports, this new model works great. If we have the time, it would be a nice model to add, as it would fill the "multilingual" niche. (Some people have been asking about models for various non-English languages.) I'm not sure whether Gemma-2 already covers that, though.
There is no custom modeling_*.py file in the model repo, and the config.json looks pretty standard, so it might just be a matter of adding a config.
Update: there is a custom tokenizer - tekken.
So, yeah, might not be so easy 🙃.
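One quick way to sanity-check the "just add a config" idea is to pull the config and the tekken tokenizer through transformers and look at what comes back. A minimal sketch; the repo id below and the assumption that tekken ships as a regular tokenizer.json are mine, not confirmed:

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "mistralai/Mistral-Nemo-Base-2407"  # assumed repo id, double-check

# If this really is a stock Mistral architecture, the config alone tells us a lot.
config = AutoConfig.from_pretrained(model_id)
print(config.model_type, getattr(config, "architectures", None))
print(config.hidden_size, config.num_hidden_layers, config.vocab_size)

# The custom "tekken" tokenizer should still load via AutoTokenizer as long as the
# repo ships a tokenizer.json; otherwise we'd need mistral-common or a converter.
tokenizer = AutoTokenizer.from_pretrained(model_id)
print(type(tokenizer).__name__, tokenizer.vocab_size)
```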
If we want to pursue this, here are some findings from Daniel Han (I've added a few rough sketches for the individual points after the quote):
My findings for Mistral NeMo 12b:
- EOS token is untrained in base - a bug?
- EOS token is auto appended
- 4096, not 5120 for Wq
- Not Llama Tokenizer
- Tools, FIM
- Pad_token=10
- 1M max RoPE pos: new dynamic RoPE in 🦥 @UnslothAI saves 20GB VRAM
Longer notes:
- EOS token is untrained in the base model but trained in instruct - confirming with @MistralAI if this is a feature or a bug - could make finetunes break with NaNs and infinities. Mistral 7b does not have this issue. Only the embed_tokens, not the lm_head, has this issue.
- EOS token is auto appended. This can break finetuning and inference - collabed with @xenovacom to fix this quickly :)
- Not 5120 for Wq but 4096 - HF transformers main branch already has a fix for this - please update transformers! Unsloth auto patches, so no need to update!
- Not a Llama Tokenizer - was GPT2 Tokenizer, now generic PreTrainedTokenizer? Very interesting! Tokenizer compresses other languages more efficiently.
- Support for tools & FIM (fill-in-the-middle tasks): function calling, code completion, etc.
- Pad_token=10. A dedicated pad token - yay! Finetuning can break less with fewer infinite outputs :)
- 1 million possible position embeddings - had to support dynamic sizing of Cos & Sin cached matrices to not go OOM (used 20GB!)
More details in our blog: https://unsloth.ai/blog/mistral-nemo
Our free Colab notebook can finetune the 12b model on a free 16GB Tesla T4 GPU (it fits exactly), 2x faster and with 60% less VRAM than HF+FA2! https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing
We also have a Kaggle notebook making finetuning 2x faster: https://kaggle.com/code/danielhanchen/kaggle-mistral-nemo-12b-unsloth-notebook
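On the two EOS-related notes above: if we do add Nemo, it's probably worth reproducing the checks ourselves. A rough sketch, assuming the repo id below and that the checkpoint loads as a standard causal LM in bfloat16 (roughly 24GB of RAM/VRAM):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-Nemo-Base-2407"  # assumed repo id, double-check
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 1) Is EOS silently appended to every encoded sequence?
ids = tokenizer("hello world")["input_ids"]
print("EOS auto-appended:", ids[-1] == tokenizer.eos_token_id)
print("pad token:", tokenizer.pad_token, tokenizer.pad_token_id)

# 2) Does the EOS row of embed_tokens look untrained in the base model?
#    Crude signal: compare its norm against the average row norm; an untrained row
#    tends to stay near its random init while trained rows drift noticeably.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
embed = model.get_input_embeddings().weight
eos_norm = embed[tokenizer.eos_token_id].float().norm().item()
mean_norm = embed.float().norm(dim=-1).mean().item()
print(f"EOS row norm {eos_norm:.4f} vs mean row norm {mean_norm:.4f}")
```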
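On the 4096-vs-5120 point: Nemo sets head_dim explicitly instead of deriving it as hidden_size // num_attention_heads, so the q/k/v projections are sized from num_heads * head_dim rather than from hidden_size (which is why older transformers versions got the Wq shape wrong). The arithmetic, with numbers that should be cross-checked against the shipped config.json:

```python
# Assumed Mistral-Nemo config values; verify against config.json before relying on them.
hidden_size = 5120
num_attention_heads = 32
num_key_value_heads = 8
head_dim = 128  # set explicitly; NOT hidden_size // num_attention_heads (which would be 160)

q_out = num_attention_heads * head_dim   # 32 * 128 = 4096 -> Wq maps 5120 -> 4096
kv_out = num_key_value_heads * head_dim  # 8 * 128 = 1024  -> Wk/Wv map 5120 -> 1024
o_in = q_out                             # Wo maps 4096 back to 5120
print(q_out, kv_out, o_in)
```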
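On the 1M-position note: precomputing cos/sin for every possible position up front is what eats the memory; growing the cache lazily to the longest sequence actually seen avoids it. A minimal sketch of that idea (not Unsloth's actual implementation; the head_dim and rope base values are assumed):

```python
import torch

class LazyRotaryCache(torch.nn.Module):
    """Grow the RoPE cos/sin cache on demand instead of allocating all
    max_position_embeddings rows up front."""

    def __init__(self, head_dim: int = 128, base: float = 1_000_000.0):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)
        self.cached_len = 0
        self.cos_cached = None
        self.sin_cached = None

    def forward(self, seq_len: int):
        # Only (re)build the cache when a longer sequence shows up.
        if seq_len > self.cached_len:
            t = torch.arange(seq_len, dtype=torch.float32, device=self.inv_freq.device)
            freqs = torch.outer(t, self.inv_freq)    # (seq_len, head_dim // 2)
            emb = torch.cat((freqs, freqs), dim=-1)  # (seq_len, head_dim)
            self.cos_cached, self.sin_cached = emb.cos(), emb.sin()
            self.cached_len = seq_len
        return self.cos_cached[:seq_len], self.sin_cached[:seq_len]

rope = LazyRotaryCache()
cos, sin = rope(4096)  # allocates 4096 rows, not ~1,000,000
print(cos.shape, sin.shape)
```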