Mistral Nemo 12B Checkpoints
According to early reports, this new model works great. If we have the time, it would be a nice model to add, as it would fill the "multilingual" niche. (Some people have been asking about models for various non-English languages.) I'm not sure whether Gemma-2 already covers that, though.
There is no custom modeling_*.py file in the model repo, and the config.json looks pretty standard, so it might just be a matter of adding a config.
Update: there is a custom tokenizer - tekken.
So, yeah, might not be so easy 🙃.
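One quick way to sanity-check the "just add a config" idea is to pull the config and the tekken tokenizer through transformers and look at what comes back. A minimal sketch; the repo id below and the assumption that tekken ships as a regular tokenizer.json are mine, not confirmed:

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "mistralai/Mistral-Nemo-Base-2407"  # assumed repo id, double-check

# If this really is a stock Mistral architecture, the config alone tells us a lot.
config = AutoConfig.from_pretrained(model_id)
print(config.model_type, getattr(config, "architectures", None))
print(config.hidden_size, config.num_hidden_layers, config.vocab_size)

# The custom "tekken" tokenizer should still load via AutoTokenizer as long as the
# repo ships a tokenizer.json; otherwise we'd need mistral-common or a converter.
tokenizer = AutoTokenizer.from_pretrained(model_id)
print(type(tokenizer).__name__, tokenizer.vocab_size)
```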
If we want to pursue this, here are some findings from Daniel Han (I've added a few rough sketches for the individual points after the quote):
My findings for Mistral NeMo 12b:
- EOS token is untrained in base - a bug?
- EOS token is auto appended
- 4096, not 5120 for Wq
- Not Llama Tokenizer
- Tools, FIM
- Pad_token=10
- 1M max RoPE pos: new dynamic RoPE in 🦥 @UnslothAI saves 20GB VRAM
Longer notes:
- EOS token is untrained in the base model but trained in instruct - confirming with @MistralAI if this is a feature or a bug - could make finetunes break with NaNs and infinities. Mistral 7b does not have this issue. Only the embed_tokens, not the lm_head, has this issue.
- EOS token is auto appended. This can break finetuning and inference - collabed with @xenovacom to fix this quickly :)
- Not 5120 for Wq but 4096 - HF transformers main branch already has a fix for this - please update transformers! Unsloth auto patches, so no need to update!
- Not a Llama Tokenizer - was GPT2 Tokenizer, now generic PreTrainedTokenizer? Very interesting! Tokenizer compresses other languages more efficiently.
- Support for tools & FIM (fill-in-the-middle tasks): function calling, code completion, etc.
- Pad_token=10. A dedicated pad token - yay! Finetuning can break less with fewer infinite outputs :)
- 1 million possible position embeddings - had to support dynamic sizing of Cos & Sin cached matrices to not go OOM (used 20GB!)
More details in our blog: https://unsloth.ai/blog/mistral-nemo
Our free Colab notebook can finetune the 12b model on a free 16GB Tesla T4 GPU (it fits exactly), 2x faster and with 60% less VRAM than HF+FA2! https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing
We also have a Kaggle notebook making finetuning 2x faster: https://kaggle.com/code/danielhanchen/kaggle-mistral-nemo-12b-unsloth-notebook
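On the two EOS-related notes above: if we do add Nemo, it's probably worth reproducing the checks ourselves. A rough sketch, assuming the repo id below and that the checkpoint loads as a standard causal LM in bfloat16 (roughly 24GB of RAM/VRAM):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-Nemo-Base-2407"  # assumed repo id, double-check
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 1) Is EOS silently appended to every encoded sequence?
ids = tokenizer("hello world")["input_ids"]
print("EOS auto-appended:", ids[-1] == tokenizer.eos_token_id)
print("pad token:", tokenizer.pad_token, tokenizer.pad_token_id)

# 2) Does the EOS row of embed_tokens look untrained in the base model?
#    Crude signal: compare its norm against the average row norm; an untrained row
#    tends to stay near its random init while trained rows drift noticeably.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
embed = model.get_input_embeddings().weight
eos_norm = embed[tokenizer.eos_token_id].float().norm().item()
mean_norm = embed.float().norm(dim=-1).mean().item()
print(f"EOS row norm {eos_norm:.4f} vs mean row norm {mean_norm:.4f}")
```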
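On the 4096-vs-5120 point: Nemo sets head_dim explicitly instead of deriving it as hidden_size // num_attention_heads, so the q/k/v projections are sized from num_heads * head_dim rather than from hidden_size (which is why older transformers versions got the Wq shape wrong). The arithmetic, with numbers that should be cross-checked against the shipped config.json:

```python
# Assumed Mistral-Nemo config values; verify against config.json before relying on them.
hidden_size = 5120
num_attention_heads = 32
num_key_value_heads = 8
head_dim = 128  # set explicitly; NOT hidden_size // num_attention_heads (which would be 160)

q_out = num_attention_heads * head_dim   # 32 * 128 = 4096 -> Wq maps 5120 -> 4096
kv_out = num_key_value_heads * head_dim  # 8 * 128 = 1024  -> Wk/Wv map 5120 -> 1024
o_in = q_out                             # Wo maps 4096 back to 5120
print(q_out, kv_out, o_in)
```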
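On the 1M-position note: precomputing cos/sin for every possible position up front is what eats the memory; growing the cache lazily to the longest sequence actually seen avoids it. A minimal sketch of that idea (not Unsloth's actual implementation; the head_dim and rope base values are assumed):

```python
import torch

class LazyRotaryCache(torch.nn.Module):
    """Grow the RoPE cos/sin cache on demand instead of allocating all
    max_position_embeddings rows up front."""

    def __init__(self, head_dim: int = 128, base: float = 1_000_000.0):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)
        self.cached_len = 0
        self.cos_cached = None
        self.sin_cached = None

    def forward(self, seq_len: int):
        # Only (re)build the cache when a longer sequence shows up.
        if seq_len > self.cached_len:
            t = torch.arange(seq_len, dtype=torch.float32, device=self.inv_freq.device)
            freqs = torch.outer(t, self.inv_freq)    # (seq_len, head_dim // 2)
            emb = torch.cat((freqs, freqs), dim=-1)  # (seq_len, head_dim)
            self.cos_cached, self.sin_cached = emb.cos(), emb.sin()
            self.cached_len = seq_len
        return self.cos_cached[:seq_len], self.sin_cached[:seq_len]

rope = LazyRotaryCache()
cos, sin = rope(4096)  # allocates 4096 rows, not ~1,000,000
print(cos.shape, sin.shape)
```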