
GGUF breaks - llama-3

Open · danielhanchen opened this issue 9 months ago · 2 comments

Findings from https://github.com/ggerganov/llama.cpp/issues/7062 and Discord chats. Reproducible notebook: https://colab.research.google.com/drive/1djwQGbEJtUEZo_OuqzN_JF6xSOUKhm4q?usp=sharing

  1. Unsloth + float16 + QLoRA = WORKS
  2. Unsloth + bfloat16 + QLoRA = WORKS
  3. Unsloth + bfloat16 + LoRA = WORKS
  4. Unsloth + float16 + QLoRA + GGUF-f16 = FAILS
  5. Unsloth + bfloat16 + LoRA + GGUF-f16 = FAILS
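
For context, a minimal sketch of how a failing configuration like no. 5 (bfloat16 + LoRA + GGUF-f16) is exercised; the model id and hyperparameters here are illustrative assumptions, not the exact repro (see the notebook above for that):

```python
import torch
from unsloth import FastLanguageModel

# Assumed model id; the repro notebook pins the exact checkpoint.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b",
    max_seq_length=2048,
    dtype=torch.bfloat16,
    load_in_4bit=False,  # plain LoRA, not QLoRA
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# ... fine-tune as usual ...

# Exporting to GGUF at f16 is where the breakage shows up.
model.save_pretrained_gguf("model", tokenizer, quantization_method="f16")
```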

Todo:

  • [ ] HF directly + float16 + QLoRA + GGUF-f16
  • [x] HF directly + float16 + LoRA + GGUF-f16

danielhanchen · May 05 '24

Update: I managed to test the HF -> llama.cpp conversion without Unsloth, to take Unsloth out of the picture.

  1. `\n\n` is tokenized as `[1734, 1734]`, unless I prompted it incorrectly.
  2. Decoding `[1734]` with `tokenizer.batch_decode([1734])` returns `\\n`.
  3. I.e., llama.cpp is tokenizing `\n\n` as `\\n\\n`.
  4. Using HF directly, newline runs tokenize as:

| String | HF token id |
| --- | --- |
| `\\n` | 1734 |
| `\n` | 198 |
| `\n\n` | 271 |
| `\n\n\n` | 1432 |
| 4 × `\n` | 1038 |
| 5 × `\n` | 14963 |
| 6 × `\n` | 5244 |
| 7 × `\n` | 35683 |
| 8 × `\n` | 6087 |
| 9 × `\n` | 55160 |
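
As a side note, the HF numbers above can be double-checked in a few lines; a sketch assuming a Llama-3 tokenizer is available (the hub id is an assumption, any checkpoint with the same tokenizer gives the same ids):

```python
from transformers import AutoTokenizer

# Assumed hub id; gated, so a local copy of the tokenizer also works.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Runs of 1..9 newlines should map to the single token ids listed above.
for n in range(1, 10):
    ids = tok("\n" * n, add_special_tokens=False).input_ids
    print(f"{n} x newline -> {ids}")

# The token llama.cpp emits; should print a literal backslash-n.
print(repr(tok.decode([1734])))
```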

I used:

```
!python llama.cpp/convert-hf-to-gguf.py ./model --outfile ./model.f16.gguf --outtype f16
```

then ran:

```
!./llama.cpp/main -m ./model.f16.gguf -n 1024 --temp 0.0 --verbose-prompt --check-tensors \
  -p "<|start_header_id|>user<|end_header_id|>\n\n!!llama.cpp!!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
```

See reproducible notebook: https://colab.research.google.com/drive/1aNS8CgXoJZHclBEW3ZjFfiLjpmqZ14KN?usp=sharing

Below is the comparison of tokenization differences between llama.cpp and HF:

[image: llama.cpp vs. HF tokenization comparison]
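
The same comparison can also be scripted rather than eyeballed; a sketch assuming the llama-cpp-python bindings (`pip install llama-cpp-python`) and the converted GGUF from above:

```python
from llama_cpp import Llama
from transformers import AutoTokenizer

# vocab_only=True loads just the tokenizer from the GGUF, skipping the weights.
llm = Llama(model_path="./model.f16.gguf", vocab_only=True)
hf = AutoTokenizer.from_pretrained("./model")  # the HF checkpoint that was converted

# Compare token ids for plain newlines vs. a literal backslash-n.
for s in ["\n", "\n\n", "\\n"]:
    gguf_ids = llm.tokenize(s.encode("utf-8"), add_bos=False)
    hf_ids = hf(s, add_special_tokens=False).input_ids
    print(repr(s), "llama.cpp:", gguf_ids, "HF:", hf_ids)
```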

I also used `convert.py` with `--vocab-type bpe`, which I assume is perhaps not supposed to work anyway. Reproducible example: https://colab.research.google.com/drive/1X8XBdLRf1-eRDSfcr_GrIhaf84Wp9FH1?usp=sharing

Sadly `convert.py` is even worse, splitting the newlines into 2 distinct characters:

[image: convert.py tokenization output, newlines split into separate characters]

danielhanchen · May 06 '24

Thanks for looking into this. I've been suspicious of these \n's in llama.cpp ever since I noticed that when I added \n\n to Llama 3's prompt, the continuation would usually add a third one at the start of the reply for no obvious reason. What you're finding is probably the cause of that.

araleza · May 06 '24

It should be fixed!

danielhanchen · May 10 '24