
GGUF breaks - llama-3

Open · danielhanchen opened this issue 9 months ago · 2 comments

Findings from https://github.com/ggerganov/llama.cpp/issues/7062 and Discord chats. Reproducible notebook: https://colab.research.google.com/drive/1djwQGbEJtUEZo_OuqzN_JF6xSOUKhm4q?usp=sharing

  1. Unsloth + float16 + QLoRA = WORKS
  2. Unsloth + bfloat16 + QLoRA = WORKS
  3. Unsloth + bfloat16 + LoRA = WORKS
  4. Unsloth + float16 + QLoRA + GGUF-f16 = FAILS
  5. Unsloth + bfloat16 + LoRA + GGUF-f16 = FAILS
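
For context, a minimal sketch of how a failing configuration like no. 5 (bfloat16 + LoRA + GGUF-f16) is exercised; the model id and hyperparameters here are illustrative assumptions, not the exact repro (see the notebook above for that):

```python
import torch
from unsloth import FastLanguageModel

# Assumed model id; the repro notebook pins the exact checkpoint.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b",
    max_seq_length=2048,
    dtype=torch.bfloat16,
    load_in_4bit=False,  # plain LoRA, not QLoRA
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# ... fine-tune as usual ...

# Exporting to GGUF at f16 is where the breakage shows up.
model.save_pretrained_gguf("model", tokenizer, quantization_method="f16")
```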

Todo:

  • [ ] HF directly + float16 + QLoRA + GGUF-f16
  • [x] HF directly + float16 + LoRA + GGUF-f16

danielhanchen · May 05 '24

Update: I managed to test the HF -> llama.cpp conversion without Unsloth, to take Unsloth out of the picture.

  1. `\n\n` is tokenized as `[1734, 1734]`, unless I prompted it incorrectly.
  2. Decoding `[1734]` with `tokenizer.batch_decode([1734])` returns `\\n`.
  3. I.e., llama.cpp is tokenizing `\n\n` as `\\n\\n`.
  4. Using HF directly, newline runs tokenize as:

| String | HF token id |
| --- | --- |
| `\\n` | 1734 |
| `\n` | 198 |
| `\n\n` | 271 |
| `\n\n\n` | 1432 |
| 4 × `\n` | 1038 |
| 5 × `\n` | 14963 |
| 6 × `\n` | 5244 |
| 7 × `\n` | 35683 |
| 8 × `\n` | 6087 |
| 9 × `\n` | 55160 |
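
As a side note, the HF numbers above can be double-checked in a few lines; a sketch assuming a Llama-3 tokenizer is available (the hub id is an assumption, any checkpoint with the same tokenizer gives the same ids):

```python
from transformers import AutoTokenizer

# Assumed hub id; gated, so a local copy of the tokenizer also works.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Runs of 1..9 newlines should map to the single token ids listed above.
for n in range(1, 10):
    ids = tok("\n" * n, add_special_tokens=False).input_ids
    print(f"{n} x newline -> {ids}")

# The token llama.cpp emits; should print a literal backslash-n.
print(repr(tok.decode([1734])))
```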

I used:

```
!python llama.cpp/convert-hf-to-gguf.py ./model --outfile ./model.f16.gguf --outtype f16
```

then ran:

```
!./llama.cpp/main -m ./model.f16.gguf -n 1024 --temp 0.0 --verbose-prompt --check-tensors \
  -p "<|start_header_id|>user<|end_header_id|>\n\n!!llama.cpp!!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
```

See reproducible notebook: https://colab.research.google.com/drive/1aNS8CgXoJZHclBEW3ZjFfiLjpmqZ14KN?usp=sharing

Below is the comparison of tokenization differences between llama.cpp and HF:

[image: llama.cpp vs. HF tokenization comparison]
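
The same comparison can also be scripted rather than eyeballed; a sketch assuming the llama-cpp-python bindings (`pip install llama-cpp-python`) and the converted GGUF from above:

```python
from llama_cpp import Llama
from transformers import AutoTokenizer

# vocab_only=True loads just the tokenizer from the GGUF, skipping the weights.
llm = Llama(model_path="./model.f16.gguf", vocab_only=True)
hf = AutoTokenizer.from_pretrained("./model")  # the HF checkpoint that was converted

# Compare token ids for plain newlines vs. a literal backslash-n.
for s in ["\n", "\n\n", "\\n"]:
    gguf_ids = llm.tokenize(s.encode("utf-8"), add_bos=False)
    hf_ids = hf(s, add_special_tokens=False).input_ids
    print(repr(s), "llama.cpp:", gguf_ids, "HF:", hf_ids)
```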

I also used `convert.py` with `--vocab-type bpe`, which I assume is perhaps not supposed to work anyway. Reproducible example: https://colab.research.google.com/drive/1X8XBdLRf1-eRDSfcr_GrIhaf84Wp9FH1?usp=sharing

Sadly `convert.py` is even worse, splitting the newlines into 2 distinct characters:

[image: convert.py tokenization output, newlines split into separate characters]

danielhanchen · May 06 '24

Thanks for looking into this. I've been suspicious of these \n's in llama.cpp ever since I noticed that when I added \n\n to Llama 3's prompt, the continuation would usually add a third one at the start of the reply for no obvious reason. What you're finding is probably the cause of that.

araleza · May 06 '24

It should be fixed!

danielhanchen · May 10 '24