Error converting fine-tuned Llama2 7B model: Exception: Vocab size mismatch (model has 32000, but ../jarvis-hf/tokenizer.model has 32001).

FotieMConstant opened this issue 11 months ago • 4 comments

Hi everyone, I have been stuck for days on an issue using llama.cpp to convert a fine-tuned model and then quantize it; I am blocked at the conversion phase. When I use the command:

python llama.cpp/convert.py ../jarvis-hf --outtype f16 --outfile converted.bin

Here is the error I get:

Writing converted.bin, format 1
Traceback (most recent call last):
  File "/Users/🤓/jarvis/ollama/llm/llama.cpp/convert.py", line 1466, in <module>
    main()
  File "/Users/🤓/jarvis/ollama/llm/llama.cpp/convert.py", line 1460, in main
    OutputFile.write_all(outfile, ftype, params, model, vocab, special_vocab,
  File "/Users/🤓/jarvis/ollama/llm/llama.cpp/convert.py", line 1117, in write_all
    check_vocab_size(params, vocab, pad_vocab=pad_vocab)
  File "/Users/🤓/jarvis/ollama/llm/llama.cpp/convert.py", line 963, in check_vocab_size
    raise Exception(msg)
Exception: Vocab size mismatch (model has 32000, but ../jarvis-hf/tokenizer.model has 32001).

Now, I am new to this whole fine-tuning thing and I am a little lost as to what the issue might be here :( I will add my Jupyter notebook code below and a working version of the model as well; the model is on Hugging Face.

Fine-tuning code: https://colab.research.google.com/drive/1FTt_Z1eGOsl2VgPVb8pnM4yUTczhSutM?usp=sharing
Working model: https://colab.research.google.com/drive/19ZuropXXc2_jMC_qxqa8MO4mHHxOqxxe?usp=sharing
Model on Hugging Face: https://huggingface.co/fotiecodes/Llama-2-7b-chat-jarvis
Original pre-trained model: https://huggingface.co/NousResearch/Llama-2-7b-chat-hf

To reproduce the issue: download Llama-2-7b-chat-jarvis from Hugging Face and try to convert it with convert.py from llama.cpp.

A few things to note: when I print the get_vocab size it gives me 32001, so I am not sure why the conversion isn't working.
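For reference, here is roughly how I checked the two sizes on my side (a minimal sketch using the standard transformers API; "../jarvis-hf" is just my local copy of the checkpoint):

from transformers import AutoTokenizer, AutoModelForCausalLM

path = "../jarvis-hf"  # local folder with the fine-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path)  # heavy, but shows the real tensor shapes

# Tokenizer side: reports 32001 here (32000 base tokens plus the added <pad>).
print("tokenizer vocab:", len(tokenizer))

# Model side: the row count of the embedding matrix is what convert.py compares
# against, and it is still 32000 in this checkpoint.
print("embedding rows:", model.get_input_embeddings().weight.shape[0])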

OS: macOS Sonoma, version 14.4, on an Apple M1 chip. llama.cpp: latest.

FotieMConstant avatar Mar 17 '24 10:03 FotieMConstant

Looks like a broken model to me. Blame the author.

I could get a working result with --vocab-type hfft and the patch below. No guarantees though.

diff --git a/added_tokens.json b/added_tokens.json
index 9c16aa4..0db3279 100644
--- a/added_tokens.json
+++ b/added_tokens.json
@@ -1,3 +1,2 @@
 {
-  "<pad>": 32000
 }
diff --git a/tokenizer.json b/tokenizer.json
index ab74d1c..4afc6a4 100644
--- a/tokenizer.json
+++ b/tokenizer.json
@@ -29,15 +29,6 @@
       "rstrip": false,
       "normalized": true,
       "special": true
-    },
-    {
-      "id": 32000,
-      "content": "<pad>",
-      "single_word": false,
-      "lstrip": false,
-      "rstrip": false,
-      "normalized": true,
-      "special": false
     }
   ],
   "normalizer": {

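For completeness, the full conversion command then looks something like the one from the first post, with the vocab type forced to the HF fast tokenizer (adjust paths to your setup):

python llama.cpp/convert.py ../jarvis-hf --outtype f16 --outfile converted.bin --vocab-type hfft
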
Artefact2 avatar Mar 17 '24 12:03 Artefact2

Hey @Artefact2, thanks for the heads-up, I'll try that. However, to be safer, do you think it's better to get and use the official base model from Meta?

FotieMConstant avatar Mar 17 '24 14:03 FotieMConstant

I'm back with some feedback, @Artefact2. I just tried it and it works like a charm, thanks. However, do you think it would be better to request access to the original model from Meta? Could that give a better result?

FotieMConstant avatar Mar 17 '24 15:03 FotieMConstant

So, other than modifying the tokenizer.json file, is there another way to fix this? I am working with ChatMusician (based on Llama 2 7B) and I'm seeing the exact same error...
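One alternative I'm tempted to try is fixing it on the model side instead: re-saving the checkpoint with the embeddings resized to match the tokenizer before running convert.py. An untested sketch of what I mean, using the standard transformers API (the paths are placeholders):

from transformers import AutoTokenizer, AutoModelForCausalLM

src = "path/to/fine-tuned-model"  # placeholder: the problematic checkpoint
dst = "path/to/fixed-model"       # placeholder: where to save the re-exported copy

tokenizer = AutoTokenizer.from_pretrained(src)
model = AutoModelForCausalLM.from_pretrained(src)

# Resize the input (and output) embeddings so the model's vocab size matches
# the tokenizer's, which is the check that convert.py is failing on.
model.resize_token_embeddings(len(tokenizer))

model.save_pretrained(dst)
tokenizer.save_pretrained(dst)

The rows added for the extra token would be freshly initialized rather than trained, so this only makes the shapes consistent; whether that is preferable to just dropping the pad token from the tokenizer, I don't know.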

petergreis avatar Apr 22 '24 15:04 petergreis

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Jun 06 '24 01:06 github-actions[bot]