
Convert adds an additional token (= token mismatch with the base model)

ai-made-approachable opened this issue 5 months ago • 1 comment

When I run mlx_lm.convert for berkeley-nest/Starling-LM-7B-alpha, the resulting MLX model suddenly has 32003 tokens instead of 32002. This causes issues if you want to train the model and later export a .gguf file via llama.cpp.

python -m mlx_lm.convert \
--hf-path berkeley-nest/Starling-LM-7B-alpha \
--mlx-path /Volumes/T9/mlx_models/starling-lm7b-alpha-8bit \
-q \
--q-group-size 64 \
--q-bits 8 \
--dtype float16

Original model's added_tokens.json (https://huggingface.co/berkeley-nest/Starling-LM-7B-alpha/blob/main/added_tokens.json)

{
  "<|end_of_turn|>": 32000,
  "<|pad_0|>": 32001
}

added_tokens.json after converting to mlx

{
  "<sep>": 32002,
  "<|end_of_turn|>": 32000,
  "<|pad_0|>": 32001
}
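The stray token can be confirmed by diffing the two added_tokens.json files. A minimal sketch (the dicts below are inlined from the files above; in practice you would `json.load` each file from disk):

```python
import json

# Contents of added_tokens.json before and after conversion
# (copied from the issue; normally loaded with json.load(open(path)))
before = {"<|end_of_turn|>": 32000, "<|pad_0|>": 32001}
after = {"<sep>": 32002, "<|end_of_turn|>": 32000, "<|pad_0|>": 32001}

# Any token present after conversion but not before was introduced by convert
extra = {tok: idx for tok, idx in after.items() if tok not in before}
print(json.dumps(extra))  # {"<sep>": 32002}
```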

ai-made-approachable avatar Mar 07 '24 06:03 ai-made-approachable

I think it has something to do with the HF tokenizer behavior. I can see that <sep> is in https://huggingface.co/berkeley-nest/Starling-LM-7B-alpha/blob/main/tokenizer_config.json#L55, but it doesn't exist in https://huggingface.co/berkeley-nest/Starling-LM-7B-alpha/blob/main/tokenizer.json. Somehow it has been added as a new special token.
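If that is the cause, the mechanism would be: a special token declared in tokenizer_config.json but missing from tokenizer.json's vocab gets appended at the next free id when the tokenizer is loaded and re-saved. A hypothetical simulation of that suspected behavior (the dicts and list below are stand-ins, not the real loading code):

```python
# Simulated tail of tokenizer.json's vocab (ids 0..31999 omitted)
vocab = {"<|end_of_turn|>": 32000, "<|pad_0|>": 32001}

# Special tokens declared in tokenizer_config.json, including the
# <sep> entry that is absent from tokenizer.json
declared_special = ["<|end_of_turn|>", "<|pad_0|>", "<sep>"]

# Suspected behavior: unknown declared tokens are appended at the next free id
for tok in declared_special:
    if tok not in vocab:
        vocab[tok] = max(vocab.values()) + 1

print(vocab["<sep>"])  # 32002 -- matching the extra token after conversion
```

This would explain why the converted model grows from 32002 to 32003 tokens even though the weights were never resized.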

mzbac avatar Mar 07 '24 07:03 mzbac