mlx-examples
Convert adds an additional token (= token mismatch with the base model)
When I run mlx_lm.convert on berkeley-nest/Starling-LM-7B-alpha, the resulting MLX model suddenly has 32003 tokens instead of 32002. This causes problems if you want to train the model and later export a .gguf file via llama.cpp, because the vocabulary no longer matches the base model.
python -m mlx_lm.convert \
--hf-path berkeley-nest/Starling-LM-7B-alpha \
--mlx-path /Volumes/T9/mlx_models/starling-lm7b-alpha-8bit \
-q \
--q-group-size 64 \
--q-bits 8 \
--dtype float16
The original model's added_tokens.json (https://huggingface.co/berkeley-nest/Starling-LM-7B-alpha/blob/main/added_tokens.json):
{
"<|end_of_turn|>": 32000,
"<|pad_0|>": 32001
}
added_tokens.json after converting to MLX:
{
"<sep>": 32002,
"<|end_of_turn|>": 32000,
"<|pad_0|>": 32001
}
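A quick way to spot the discrepancy is to diff the two added_tokens.json files directly. A minimal sketch (the directory paths and the helper name are mine, not part of mlx_lm):

```python
import json
from pathlib import Path

def extra_added_tokens(original_dir: str, converted_dir: str) -> dict:
    """Return tokens present in the converted model's added_tokens.json
    but missing from the original model's added_tokens.json."""
    def load(model_dir: str) -> dict:
        path = Path(model_dir) / "added_tokens.json"
        return json.loads(path.read_text()) if path.exists() else {}

    original = load(original_dir)
    converted = load(converted_dir)
    # Keep only tokens that the conversion introduced.
    return {tok: idx for tok, idx in converted.items() if tok not in original}
```

For the files above, this reports {"<sep>": 32002} as the token that appeared during conversion.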
I think this has something to do with the HF tokenizer behavior. I can see that <sep> is listed in https://huggingface.co/berkeley-nest/Starling-LM-7B-alpha/blob/main/tokenizer_config.json#L55, but it doesn't exist in https://huggingface.co/berkeley-nest/Starling-LM-7B-alpha/blob/main/tokenizer.json, so somehow it has been added as a new special token.
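My guess at the mechanism, as a simplified sketch rather than the actual HF implementation: if a tokenizer config declares a special token that is absent from the vocabulary, loading the tokenizer appends it at the next free id, which is exactly what would push the vocab from 32002 to 32003 here:

```python
def resolve_special_token(vocab: dict, token: str) -> dict:
    """Simplified simulation (assumption, not the real transformers code):
    a special token declared in tokenizer_config.json but missing from the
    vocab gets appended at the next free id."""
    if token not in vocab:
        vocab = dict(vocab)  # don't mutate the caller's vocab
        vocab[token] = max(vocab.values()) + 1
    return vocab

# Starling's added tokens end at 32001 ("<|pad_0|>"); a declared "<sep>"
# that is missing from tokenizer.json would then mint id 32002.
```

If that is what is happening, removing the <sep> entry from tokenizer_config.json before converting (or after, from the MLX output) should keep the vocab at 32002, though I have not confirmed this.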