mergekit
Fast tokenizer bad merging
As far as I can see here, the tokenizer is loaded with use_fast enabled by default.
However, for openchat/openchat-3.5-0106-gemma (and possibly all Gemma models), the fast tokenizer is initialized incorrectly and the index of the <|bos|> token moves from 1 to 255999; see the model's tokenizer_config.json for an example.
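A minimal sketch of how the index shift could be checked, assuming the Hugging Face transformers library is installed and the model files are reachable. The helper names load_vocab, diff_token_ids, and report are made up for illustration and are not part of mergekit; only tokens present in both vocabularies are compared.

```python
from typing import Dict, Tuple


def load_vocab(model_name: str, use_fast: bool) -> Dict[str, int]:
    """Load a tokenizer and return its token -> id mapping.

    Needs transformers plus network access or a local cache.
    """
    # Imported lazily so diff_token_ids can be used without transformers.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=use_fast)
    return tokenizer.get_vocab()


def diff_token_ids(
    slow_vocab: Dict[str, int], fast_vocab: Dict[str, int]
) -> Dict[str, Tuple[int, int]]:
    """Return token -> (slow_id, fast_id) for shared tokens whose ids differ."""
    return {
        token: (slow_id, fast_vocab[token])
        for token, slow_id in slow_vocab.items()
        if token in fast_vocab and fast_vocab[token] != slow_id
    }


def report(model_name: str) -> None:
    """Print every token whose id differs between the slow and fast tokenizer."""
    moved = diff_token_ids(
        load_vocab(model_name, use_fast=False),
        load_vocab(model_name, use_fast=True),
    )
    for token, (slow_id, fast_id) in sorted(moved.items(), key=lambda kv: kv[1][0]):
        print(f"{token!r}: slow={slow_id} fast={fast_id}")
```

Calling report("openchat/openchat-3.5-0106-gemma") would list any tokens whose ids differ between the two loaders; an empty listing would mean the two tokenizers agree on every shared token.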
This is a problem with the openchat model itself, but the fast tokenizer for Llama has always been problematic.
So maybe it would be reasonable to add a use_fast = true/false flag to mergekit?
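To make the request concrete, here is a rough sketch of what such a switch could look like. It is purely hypothetical: the --use-fast / --no-use-fast option and the build_parser and load_tokenizer helpers are invented for illustration and do not reflect how mergekit actually defines its options. It assumes Python 3.9+ (for argparse.BooleanOptionalAction) and the transformers library.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a hypothetical --use-fast / --no-use-fast switch."""
    parser = argparse.ArgumentParser(description="Hypothetical tokenizer options")
    parser.add_argument("model", help="model name or local path")
    parser.add_argument(
        "--use-fast",
        action=argparse.BooleanOptionalAction,  # also generates --no-use-fast
        default=True,  # keeps the current default described above
        help="load the fast tokenizer implementation",
    )
    return parser


def load_tokenizer(model: str, use_fast: bool):
    """Load the tokenizer with the requested implementation."""
    # Imported lazily so the parser can be used without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model, use_fast=use_fast)
```

With something like this, passing --no-use-fast would fall back to the slow tokenizer for models where the fast one misbehaves, while everyone else keeps the existing default.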