
Fast tokenizer bad merging

Open Theodotus1243 opened this issue 2 years ago • 0 comments

As far as I can see here, the default tokenizer behaviour is to load with use_fast enabled.

But in the case of openchat/openchat-3.5-0106-gemma (maybe all Gemma models), the tokenizer is initialized incorrectly and the index of the <|bos|> token moves from 1 to 255999. Example here: tokenizer_config.json
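A minimal sketch of how the mismatch could be checked, assuming both the fast and the slow tokenizer can be loaded for the model (the slow one typically needs `sentencepiece` installed). The helper names `diff_token_ids` and `compare_tokenizers` are hypothetical and not part of mergekit; only `AutoTokenizer.from_pretrained(..., use_fast=...)` and `get_vocab()` come from `transformers`.

```python
from typing import Dict, Tuple


def diff_token_ids(
    fast_vocab: Dict[str, int], slow_vocab: Dict[str, int]
) -> Dict[str, Tuple[int, int]]:
    """Return tokens present in both vocabularies whose IDs differ.

    The result maps token -> (fast_id, slow_id).
    """
    return {
        tok: (fast_id, slow_vocab[tok])
        for tok, fast_id in fast_vocab.items()
        if tok in slow_vocab and slow_vocab[tok] != fast_id
    }


def compare_tokenizers(model_name: str) -> Dict[str, Tuple[int, int]]:
    """Load the fast and slow tokenizers for model_name and diff their vocabularies."""
    # Imported lazily so the pure helper above works without transformers installed.
    from transformers import AutoTokenizer

    fast = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    slow = AutoTokenizer.from_pretrained(model_name, use_fast=False)
    return diff_token_ids(fast.get_vocab(), slow.get_vocab())
```

Calling `compare_tokenizers("openchat/openchat-3.5-0106-gemma")` should list any special tokens whose index differs between the two loaders, if the behaviour described above reproduces in your environment.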

This is a problem with the openchat model, but the fast tokenizer for Llama has always been problematic.

So maybe it would be reasonable to add a use_fast = true/false flag to mergekit?

Theodotus1243, Mar 19 '24 14:03