Tokenizer fixes
More tokenizer fixes.
- [x] I have read the contributing guidelines
- Self-reported review complexity:
- [x] Low
- [ ] Medium
- [ ] High
Examples of vocab differences:
INFO VOCABFILE: './models/ggml-vocab-t5.gguf'
ERROR detokenize=False id=32000 expected='<extra_id_99>' result='[PAD32000]'
ERROR detokenize=False id=32001 expected='<extra_id_98>' result='[PAD32001]'
ERROR detokenize=False id=32002 expected='<extra_id_97>' result='[PAD32002]'
ERROR detokenize=False id=32003 expected='<extra_id_96>' result='[PAD32003]'
ERROR detokenize=False id=32004 expected='<extra_id_95>' result='[PAD32004]'
ERROR detokenize=False id=32005 expected='<extra_id_94>' result='[PAD32005]'
ERROR detokenize=False id=32006 expected='<extra_id_93>' result='[PAD32006]'
ERROR detokenize=False id=32007 expected='<extra_id_92>' result='[PAD32007]'
ERROR detokenize=False id=32008 expected='<extra_id_91>' result='[PAD32008]'
ERROR detokenize=False id=32009 expected='<extra_id_90>' result='[PAD32009]'
INFO VOCABFILE: './models/ggml-vocab-deepseek-llm.gguf'
ERROR detokenize=True id=100002 expected='�' result='ø'
ERROR detokenize=True id=100003 expected='�' result='ö'
ERROR detokenize=True id=100004 expected='�' result='ú'
ERROR detokenize=True id=100005 expected='�' result='ÿ'
ERROR detokenize=True id=100006 expected='�' result='õ'
ERROR detokenize=True id=100007 expected='�' result='÷'
ERROR detokenize=True id=100008 expected='�' result='û'
ERROR detokenize=True id=100009 expected='�' result='ý'
ERROR detokenize=True id=100010 expected='�' result='À'
ERROR detokenize=True id=100011 expected='�' result='ù'
INFO VOCABFILE: './models/ggml-vocab-command-r.gguf'
ERROR detokenize=True id=264 expected='\u200d' result='[UNK_BYTE_0xe2808d\u200d]'
ERROR detokenize=True id=265 expected='‼' result='[UNK_BYTE_0xe280bc‼]'
ERROR detokenize=True id=266 expected='⁉' result='[UNK_BYTE_0xe28189⁉]'
ERROR detokenize=True id=267 expected='⃣' result='[UNK_BYTE_0xe283a3⃣]'
ERROR detokenize=True id=268 expected='™' result='[UNK_BYTE_0xe284a2™]'
ERROR detokenize=True id=269 expected='ℹ' result='[UNK_BYTE_0xe284b9ℹ]'
ERROR detokenize=True id=270 expected='↔' result='[UNK_BYTE_0xe28694↔]'
ERROR detokenize=True id=271 expected='↕' result='[UNK_BYTE_0xe28695↕]'
ERROR detokenize=True id=272 expected='↖' result='[UNK_BYTE_0xe28696↖]'
ERROR detokenize=True id=273 expected='↗' result='[UNK_BYTE_0xe28697↗]'
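A note on reading the command-r failures above: the `[UNK_BYTE_0x...]` markers appear to embed the raw UTF-8 bytes of the character that was expected, so the byte sequences themselves agree with the reference output and only the byte-to-string mapping fails. A minimal sketch checking this (hex values copied from the log above):

```python
# Decode the hex byte sequences from the [UNK_BYTE_0x...] markers and
# confirm they are the UTF-8 encodings of the expected characters.
cases = {
    "e2808d": "\u200d",  # ZERO WIDTH JOINER (id=264)
    "e280bc": "\u203c",  # '‼' DOUBLE EXCLAMATION MARK (id=265)
    "e284a2": "\u2122",  # '™' TRADE MARK SIGN (id=268)
}
for hexbytes, expected in cases.items():
    decoded = bytes.fromhex(hexbytes).decode("utf-8")
    assert decoded == expected
```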
INFO VOCABFILE: './models/ggml-vocab-deepseek-llm.gguf'
ERROR detokenize=True id=100002 expected='�' result='ø'
ERROR detokenize=True id=100003 expected='�' result='ö'
ERROR detokenize=True id=100004 expected='�' result='ú'
ERROR detokenize=True id=100005 expected='�' result='ÿ'
ERROR detokenize=True id=100006 expected='�' result='õ'
ERROR detokenize=True id=100007 expected='�' result='÷'
ERROR detokenize=True id=100008 expected='�' result='û'
ERROR detokenize=True id=100009 expected='�' result='ý'
ERROR detokenize=True id=100010 expected='�' result='À'
ERROR detokenize=True id=100011 expected='�' result='ù'
These tokens are part of deepseek-llm's added_tokens, and `result` matches them exactly. It is unclear where `expected` gets its strings from, but it is not correct if it does not take the added_tokens into account.
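One plausible reading of the disagreement (a sketch, not verified against the GGUF contents): 'ø' (U+00F8) is also the character that GPT-2-style byte-level BPE uses as the printable alias for the raw byte 0xF8, and that byte is not valid UTF-8 on its own, so a byte-level decode that ignores the added_tokens entry would produce U+FFFD ('�'):

```python
# 'ø' doubles as the GPT-2 byte-level alias for the raw byte 0xF8.
# If id=100002 is decoded as that raw byte instead of being looked up
# in added_tokens, an invalid-UTF-8 replacement char ('�') comes out.
byte_alias = "\u00f8"                 # the literal string stored in added_tokens
raw = bytes([0xF8])                   # the byte the alias stands for in byte-level BPE
decoded = raw.decode("utf-8", errors="replace")
assert decoded == "\ufffd"            # '�' — matching the 'expected' column above
```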
INFO VOCABFILE: './models/ggml-vocab-t5.gguf'
ERROR detokenize=False id=32000 expected='<extra_id_99>' result='[PAD32000]'
ERROR detokenize=False id=32001 expected='<extra_id_98>' result='[PAD32001]'
ERROR detokenize=False id=32002 expected='<extra_id_97>' result='[PAD32002]'
...
These are also part of the added tokens (of t5), but in this case it is llama.cpp that is wrong. This does seem useful for debugging the convert script(s)!