[Potential Bug] Mistral Tokenizer Inconsistencies
I have downloaded the Mistral 7B tokenizer locally and tried to compare different combinations of the legacy and use_fast options:
from transformers import AutoTokenizer

# All four combinations of legacy and use_fast, loaded from the local directory
tokenizer = AutoTokenizer.from_pretrained('.', legacy=False, use_fast=False)
tokenizerl = AutoTokenizer.from_pretrained('.', legacy=True, use_fast=False)
tokenizerf = AutoTokenizer.from_pretrained('.', legacy=False, use_fast=True)
tokenizerlf = AutoTokenizer.from_pretrained('.', legacy=True, use_fast=True)

s = "test<unk> This is a test phrase</s>"
print(
    f'Regular:\t{tokenizer.tokenize(s)}\n\t\t{tokenizer.decode(tokenizer.encode(s))}'
)
print(
    f'Legacy:\t\t{tokenizerl.tokenize(s)}\n\t\t{tokenizerl.decode(tokenizerl.encode(s))}'
)
print(
    f'Fast:\t\t{tokenizerf.tokenize(s)}\n\t\t{tokenizerf.decode(tokenizerf.encode(s))}'
)
print(
    f'Legacy Fast:\t{tokenizerlf.tokenize(s)}\n\t\t{tokenizerlf.decode(tokenizerlf.encode(s))}'
)
Which yields:
Regular: ['▁test', '<unk>', '▁This', '▁is', '▁a', '▁test', '▁phrase', '</s>']
<s> test<unk> This is a test phrase</s>
Legacy: ['▁test', '<unk>', '▁', '▁This', '▁is', '▁a', '▁test', '▁phrase', '</s>']
<s> test<unk> This is a test phrase</s>
Fast: ['▁test', '<unk>', '▁', '▁This', '▁is', '▁a', '▁test', '▁phrase', '</s>']
<s> test<unk> This is a test phrase</s>
Legacy Fast: ['▁test', '<unk>', '▁', '▁This', '▁is', '▁a', '▁test', '▁phrase', '</s>']
<s> test<unk> This is a test phrase</s>
You can find the full code here.
There seem to be inconsistencies in how legacy=False, use_fast=False tokenizes input compared to the other options. If either option is set to True, an extra space ('▁') is added after tokens like <unk> or other special tokens. It seems to me that only legacy=False, use_fast=False tokenizes this input correctly.
We have a production app that extends Mistral with additional special tokens besides <unk>, and extra spaces are added after those too. So for now we have switched over to legacy=False, use_fast=False, forgoing the speed advantages of the Rust implementation.
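For illustration, here is a minimal sketch of how the extra space shows up with an added special token ("<sep>" is just a hypothetical stand-in, not one of our real tokens):

from transformers import AutoTokenizer

# Hypothetical repro: "<sep>" stands in for one of our custom special tokens
tok = AutoTokenizer.from_pretrained('.', legacy=True, use_fast=True)
tok.add_special_tokens({"additional_special_tokens": ["<sep>"]})

# With the bug, the text after the special token picks up an extra '▁' prefix,
# e.g. ['▁foo', '<sep>', '▁bar'] instead of ['▁foo', '<sep>', 'bar']
print(tok.tokenize("foo<sep>bar"))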
Would appreciate any insight into what we are missing! And thank you for the enormous amount of work you have put into this library 🙏
Hey! Thanks, a fix can be derived from #1357 and https://github.com/huggingface/transformers/pull/26678.
Everything you describe is mentioned there. TL;DR: using Metaspace with prepend_scheme="first" and no normalizer will be the end of your problems.
I have not had time to change the default Llama fast tokenizer yet, but will try to do so ASAP.
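In the meantime, a minimal sketch of that workaround on the fast tokenizer (assuming tokenizers >= 0.14, where Metaspace gained the prepend_scheme argument; an empty normalizers.Sequence stands in for "no normalizer"):

from transformers import AutoTokenizer
from tokenizers import normalizers, pre_tokenizers

tok = AutoTokenizer.from_pretrained('.', use_fast=True)

# prepend_scheme="first" prefixes '▁' only to the very first piece of the input,
# not to every piece that follows a special token
tok.backend_tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(
    replacement="▁", prepend_scheme="first"
)
# Drop the normalizer (empty Sequence = no-op normalization)
tok.backend_tokenizer.normalizer = normalizers.Sequence([])

s = "test<unk> This is a test phrase</s>"
print(tok.tokenize(s))  # should now match the legacy=False, use_fast=False output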
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
I think this is still relevant
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This was fixed in transformers; you just need to set legacy=False 🤗
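For anyone landing here later, a quick sanity check (assuming your installed transformers includes the fix; the hub model id below is an assumption, the original report used a local directory):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-v0.1", legacy=False, use_fast=True
)
print(tok.tokenize("test<unk> This is a test phrase</s>"))
# Expected with the fix (no stray '▁' after '<unk>'):
# ['▁test', '<unk>', '▁This', '▁is', '▁a', '▁test', '▁phrase', '</s>']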