
[Potential Bug] Mistral Tokenizer Inconsistencies

Open: komninoschatzipapas opened this issue 1 year ago

I have downloaded the Mistral 7B tokenizer locally and tried to compare different combinations of the legacy and use_fast options:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('.', legacy=False, use_fast=False)
tokenizerl = AutoTokenizer.from_pretrained('.', legacy=True, use_fast=False)
tokenizerf = AutoTokenizer.from_pretrained('.', legacy=False, use_fast=True)
tokenizerlf = AutoTokenizer.from_pretrained('.', legacy=True, use_fast=True)

s = "test<unk> This is a test phrase</s>"

print(
  f'Regular:\t{tokenizer.tokenize(s)}\n\t\t{tokenizer.decode(tokenizer.encode(s))}'
)

print(
  f'Legacy:\t\t{tokenizerl.tokenize(s)}\n\t\t{tokenizerl.decode(tokenizerl.encode(s))}'
)

print(
  f'Fast:\t\t{tokenizerf.tokenize(s)}\n\t\t{tokenizerf.decode(tokenizerf.encode(s))}'
)

print(
  f'Legacy Fast:\t{tokenizerlf.tokenize(s)}\n\t\t{tokenizerlf.decode(tokenizerlf.encode(s))}'
)

Which yields:

Regular:        ['▁test', '<unk>', '▁This', '▁is', '▁a', '▁test', '▁phrase', '</s>']
                <s> test<unk> This is a test phrase</s>
Legacy:         ['▁test', '<unk>', '▁', '▁This', '▁is', '▁a', '▁test', '▁phrase', '</s>']
                <s> test<unk>  This is a test phrase</s>
Fast:           ['▁test', '<unk>', '▁', '▁This', '▁is', '▁a', '▁test', '▁phrase', '</s>']
                <s> test<unk>  This is a test phrase</s>
Legacy Fast:    ['▁test', '<unk>', '▁', '▁This', '▁is', '▁a', '▁test', '▁phrase', '</s>']

You can find the full code here.

There seem to be inconsistencies with how legacy=False, use_fast=False tokenizes input compared to the other options.

If either option is set to True, an extra space is added after special tokens like <unk>.

It seems to me that only legacy=False, use_fast=False tokenizes this input correctly.
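To make the mechanism concrete, here is a toy, stdlib-only sketch (not the real Hugging Face implementation) of how prepending a space to every text segment around special tokens produces the stray '▁' token seen in the legacy/fast outputs:

```python
import re

SPECIALS = ("<unk>", "</s>")

def sp_split(segment):
    # SentencePiece-style marking: each space becomes "▁" glued to the
    # following word; a run of two spaces yields a bare "▁" token.
    pieces = segment.split(" ")
    tokens = [pieces[0]] if pieces[0] else []
    tokens += ["▁" + p for p in pieces[1:]]
    return tokens

def toy_tokenize(text, legacy):
    # split around special tokens, keeping them (capturing group in re.split)
    pattern = "(" + "|".join(re.escape(s) for s in SPECIALS) + ")"
    segments = [p for p in re.split(pattern, text) if p]
    tokens = []
    for i, seg in enumerate(segments):
        if seg in SPECIALS:
            tokens.append(seg)
        elif legacy or i == 0:
            # legacy-style behaviour: prepend a space to *every* text
            # segment, even one that already starts with a space --
            # the double space becomes a standalone "▁" token
            tokens += sp_split(" " + seg)
        else:
            tokens += sp_split(seg)
    return tokens

s = "test<unk> This is a test phrase</s>"
print(toy_tokenize(s, legacy=False))
# ['▁test', '<unk>', '▁This', '▁is', '▁a', '▁test', '▁phrase', '</s>']
print(toy_tokenize(s, legacy=True))
# ['▁test', '<unk>', '▁', '▁This', '▁is', '▁a', '▁test', '▁phrase', '</s>']
```

The toy reproduces both shapes reported above: the fixed path only prepends a space at the very start of the string, while the legacy path prepends one after every special token as well.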

We have a production app that extends Mistral with other special tokens besides <unk>, and extra spaces are added after those too.

So right now, we have switched over to legacy=False, use_fast=False, forgoing the speed advantages of the Rust implementation.

Would appreciate any insight into what we are missing! And thank you for the enormous amount of work you guys have put into this library 🙏

komninoschatzipapas avatar Feb 05 '24 20:02 komninoschatzipapas

Hey! Thanks, a fix can be derived from #1357 and https://github.com/huggingface/transformers/pull/26678. Everything you describe is mentioned there. TL;DR: using Metaspace with prepend_scheme="first" and no normalizer will be the end of your problems
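For readers unfamiliar with the prepend_scheme terminology mentioned here: it controls which text segments (the pieces between special tokens) receive a leading space before spaces are converted to '▁'. A toy, stdlib-only sketch of the idea (not the actual Metaspace implementation, just the scheme logic):

```python
def apply_prepend_scheme(segments, scheme):
    # segments: list of (text, is_special) pairs, as produced by splitting
    # the input around special tokens.
    # scheme: "always" -> prepend a space to every text segment (legacy bug)
    #         "first"  -> prepend only to the very first segment (the fix)
    #         "never"  -> prepend nothing
    out = []
    for i, (seg, is_special) in enumerate(segments):
        if not is_special and (scheme == "always" or (scheme == "first" and i == 0)):
            seg = " " + seg
        out.append((seg, is_special))
    return out

segments = [("test", False), ("<unk>", True), (" This is a test phrase", False)]

# "always": the segment after <unk> now starts with two spaces, which
# shows up downstream as a stray standalone "▁" token.
print(apply_prepend_scheme(segments, "always")[2][0])  # '  This is a test phrase'
# "first": only the leading segment is touched; the double space is gone.
print(apply_prepend_scheme(segments, "first")[2][0])   # ' This is a test phrase'
```

With "first", the segment following <unk> keeps its single original space, which matches the correct legacy=False, use_fast=False output in the report above.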

ArthurZucker avatar Feb 06 '24 03:02 ArthurZucker

I have not had the time to change the default Llama fast tokenizer; will try to do so ASAP

ArthurZucker avatar Feb 06 '24 03:02 ArthurZucker

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Mar 08 '24 01:03 github-actions[bot]

I think this is still relevant

komninoschatzipapas avatar Mar 08 '24 16:03 komninoschatzipapas

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Apr 08 '24 01:04 github-actions[bot]

This was fixed in transformers; you need to set legacy=False 🤗

ArthurZucker avatar Jun 11 '24 12:06 ArthurZucker