
Issue with Tokenizer (fast) splitting `<mask>` into constituent added special tokens despite mask token in vocab and in special tokens map

Open simonlevine opened this issue 3 years ago • 10 comments

System Info

  • transformers version: 4.25.1
  • Platform: Linux-5.4.0-91-generic-x86_64-with-glibc2.31
  • Python version: 3.10.6
  • Huggingface_hub version: 0.11.0.rc0
  • PyTorch version (GPU?): 1.13.0+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@ArthurZucker

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

Steps to Reproduce Behavior:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./tok", use_fast=True)
tokenizer(tokenizer.mask_token, add_special_tokens=False) 

Evaluates to {'input_ids': [11, 10], 'attention_mask': [1, 1]}

tokenizer_slow = AutoTokenizer.from_pretrained("./tok", use_fast=False)
tokenizer_slow(tokenizer_slow.mask_token, add_special_tokens=False)

Evaluates to {'input_ids': [4], 'attention_mask': [1]} (as expected).

Note that in either case, mask_token is <mask> and corresponds to mask_token_id 4.

Note also that the directory tok contains merges.txt, special_tokens_map.json, tokenizer_config.json, tokenizer.json, and vocab.json. The additional_special_tokens and the vocab contain {..., "m": 11, "s": 10, ...}, so I believe the Rust tokenizer matches these added special tokens before it considers the <mask> token; a quick way to check this is sketched below.
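A minimal diagnostic sketch for this hypothesis (assuming the same ./tok directory as above; only standard transformers tokenizer methods are used):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./tok", use_fast=True)

# Which tokens do the unexpected ids map back to?
print(tokenizer.convert_ids_to_tokens([11, 10]))

# Is <mask> present as a single vocabulary entry with id 4?
print(tokenizer.get_vocab().get("<mask>"))

# Which added/special tokens does the fast tokenizer actually know about?
print(tokenizer.get_added_vocab())
print(tokenizer.additional_special_tokens)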

Expected behavior

tokenizer_slow(tokenizer_slow.mask_token, add_special_tokens=False)['input_ids'] == tokenizer(tokenizer.mask_token, add_special_tokens=False)['input_ids'] == [4] would evaluate to True.

simonlevine avatar Dec 12 '22 21:12 simonlevine

Interesting, this is part of a series of bugs we have with different behaviours between the fast and slow tokenizers. Thanks for posting.

ArthurZucker avatar Dec 13 '22 09:12 ArthurZucker

Thank you for your response @ArthurZucker . I would be happy to provide details about instantiation and behavior if needed.

simonlevine avatar Dec 14 '22 00:12 simonlevine

Just to be able to reproduce this correctly, could you tell me which tokenizer you are using?

ArthurZucker avatar Dec 14 '22 06:12 ArthurZucker

RobertaTokenizerFast

simonlevine avatar Dec 14 '22 19:12 simonlevine

Could you push your tokenizer to the hub? I can't really reproduce this now

ArthurZucker avatar Dec 20 '22 14:12 ArthurZucker

I also faced the same issue when training with the ByteLevelBPETokenizer suggested in https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/01_how_to_train.ipynb#scrollTo=IMnymRDLe0hi

Tokenizer training:

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(iterator=LIST_OF_STRINGS, vocab_size=52000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

Tokenizer use:

tokenizer = RobertaTokenizerFast(vocab_file="<VOCAB_FILE_PATH>",
                                 merges_file="<MERGES_FILE_PATH>",
                                 max_len=512)

This tokenizer gives me ['<s>', '<', 'mask', '>', '</s>'] when I use:

tokenizer.convert_ids_to_tokens(tokenizer.encode(tokenizer.mask_token))

Is there a known fix for this? I am using Python 3.8, transformers 4.24.0, and tokenizers 0.13.1.
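A possible workaround sketch, not a confirmed fix (it reuses the <VOCAB_FILE_PATH>/<MERGES_FILE_PATH> placeholders above and the standard add_special_tokens API), is to re-register <mask> explicitly after instantiating the fast tokenizer and check whether it then survives as a single token:

from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast(vocab_file="<VOCAB_FILE_PATH>",
                                 merges_file="<MERGES_FILE_PATH>",
                                 max_len=512)

# Explicitly (re-)register <mask> as the mask token.
tokenizer.add_special_tokens({"mask_token": "<mask>"})

# If <mask> is now handled as one added token, this should give
# ['<s>', '<mask>', '</s>'] rather than the split form above.
print(tokenizer.convert_ids_to_tokens(tokenizer.encode(tokenizer.mask_token)))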

adytiwari avatar Jan 10 '23 18:01 adytiwari

I will have a look thanks 😉

ArthurZucker avatar Jan 18 '23 11:01 ArthurZucker

This will be related to the tokenizers library, as both reports involve the fast tokenizer. Not stale!

ArthurZucker avatar Mar 09 '23 17:03 ArthurZucker

Thanks for your patience 🤗

  1. In the current state, it is not a problem with the tokenizer itself as:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast= True)
tokenizer(tokenizer.mask_token, add_special_tokens=False)

correctly outputs 50264 (the mask_token_id of roberta-base).

  2. Regarding the training of the tokenizer, the notebook works well for me and I cannot reproduce the issue that you are getting. Are you sure that you properly saved the vocabulary and merges with tokenizer.save_model() (using the rust tokenizer)?

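For reference, a minimal sketch of the train → save_model() → reload flow from that notebook (the directory name ./tok_dir and LIST_OF_STRINGS are placeholders):

import os

from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaTokenizerFast

os.makedirs("./tok_dir", exist_ok=True)

# Train with the same special tokens as in the report above.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    iterator=LIST_OF_STRINGS,
    vocab_size=52000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# save_model() writes vocab.json and merges.txt.
tokenizer.save_model("./tok_dir")

# Reload as a fast tokenizer from the saved files.
fast_tokenizer = RobertaTokenizerFast.from_pretrained("./tok_dir", max_len=512)

# <mask> should come back as a single token:
print(fast_tokenizer.convert_ids_to_tokens(fast_tokenizer.encode(fast_tokenizer.mask_token)))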

ArthurZucker avatar Mar 29 '23 09:03 ArthurZucker

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Apr 22 '23 15:04 github-actions[bot]