Issue with Tokenizer (fast) splitting `<mask>` into constituent added special tokens despite mask token in vocab and in special tokens map
System Info
- transformers version: 4.25.1
- Platform: Linux-5.4.0-91-generic-x86_64-with-glibc2.31
- Python version: 3.10.6
- Huggingface_hub version: 0.11.0.rc0
- PyTorch version (GPU?): 1.13.0+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
@ArthurZucker
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Steps to Reproduce Behavior:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("./tok", use_fast=True)
tokenizer(tokenizer.mask_token, add_special_tokens=False)
This evaluates to {'input_ids': [11, 10], 'attention_mask': [1, 1]}.
tokenizer_slow = AutoTokenizer.from_pretrained("./tok", use_fast=False)
tokenizer_slow(tokenizer_slow.mask_token, add_special_tokens=False)
This evaluates to {'input_ids': [4], 'attention_mask': [1]} (as expected).
Note that in either case, mask_token is <mask> and corresponds to mask_token_id 4.
Note also that the directory tok contains merges.txt, special_tokens_map.json, tokenizer_config.json, tokenizer.json, and vocab.json. Both additional_special_tokens and the vocab contain {... "m": 11, "s": 10, ...}, so I believe the Rust tokenizer is matching these added special tokens before it considers the <mask> token.
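For reference, a minimal way to inspect what the fast tokenizer is matching instead of <mask> (a sketch, assuming the same local ./tok directory as above):

```python
from transformers import AutoTokenizer

# Load the fast tokenizer from the local directory described above.
tokenizer = AutoTokenizer.from_pretrained("./tok", use_fast=True)

# Tokens the fast tokenizer registered as added tokens, with their ids.
print(tokenizer.get_added_vocab())

# Map the unexpected ids back to tokens to see what <mask> was split into.
print(tokenizer.convert_ids_to_tokens([11, 10]))

# The mask token and the id it is supposed to map to.
print(tokenizer.mask_token, tokenizer.mask_token_id)
```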
Expected behavior
tokenizer_slow(tokenizer_slow.mask_token, add_special_tokens=False)['input_ids'] == tokenizer(tokenizer.mask_token, add_special_tokens=False)['input_ids'] == [4] would evaluate to True.
Interestingly, this is part of a series of bugs we have seen with different behaviours between the fast and slow tokenizers. Thanks for posting.
Thank you for your response, @ArthurZucker. I would be happy to provide details about instantiation and behavior if needed.
Just to be able to reproduce this correctly, could you tell me which tokenizer you are using?
RobertaTokenizerFast
Could you push your tokenizer to the Hub? I can't really reproduce this right now.
I also faced the same issue when training with the ByteLevelBPETokenizer suggested in https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/01_how_to_train.ipynb#scrollTo=IMnymRDLe0hi
Tokenizer training:
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    iterator=LIST_OF_STRINGS,
    vocab_size=52000,
    min_frequency=2,
    special_tokens=[
        "<s>",
        "<pad>",
        "</s>",
        "<unk>",
        "<mask>",
    ],
)
Tokenizer use:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast(vocab_file="<VOCAB_FILE_PATH>",
                                 merges_file="<MERGES_FILE_PATH>",
                                 max_len=512)
This tokenizer gives me: ['<s>', '<', 'mask', '>', '</s>'] when I use:
tokenizer.convert_ids_to_tokens(tokenizer.encode(tokenizer.mask_token))
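For reference, a quick sanity check on how this tokenizer registered the mask token (a sketch reusing the tokenizer instance built above):

```python
# The mask token string and the id the fast tokenizer assigns to it.
print(tokenizer.mask_token, tokenizer.mask_token_id)

# Whether "<mask>" exists as a single entry in the vocabulary.
print(tokenizer.get_vocab().get("<mask>"))

# Tokens registered as added/special tokens, with their ids.
print(tokenizer.get_added_vocab())
```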
Is there a known fix for this? I am using Python 3.8, transformers 4.24.0, and tokenizers 0.13.1.
I will have a look thanks 😉
This is likely related to the tokenizers library, as both reports involve the fast tokenizer. Not stale!
Thanks for your patience 🤗
- In the current state, it is not a problem with the tokenizer itself as:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast= True)
tokenizer(tokenizer.mask_token, add_special_tokens=False)
correctly outputs 50264.
- Regarding the training of the tokenizer, the notebook works well for me and I cannot reproduce the issue you are getting. Are you sure that you properly saved the vocabulary and merges with tokenizer.save_model() (using the Rust tokenizer)? A minimal sketch of that step is shown below.
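For reference, a minimal sketch of that save step, reusing the ByteLevelBPETokenizer instance from the training snippet above (the output directory name is an assumption, not from the notebook):

```python
import os

# Persist the trained tokenizer: save_model writes vocab.json and merges.txt,
# which can then be passed as vocab_file and merges_file to RobertaTokenizerFast.
# "./tok" is an assumed output directory name.
os.makedirs("./tok", exist_ok=True)
tokenizer.save_model("./tok")
```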

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.