
Issue with Tokenizer (fast) splitting `<mask>` into constituent added special tokens despite mask token in vocab and in special tokens map

Open simonlevine opened this issue 3 years ago • 10 comments

System Info

  • transformers version: 4.25.1
  • Platform: Linux-5.4.0-91-generic-x86_64-with-glibc2.31
  • Python version: 3.10.6
  • Huggingface_hub version: 0.11.0.rc0
  • PyTorch version (GPU?): 1.13.0+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@ArthurZucker

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

Steps to Reproduce Behavior:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./tok", use_fast=True)
tokenizer(tokenizer.mask_token, add_special_tokens=False) 

Evaluates to {'input_ids': [11, 10], 'attention_mask': [1, 1]}

tokenizer_slow = AutoTokenizer.from_pretrained("./tok", use_fast=False)
tokenizer_slow(tokenizer_slow.mask_token, add_special_tokens=False)

Evaluates to {'input_ids': [4], 'attention_mask': [1]} (as expected).

Note that in either case, mask_token is <mask> and corresponds to mask_token_id 4.

Note also that the directory tok contains merges.txt, special_tokens_map.json, tokenizer_config.json, tokenizer.json, and vocab.json. The additional_special_tokens and the vocab contain {..., "m": 11, "s": 10, ...}, so I believe the Rust tokenizer matches these added special tokens before it considers the <mask> token; a quick way to check this is sketched below.
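A minimal diagnostic sketch for this hypothesis (assuming the same ./tok directory as above; only standard transformers tokenizer methods are used):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./tok", use_fast=True)

# Which tokens do the unexpected ids map back to?
print(tokenizer.convert_ids_to_tokens([11, 10]))

# Is <mask> present as a single vocabulary entry with id 4?
print(tokenizer.get_vocab().get("<mask>"))

# Which added/special tokens does the fast tokenizer actually know about?
print(tokenizer.get_added_vocab())
print(tokenizer.additional_special_tokens)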

Expected behavior

tokenizer_slow(tokenizer_slow.mask_token, add_special_tokens=False)['input_ids'] == tokenizer(tokenizer.mask_token, add_special_tokens=False)['input_ids'] == [4] would evaluate to True.

simonlevine avatar Dec 12 '22 21:12 simonlevine

Interesting, this is part of a series of bugs we have with different behaviours between the fast and slow tokenizers. Thanks for posting.

ArthurZucker avatar Dec 13 '22 09:12 ArthurZucker

Thank you for your response @ArthurZucker . I would be happy to provide details about instantiation and behavior if needed.

simonlevine avatar Dec 14 '22 00:12 simonlevine

Just to be able to reproduce this correctly, could you tell me which tokenizer you are using?

ArthurZucker avatar Dec 14 '22 06:12 ArthurZucker

RobertaTokenizerFast

simonlevine avatar Dec 14 '22 19:12 simonlevine

Could you push your tokenizer to the hub? I can't really reproduce this now

ArthurZucker avatar Dec 20 '22 14:12 ArthurZucker

I also faced the same issue when training with the ByteLevelBPETokenizer suggested in https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/01_how_to_train.ipynb#scrollTo=IMnymRDLe0hi

Tokenizer training:

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(iterator=LIST_OF_STRINGS, vocab_size=52000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

Tokenizer use:

tokenizer = RobertaTokenizerFast(vocab_file="<VOCAB_FILE_PATH>",
                                 merges_file="<MERGES_FILE_PATH>",
                                 max_len=512)

This tokenizer gives me ['<s>', '<', 'mask', '>', '</s>'] when I use:

tokenizer.convert_ids_to_tokens(tokenizer.encode(tokenizer.mask_token))

Is there a known fix for this? I am using Python 3.8, transformers 4.24.0, and tokenizers 0.13.1.
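A possible workaround sketch, not a confirmed fix (it reuses the <VOCAB_FILE_PATH>/<MERGES_FILE_PATH> placeholders above and the standard add_special_tokens API), is to re-register <mask> explicitly after instantiating the fast tokenizer and check whether it then survives as a single token:

from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast(vocab_file="<VOCAB_FILE_PATH>",
                                 merges_file="<MERGES_FILE_PATH>",
                                 max_len=512)

# Explicitly (re-)register <mask> as the mask token.
tokenizer.add_special_tokens({"mask_token": "<mask>"})

# If <mask> is now handled as one added token, this should give
# ['<s>', '<mask>', '</s>'] rather than the split form above.
print(tokenizer.convert_ids_to_tokens(tokenizer.encode(tokenizer.mask_token)))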

adytiwari avatar Jan 10 '23 18:01 adytiwari

I will have a look thanks 😉

ArthurZucker avatar Jan 18 '23 11:01 ArthurZucker

This will be related to the tokenizers library, as both reports involve the fast tokenizer. Not stale!

ArthurZucker avatar Mar 09 '23 17:03 ArthurZucker

Thanks for your patience 🤗

  1. In the current state, it is not a problem with the tokenizer itself as:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast= True)
tokenizer(tokenizer.mask_token, add_special_tokens=False)

correctly outputs 50264 (the mask_token_id of roberta-base).

  2. Regarding the training of the tokenizer, the notebook works well for me and I cannot reproduce the issue that you are getting. Are you sure that you properly saved the vocabulary and merges with tokenizer.save_model() (using the rust tokenizer)?

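For reference, a minimal sketch of the train → save_model() → reload flow from that notebook (the directory name ./tok_dir and LIST_OF_STRINGS are placeholders):

import os

from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaTokenizerFast

os.makedirs("./tok_dir", exist_ok=True)

# Train with the same special tokens as in the report above.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    iterator=LIST_OF_STRINGS,
    vocab_size=52000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# save_model() writes vocab.json and merges.txt.
tokenizer.save_model("./tok_dir")

# Reload as a fast tokenizer from the saved files.
fast_tokenizer = RobertaTokenizerFast.from_pretrained("./tok_dir", max_len=512)

# <mask> should come back as a single token:
print(fast_tokenizer.convert_ids_to_tokens(fast_tokenizer.encode(fast_tokenizer.mask_token)))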

ArthurZucker avatar Mar 29 '23 09:03 ArthurZucker

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Apr 22 '23 15:04 github-actions[bot]