
"return_special_tokens_mask" does not mask new tokens when added via "add_special_tokens"

Open vgoklani opened this issue 3 years ago • 4 comments

Consider the following example:

from transformers.models.roberta.tokenization_roberta_fast import RobertaTokenizerFast

pretrained_model_name_or_path = "distilroberta-base"
tokenizer = RobertaTokenizerFast.from_pretrained(pretrained_model_name_or_path, use_fast=True)

special_tokens_dict = {'additional_special_tokens': ['BASEBALL_CARDS']}
tokenizer.add_special_tokens(special_tokens_dict)

text = "Let's go buy some BASEBALL_CARDS at Yankee stadium"
encoding = tokenizer.encode_plus(text, return_special_tokens_mask=True)

" ".join(tokenizer.convert_ids_to_tokens(input_id) for input_id in encoding['input_ids'])

# "<s> Let 's Ä go Ä buy Ä some Ä  BASEBALL_CARDS Ä at Ä Yankee Ä stadium </s>"

encoding['special_tokens_mask']

# [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

The special_tokens_mask masks both the <s> and the </s> tokens, but skips the newly created "BASEBALL_CARDS" special token. Is this the intended behavior, or a bug?
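
For what it's worth, the new token does seem to be registered on the tokenizer side; a quick check (a minimal sketch reusing the tokenizer from above; the printed values are what I would expect):

new_id = tokenizer.convert_tokens_to_ids("BASEBALL_CARDS")

print("BASEBALL_CARDS" in tokenizer.additional_special_tokens)  # True
print(new_id in tokenizer.all_special_ids)                      # True

# ...yet the special_tokens_mask returned by encode_plus still reports 0 at its position.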

Also, when does one use "add_special_tokens" vs "add_tokens"?

Thanks!

vgoklani avatar Aug 31 '21 14:08 vgoklani

I noticed this too. One workaround is to call

special_tokens_mask = tokenizer.get_special_tokens_mask(
    input_ids.tolist(), already_has_special_tokens=True
)

The above worked for me while return_special_tokens_mask did not.
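
For instance, applied to the example from the issue (a minimal sketch; assumes the same tokenizer and text as above):

# Encode without asking for the mask, then rebuild it from the token ids.
encoding = tokenizer.encode_plus(text)

# With already_has_special_tokens=True, the base implementation checks each id
# against tokenizer.all_special_ids, so added special tokens get flagged as well.
special_tokens_mask = tokenizer.get_special_tokens_mask(
    encoding["input_ids"], already_has_special_tokens=True
)

print(special_tokens_mask)
# Expected: 1 for <s>, </s> and BASEBALL_CARDS, 0 elsewhere.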

lorr1 avatar Sep 01 '21 23:09 lorr1

Polite ping.

Yevgnen avatar Feb 11 '22 05:02 Yevgnen

Thanks for the gentle ping.

Created a tentative PR for this: #907.

We need to make a full check that this doesn't break other things elsewhere.

@SaulLu do you mind having a look at this?

Is this something that might break other things elsewhere? What additional checks should we make?

Narsil avatar Feb 15 '22 14:02 Narsil

Actually, this seems to really break things in transformers.

Results (107.92s):
    4890 passed
      31 failed
         - tests/test_tokenization_common.py:1339 BertTokenizationTest.test_special_tokens_mask
         - tests/test_tokenization_common.py:1356 BertTokenizationTest.test_special_tokens_mask_input_pairs
         - tests/test_tokenization_common.py:1339 DPRContextEncoderTokenizationTest.test_special_tokens_mask
         - tests/test_tokenization_common.py:1356 DPRContextEncoderTokenizationTest.test_special_tokens_mask_input_pairs
         - tests/test_tokenization_common.py:1339 DistilBertTokenizationTest.test_special_tokens_mask
         - tests/test_tokenization_common.py:1356 DistilBertTokenizationTest.test_special_tokens_mask_input_pairs
         - tests/test_tokenization_common.py:1339 DPRReaderTokenizationTest.test_special_tokens_mask
         - tests/test_tokenization_common.py:1356 DPRReaderTokenizationTest.test_special_tokens_mask_input_pairs
         - tests/test_tokenization_common.py:1339 BertTokenizationTest.test_special_tokens_mask
         - tests/test_tokenization_common.py:2764 PreTrainedTokenizationFastTest.test_offsets_mapping
         - tests/test_tokenization_common.py:1356 BertTokenizationTest.test_special_tokens_mask_input_pairs
         - tests/test_tokenization_common.py:1339 DPRQuestionEncoderTokenizationTest.test_special_tokens_mask
         - tests/test_tokenization_common.py:1356 DPRQuestionEncoderTokenizationTest.test_special_tokens_mask_input_pairs
         - tests/test_tokenization_common.py:1339 FunnelTokenizationTest.test_special_tokens_mask
         - tests/test_tokenization_common.py:1356 FunnelTokenizationTest.test_special_tokens_mask_input_pairs
         - tests/test_tokenization_common.py:1339 BertTokenizationTest.test_special_tokens_mask
         - tests/test_tokenization_common.py:1356 BertTokenizationTest.test_special_tokens_mask_input_pairs
         - tests/test_tokenization_common.py:1339 LxmertTokenizationTest.test_special_tokens_mask
         - tests/test_tokenization_common.py:1356 LxmertTokenizationTest.test_special_tokens_mask_input_pairs
         - tests/test_tokenization_common.py:1339 LayoutLMTokenizationTest.test_special_tokens_mask
         - tests/test_tokenization_common.py:1356 LayoutLMTokenizationTest.test_special_tokens_mask_input_pairs
         - tests/test_tokenization_common.py:1339 OpenAIGPTTokenizationTest.test_special_tokens_mask
         - tests/test_tokenization_common.py:1356 OpenAIGPTTokenizationTest.test_special_tokens_mask_input_pairs
         - tests/test_tokenization_common.py:1339 MPNetTokenizerTest.test_special_tokens_mask
         - tests/test_tokenization_common.py:1356 MPNetTokenizerTest.test_special_tokens_mask_input_pairs
         - tests/test_tokenization_common.py:1339 RealmTokenizationTest.test_special_tokens_mask
         - tests/test_tokenization_common.py:1356 RealmTokenizationTest.test_special_tokens_mask_input_pairs
         - tests/test_tokenization_common.py:1339 BertTokenizationTest.test_special_tokens_mask
         - tests/test_tokenization_common.py:1356 BertTokenizationTest.test_special_tokens_mask_input_pairs
         - tests/test_tokenization_common.py:1339 SqueezeBertTokenizationTest.test_special_tokens_mask
         - tests/test_tokenization_common.py:1356 SqueezeBertTokenizationTest.test_special_tokens_mask_input_pairs
     954 skipped

@Yevgnen, it seems that it might be trickier to add this.

In those tests, [UNK] tokens are encoded; [UNK] is a special token, but we still don't want it masked in the output.
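
To illustrate (a minimal sketch, assuming a standard BERT checkpoint and an input character that falls out of its vocabulary; the printed values are what I would expect, not output from the PR branch):

from transformers import BertTokenizerFast

# Assumes bert-base-uncased; the snowman character should be out of vocabulary
# and therefore encoded as [UNK].
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoding = tokenizer("hello ☃ world", return_special_tokens_mask=True)

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# e.g. ['[CLS]', 'hello', '[UNK]', 'world', '[SEP]']

# The common tests expect only the template tokens ([CLS]/[SEP]) to be flagged,
# even though [UNK] is in tokenizer.all_special_ids:
print(encoding["special_tokens_mask"])
# e.g. [1, 0, 0, 0, 1]

# A mask built purely from all_special_ids would also flag the [UNK] inside the text:
print([1 if i in tokenizer.all_special_ids else 0 for i in encoding["input_ids"]])
# e.g. [1, 0, 1, 0, 1]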

Narsil avatar Feb 16 '22 12:02 Narsil

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Mar 18 '24 01:03 github-actions[bot]