"return_special_tokens_mask" does not mask new tokens when added via "add_special_tokens"
Consider the following example:
from transformers.models.roberta.tokenization_roberta_fast import RobertaTokenizerFast
pretrained_model_name_or_path = "distilroberta-base"
tokenizer = RobertaTokenizerFast.from_pretrained(pretrained_model_name_or_path, use_fast=True)
special_tokens_dict = {'additional_special_tokens': ['BASEBALL_CARDS']}
tokenizer.add_special_tokens(special_tokens_dict)
text = "Let's go buy some BASEBALL_CARDS at Yankee stadium"
encoding = tokenizer.encode_plus(text, return_special_tokens_mask=True)
" ".join(tokenizer.convert_ids_to_tokens(input_id) for input_id in encoding['input_ids'])
# "<s> Let 's Ä go Ä buy Ä some Ä BASEBALL_CARDS Ä at Ä Yankee Ä stadium </s>"
encoding['special_tokens_mask']
# [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
The special_tokens_mask masks both the <s> and the </s> tokens, but skips the newly created "BASEBALL_CARDS" special token. Is this the intended behavior, or a bug?
Also, when does one use "add_special_tokens" vs "add_tokens"?
Thanks!
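For reference, a minimal sketch of the difference between the two calls, assuming the same tokenizer as in the example above:

# add_tokens adds plain vocabulary entries; they are not treated as special tokens
tokenizer.add_tokens(['BASEBALL_CARDS'])
# add_special_tokens registers them as special tokens: they show up in
# tokenizer.all_special_tokens / tokenizer.all_special_ids and are dropped by
# tokenizer.decode(ids, skip_special_tokens=True)
tokenizer.add_special_tokens({'additional_special_tokens': ['BASEBALL_CARDS']})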
I noticed this too. One workaround is to call
special_tokens_mask = tokenizer.get_special_tokens_mask(
input_ids.tolist(), already_has_special_tokens=True
)
The above worked for me while return_special_tokens_mask did not.
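Applied to the example above (a sketch; tokenizer and encoding are the objects from the first snippet):

special_tokens_mask = tokenizer.get_special_tokens_mask(
    encoding['input_ids'], already_has_special_tokens=True
)
# With already_has_special_tokens=True the mask is built by checking each id
# against tokenizer.all_special_ids, so the added BASEBALL_CARDS token should
# be marked with 1 here as well.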
Politely ping
Thanks for the gentle ping.
Created a tentative PR for this: #907.
We need to make a full check that this doesn't break other things elsewhere.
@SaulLu do you mind having a look at this?
Is this something that might break other things elsewhere? What additional checks should we make?
Actually this seems to really break things on transformers.
Results (107.92s):
4890 passed
31 failed
- tests/test_tokenization_common.py:1339 BertTokenizationTest.test_special_tokens_mask
- tests/test_tokenization_common.py:1356 BertTokenizationTest.test_special_tokens_mask_input_pairs
- tests/test_tokenization_common.py:1339 DPRContextEncoderTokenizationTest.test_special_tokens_mask
- tests/test_tokenization_common.py:1356 DPRContextEncoderTokenizationTest.test_special_tokens_mask_input_pairs
- tests/test_tokenization_common.py:1339 DistilBertTokenizationTest.test_special_tokens_mask
- tests/test_tokenization_common.py:1356 DistilBertTokenizationTest.test_special_tokens_mask_input_pairs
- tests/test_tokenization_common.py:1339 DPRReaderTokenizationTest.test_special_tokens_mask
- tests/test_tokenization_common.py:1356 DPRReaderTokenizationTest.test_special_tokens_mask_input_pairs
- tests/test_tokenization_common.py:1339 BertTokenizationTest.test_special_tokens_mask
- tests/test_tokenization_common.py:2764 PreTrainedTokenizationFastTest.test_offsets_mapping
- tests/test_tokenization_common.py:1356 BertTokenizationTest.test_special_tokens_mask_input_pairs
- tests/test_tokenization_common.py:1339 DPRQuestionEncoderTokenizationTest.test_special_tokens_mask
- tests/test_tokenization_common.py:1356 DPRQuestionEncoderTokenizationTest.test_special_tokens_mask_input_pairs
- tests/test_tokenization_common.py:1339 FunnelTokenizationTest.test_special_tokens_mask
- tests/test_tokenization_common.py:1356 FunnelTokenizationTest.test_special_tokens_mask_input_pairs
- tests/test_tokenization_common.py:1339 BertTokenizationTest.test_special_tokens_mask
- tests/test_tokenization_common.py:1356 BertTokenizationTest.test_special_tokens_mask_input_pairs
- tests/test_tokenization_common.py:1339 LxmertTokenizationTest.test_special_tokens_mask
- tests/test_tokenization_common.py:1356 LxmertTokenizationTest.test_special_tokens_mask_input_pairs
- tests/test_tokenization_common.py:1339 LayoutLMTokenizationTest.test_special_tokens_mask
- tests/test_tokenization_common.py:1356 LayoutLMTokenizationTest.test_special_tokens_mask_input_pairs
- tests/test_tokenization_common.py:1339 OpenAIGPTTokenizationTest.test_special_tokens_mask
- tests/test_tokenization_common.py:1356 OpenAIGPTTokenizationTest.test_special_tokens_mask_input_pairs
- tests/test_tokenization_common.py:1339 MPNetTokenizerTest.test_special_tokens_mask
- tests/test_tokenization_common.py:1356 MPNetTokenizerTest.test_special_tokens_mask_input_pairs
- tests/test_tokenization_common.py:1339 RealmTokenizationTest.test_special_tokens_mask
- tests/test_tokenization_common.py:1356 RealmTokenizationTest.test_special_tokens_mask_input_pairs
- tests/test_tokenization_common.py:1339 BertTokenizationTest.test_special_tokens_mask
- tests/test_tokenization_common.py:1356 BertTokenizationTest.test_special_tokens_mask_input_pairs
- tests/test_tokenization_common.py:1339 SqueezeBertTokenizationTest.test_special_tokens_mask
- tests/test_tokenization_common.py:1356 SqueezeBertTokenizationTest.test_special_tokens_mask_input_pairs
954 skipped
@Yevgnen, it seems that it might be trickier to add this.
In those tests, [UNK] is encoded, which is a special token, but we still don't want to mask it in the output.
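To illustrate why, a small sketch of the [UNK] case with a BERT tokenizer (illustrative outputs; the exact split of the out-of-vocabulary input may differ):

from transformers import BertTokenizerFast

bert_tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
# "🤖" is assumed to be out of vocabulary, so it is encoded as [UNK]
enc = bert_tokenizer.encode_plus("hello 🤖", return_special_tokens_mask=True)
bert_tokenizer.convert_ids_to_tokens(enc['input_ids'])
# ['[CLS]', 'hello', '[UNK]', '[SEP]']
enc['special_tokens_mask']
# [1, 0, 0, 1]
# [UNK] is in all_special_ids, but the tests expect the in-text [UNK] to stay
# unmasked, so a blanket all_special_ids check would flip that 0 to 1 and fail.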
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.