UNK isn't in the special_tokens_mask list on encodings 💩
Related to #504. UNK is a special token, so I'd expect it to have a 1 in the special tokens mask. But if I do
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer("/tmp/bert-base-uncased-vocab.txt", lowercase=True)
text = "💩💩💩 💩 💩💩"
encoding = tokenizer.encode(text, add_special_tokens=True)
for token_ix, token in enumerate(encoding.tokens):
    assert encoding.special_tokens_mask[token_ix] == 1, f"My 💩💩💩 isn't special even though its {token}"
I get
AssertionError: My 💩💩💩 isn't special even though its [UNK]
Language is hard (that's why we're here), but I'd say either UNK is a special token or it's not. If we initialize a tokenizer and say that UNK is a special token, then it's special and should get the same treatment as all the other special tokens.
I presume the more important part of the special_tokens_mask accessor is the mask itself, which probably has lots of downstream dependents and isn't easily changeable. Nevertheless I wanted to share.
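In case it helps anyone, a workaround is to rebuild the mask from the token strings and count [UNK] as special. This is a minimal sketch, assuming the same vocab file as above and the standard BERT special token strings (the SPECIAL_TOKENS set is my assumption; adjust it to your vocab):

from tokenizers import BertWordPieceTokenizer

# Token strings treated as special here, including [UNK]; these are the
# standard BERT defaults and may need adjusting for a custom vocab.
SPECIAL_TOKENS = {"[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"}

tokenizer = BertWordPieceTokenizer("/tmp/bert-base-uncased-vocab.txt", lowercase=True)
encoding = tokenizer.encode("💩💩💩 💩 💩💩", add_special_tokens=True)

# Rebuild the mask from the token strings instead of reading encoding.special_tokens_mask.
special_tokens_mask = [1 if tok in SPECIAL_TOKENS else 0 for tok in encoding.tokens]
print(list(zip(encoding.tokens, special_tokens_mask)))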