
UNK isn't in the special_tokens_mask list on encodings 💩

Open · talolard opened this issue 5 years ago · 1 comment

Related to #504. UNK is a special token, so I'd expect it to have a 1 in the special tokens mask. But if I run:

    from tokenizers import BertWordPieceTokenizer

    # Load a BERT WordPiece tokenizer from a local vocab file
    tokenizer = BertWordPieceTokenizer("/tmp/bert-base-uncased-vocab.txt", lowercase=True)

    # Emoji are out-of-vocabulary for bert-base-uncased, so each group maps to [UNK]
    text = "💩💩💩 💩 💩💩"
    encoding = tokenizer.encode(text, add_special_tokens=True)
    for token_ix, token in enumerate(encoding.tokens):
        assert encoding.special_tokens_mask[token_ix] == 1, f"My 💩💩💩 isn't special even though it's {token}"

I get

AssertionError: My 💩💩💩 isn't special even though it's [UNK]

Language is hard (that's why we're here), but I'd say either UNK is a special token or it's not. If we initialize a tokenizer and declare that UNK is a special token, then it's special and should get the same treatment as all the other special tokens.

I presume the more important part of the special_tokens_mask accessor is the mask itself, which probably has plenty of downstream dependents and can't easily be changed. Nevertheless, I wanted to share.
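
For anyone who needs [UNK] flagged in the meantime, a minimal workaround sketch (assuming the built-in mask stays as-is) is to post-process the encoding: look up the [UNK] vocabulary id with token_to_id and OR it into the mask:

    # Workaround sketch: build a custom mask that also flags [UNK].
    # token_to_id looks up the vocabulary id for a given token string.
    unk_id = tokenizer.token_to_id("[UNK]")

    custom_mask = [
        1 if flag == 1 or tok_id == unk_id else 0
        for flag, tok_id in zip(encoding.special_tokens_mask, encoding.ids)
    ]
    # custom_mask now has a 1 for [CLS], [SEP], and every [UNK] token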

talolard · Nov 04 '20, 19:11

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] · May 08 '24, 01:05