Bug when using special tokens with uppercase letters
I am training a BERT WordPiece tokenizer on my own corpus with do_lowercase=True. In my corpus, I pre-replace some tokens with [UNK] and [SEP], and both [UNK] and [SEP] are in the default special tokens.
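Roughly, the training setup looks like this (a minimal sketch of my setup; the corpus path and vocab size are placeholders, and I spell the lowercasing flag as lowercase= here since the exact argument name may differ between tokenizers versions):

```python
from tokenizers import BertWordPieceTokenizer

# Train a BERT WordPiece tokenizer from scratch with lowercasing enabled.
# corpus.txt already contains the literal strings [UNK] and [SEP] that were
# substituted into the text beforehand.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=30000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Writes vocab.txt to the current directory.
tokenizer.save_model(".")
```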
After training finishes, I find several undesired tokens in vocab.txt:
##sep]
[sep]
[unk
[unk]
I grepped my training corpus and cannot find these tokens there.
I guess do_lowercase breaks the uppercase special tokens.
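As a quick check of that guess (a small sketch, assuming the standard BertNormalizer is what runs before training), the normalizer rewrites the literal special-token text to lowercase before the trainer ever sees it, which would explain the lowercased fragments above:

```python
from tokenizers.normalizers import BertNormalizer

# With lowercasing enabled, the literal "[SEP]" in the raw corpus becomes
# "[sep]" during normalization, so the trainer only ever sees the
# lowercased form of the special-token text.
normalizer = BertNormalizer(lowercase=True)
print(normalizer.normalize_str("some text [SEP] more text"))
# expected output: "some text [sep] more text"
```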
Even when I set do_lowercase=False, there are still some undesired tokens:
[S
##EP
[SEP
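I'm less sure what happens in the do_lowercase=False case. One thing I looked at (again only a sketch, assuming the default BertPreTokenizer) is how the special-token text reaches the trainer after pre-tokenization; either way, the literal [SEP] in the corpus seems to be treated as ordinary text during training:

```python
from tokenizers.pre_tokenizers import BertPreTokenizer

# The default BERT pre-tokenizer splits on whitespace and punctuation, so
# the brackets are separated from the letters and the special-token text
# is handed to the trainer as plain word pieces rather than as one unit.
pre_tokenizer = BertPreTokenizer()
print(pre_tokenizer.pre_tokenize_str("some text [SEP] more text"))
# expected output (offsets omitted): 'some', 'text', '[', 'SEP', ']', 'more', 'text'
```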
That does seem like a bug. Looking into how to fix that properly.