
Bug when using special tokens with uppercase

guolinke opened this issue on Sep 26, 2020 · 3 comments

I trained a BERT WordPiece tokenizer on my own, with do_lowercase=True. In my corpus, I pre-replaced some tokens with [UNK] and [SEP], and both [UNK] and [SEP] are in the default special tokens.

After the training finished, I found several undesired tokens in the vocab.txt:

##sep]
[sep]
[unk
[unk]

I grepped my training corpus and cannot find these tokens there.

I guess do_lowercase breaks the uppercase special tokens.
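
For reference, a minimal sketch of the training setup described above, assuming the `tokenizers` Python bindings; the `lowercase` flag is assumed to correspond to `do_lowercase`, and the corpus path is hypothetical:

```python
# Minimal reproduction sketch, assuming the `tokenizers` Python bindings.
# `lowercase` is assumed to correspond to the `do_lowercase` flag above;
# `corpus.txt` is a hypothetical corpus containing literal [UNK] / [SEP]
# markers that were pre-replaced into the text.
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=30000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model(".")  # writes vocab.txt, which then contains e.g. [sep]
```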

guolinke · Sep 26 '20 02:09

Even if I set do_lowercase=False, there are still some undesired tokens:

[S
##EP
[SEP
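
One quick way to surface these leaked fragments is to scan the trained vocab for stray brackets and uppercase fragments of the special-token names. A plain-Python sketch, assuming a `vocab.txt` with one token per line:

```python
# Diagnostic sketch: flag vocab entries that look like fragments of the
# special tokens. Pure Python; assumes a vocab.txt with one token per line.
SPECIAL = ["[UNK]", "[SEP]", "[CLS]", "[PAD]", "[MASK]"]
NAMES = [sp.strip("[]") for sp in SPECIAL]  # UNK, SEP, CLS, PAD, MASK

def leaked_fragments(vocab_path="vocab.txt"):
    suspects = []
    with open(vocab_path, encoding="utf-8") as f:
        for token in (line.strip() for line in f):
            if token in SPECIAL:
                continue  # the intact special tokens are fine
            bare = token[2:] if token.startswith("##") else token
            # heuristic 1: stray brackets can only come from broken markers
            if "[" in bare or "]" in bare:
                suspects.append(token)
                continue
            # heuristic 2: an uppercase fragment of a special-token name
            if bare.isupper() and len(bare) > 1 and any(bare in n for n in NAMES):
                suspects.append(token)
    return suspects

print(leaked_fragments())  # e.g. ['##sep]', '[sep]', '[unk', '[S', '##EP', '[SEP']
```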

guolinke · Sep 26 '20 04:09

That does seem like a bug. Looking into how to fix that properly.

Narsil · Sep 28 '20 11:09
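
In the meantime, one possible workaround (a sketch, not the maintainers' fix, and assuming the trainer's `special_tokens` argument alone is what puts the intact tokens into the vocab): strip the pre-replaced markers from the corpus before training, so the normalizer and pre-tokenizer never see them.

```python
# Workaround sketch, NOT the library's fix: remove the literal [UNK]/[SEP]
# markers before training. The trainer's special_tokens argument already
# inserts the intact special tokens into the vocab, so the markers do not
# need to survive normalization. File names are hypothetical.
import re

MARKERS = re.compile(r"\[(?:UNK|SEP|CLS|PAD|MASK)\]")

with open("corpus.txt", encoding="utf-8") as src, \
     open("corpus.clean.txt", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(MARKERS.sub(" ", line))
```

Training on corpus.clean.txt should keep the vocab free of case-mangled fragments, at the cost of losing the marker positions, which only matters if the same file is reused for encoding.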

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] · May 12 '24 01:05