BERT-pytorch
Vocab: replacing \t with an empty string issue
When the corpus is:
how are you \t nice to meet you
and I run the bert-vocab command, the resulting vocab is
['<pad>', '<unk>', '<eos>', '<sos>', '<mask>', 'you', 'are', 'how', 'meet', 'nice', 'to']
But when I change the corpus to
how are you\tnice to meet you
the result is ['<pad>', '<unk>', '<eos>', '<sos>', '<mask>', 'are', 'how', 'meet', 'to', 'you', 'younice']
The words "you" and "nice" on either side of the tab are merged, so the last token becomes younice.
So a space is needed on both sides of the \t.
This may not be a bug.
I think this is a bug. The problem is in vocab.py, line 127:
words = line.replace("\n", "").replace("\t", "").split()
Here \t is replaced by "" (an empty string), which glues the words on either side of the tab together. I think it should be replaced by a space.
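To make the difference concrete, here is a minimal standalone sketch that applies the quoted expression and the proposed change to the second corpus line from this issue. The variable names are only for illustration and are not taken from vocab.py.

```python
# Second corpus line from this issue: a tab with no spaces around it.
line = "how are you\tnice to meet you\n"

# Expression as quoted above: the tab is deleted, so "you" and "nice" fuse.
current = line.replace("\n", "").replace("\t", "").split()

# Proposed change: the tab becomes a space, so the two words stay separate.
proposed = line.replace("\n", "").replace("\t", " ").split()

# str.split() with no arguments splits on any run of whitespace,
# tabs and newlines included, so it gives the same tokens on its own.
plain_split = line.split()

print(current)
print(proposed)
print(plain_split)
```

Since a bare split() already treats tabs and newlines as separators, dropping both replace() calls might be an even simpler fix, assuming nothing else in the vocab builder relies on them.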
I'll update the vocab builder ASAP! thanx