BERT-pytorch

Vocab Replace \t to blank issue

Open NiHaoUCAS opened this issue 5 years ago • 2 comments

When the corpus is: how are you \t nice to meet you and I run the bert-vocab command, the resulting vocab is ['<pad>', '<unk>', '<eos>', '<sos>', '<mask>', 'you', 'are', 'how', 'meet', 'nice', 'to'].
But when I change the corpus to how are you\tnice to meet you, the result is ['<pad>', '<unk>', '<eos>', '<sos>', '<mask>', 'are', 'how', 'meet', 'to', 'you', 'younice']: the last token becomes younice. So a space is needed on both sides of the \t. This may not be a bug.

NiHaoUCAS, Oct 23 '18 13:10

I think this is a bug. The problem is line 127 of vocab.py: words = line.replace("\n", "").replace("\t", "").split(). Here \t is replaced by "" (the empty string), so the words on either side of the tab are joined together. I think it should be replaced by a space instead.
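To make the difference concrete, here is a minimal sketch that applies the quoted expression and the proposed change to the corpus line from the report. The function names tokenize_original and tokenize_fixed are hypothetical and exist only for this illustration; they are not part of the repository, and the actual fix in vocab.py may look different.

```python
def tokenize_original(line):
    # Expression quoted from vocab.py in this thread: the tab is deleted
    # outright, so the words on either side of it are glued together.
    return line.replace("\n", "").replace("\t", "").split()


def tokenize_fixed(line):
    # Proposed change (an assumption about the eventual fix): turn the tab
    # into a space so that split() still sees a word boundary.
    return line.replace("\n", "").replace("\t", " ").split()


line = "how are you\tnice to meet you\n"

print(tokenize_original(line))
# ['how', 'are', 'younice', 'to', 'meet', 'you']

print(tokenize_fixed(line))
# ['how', 'are', 'you', 'nice', 'to', 'meet', 'you']
```

Note that str.split() with no arguments already splits on any run of whitespace, including tabs and newlines, so a plain line.split() would give the same result as tokenize_fixed here.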

jiqiujia, Oct 23 '18 16:10

I'll update the vocab builder ASAP! Thanks.

codertimo, Oct 24 '18 04:10