neural-paraphrase-generation

Remove the <UNK> token at the end of each sentence

Open jackyuanjie1990 opened this issue 4 years ago • 0 comments

Hello,

Thank you for your implementation. However, I found an issue: every sentence ends with the <UNK> token, as in the example below.

****source == a guy on a bike next to a <UNK>
****target == a bicyclist passing a red commuter bus at a stop on a city <UNK>
****predict == a man riding a bike on a city <UNK>

I dug into the code and found that the error occurs in the function tokenize_and_map in data_handler.py. line.split(' ') does not remove the trailing '\n', so the last token of every sentence still contains '\n'. That token is not in the vocabulary, so it is mapped to <UNK>. For example:

    ['a', 'very', 'clean', 'and', 'well', 'decorated', 'empty', 'bathroom\n']

To fix this bug, we just need to change line.split(' ') to line.split():

    def tokenize_and_map(self, line):
        return [self.vocab.get(token, self.UNK_TOKEN) for token in line.split()]
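Here is a minimal sketch of the difference between the two calls. The toy `vocab` and the `UNK_TOKEN` id of 0 are hypothetical stand-ins, not the values used in the repository, and the two helper functions are named only for this illustration.

```python
# Hypothetical stand-ins for the repository's vocabulary and UNK id.
UNK_TOKEN = 0
vocab = {"a": 1, "very": 2, "clean": 3, "bathroom": 4}


def tokenize_with_space_split(line):
    # Current behavior: split(' ') leaves the trailing newline on the last token.
    return [vocab.get(token, UNK_TOKEN) for token in line.split(' ')]


def tokenize_with_default_split(line):
    # Proposed fix: split() splits on runs of whitespace and ignores
    # leading and trailing whitespace, including the newline.
    return [vocab.get(token, UNK_TOKEN) for token in line.split()]


line = "a very clean bathroom\n"
print(line.split(' '))                    # ['a', 'very', 'clean', 'bathroom\n']
print(tokenize_with_space_split(line))    # [1, 2, 3, 0]
print(tokenize_with_default_split(line))  # [1, 2, 3, 4]
```

With split(' '), the last id is the UNK id because 'bathroom\n' is not a key in the vocabulary. With split(), the last word is looked up correctly.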

Thanks,

Jack

jackyuanjie1990, Jan 10 '21 21:01