neural-paraphrase-generation
Remove the <UNK> token at the end of each sentence
Hello,
Thank you for your implementation. However, I found one issue: every sentence ends with the <UNK> token, as shown below.
****source == a guy on a bike next to a <UNK>
****target == a bicyclist passing a red commuter bus at a stop on a city <UNK>
****predict == a man riding a bike on a city <UNK>
I dug into the code and found that the error happens in the function tokenize_and_map in data_handler.py. line.split(' ') does not remove the trailing '\n', so the last token of every sentence contains '\n', is not found in the vocabulary, and is mapped to <UNK>. For example: ['a', 'very', 'clean', 'and', 'well', 'decorated', 'empty', 'bathroom\n']
To fix this bug, we just need to change line.split(' ') to line.split(), which splits on any whitespace and drops the trailing newline:

    def tokenize_and_map(self, line):
        return [self.vocab.get(token, self.UNK_TOKEN) for token in line.split()]
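Here is a minimal sketch that reproduces the difference outside the repo. The toy vocabulary, the UNK id of 0, and the two standalone function names are assumptions for illustration only, not the actual values or names used in data_handler.py.

```python
# Toy vocabulary and UNK id: illustrative assumptions, not the repo's real values.
UNK_TOKEN = 0
vocab = {"a": 1, "very": 2, "clean": 3, "bathroom": 4}


def tokenize_with_space(line):
    # Current behaviour: splitting on a single space keeps '\n' on the last token.
    return [vocab.get(token, UNK_TOKEN) for token in line.split(' ')]


def tokenize_any_whitespace(line):
    # Proposed fix: split() with no argument splits on any run of whitespace,
    # so the trailing '\n' is discarded.
    return [vocab.get(token, UNK_TOKEN) for token in line.split()]


line = "a very clean bathroom\n"
before = tokenize_with_space(line)
after = tokenize_any_whitespace(line)
print(before)  # [1, 2, 3, 0]  -> 'bathroom\n' is not in the vocabulary
print(after)   # [1, 2, 3, 4]
```

Calling line.strip().split(' ') would probably also remove the trailing newline, but it would still produce empty tokens if a line contains two consecutive spaces, so line.split() seems like the safer choice.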
Thanks,
Jack