
Tokenization in 8.2.2

Open · armingh2000 opened this issue 3 years ago · 1 comment

I think that in the tokenize function, if it's tokenizing words, it should add the space character to the tokens too. Otherwise, the predict function effectively assumes '' between words, so the predictions come out without spaces between them. (This can be worked around by changing this line in predict: `return ''.join([vocab.idx_to_token[i] + ' ' for i in outputs])`.)
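A minimal sketch of the issue, using a hypothetical stand-in for `vocab.idx_to_token` (not the book's actual objects): joining word-level predictions the same way as char-level ones runs the words together, while inserting a space between tokens restores readable output.

```python
# Hypothetical stand-in for vocab.idx_to_token and model outputs.
idx_to_token = ['the', 'time', 'machine']
outputs = [0, 1, 2]

# Char-level style join: words run together with no separators.
no_spaces = ''.join(idx_to_token[i] for i in outputs)
print(no_spaces)    # thetimemachine

# Word-level join: a space between tokens, as the workaround suggests.
with_spaces = ' '.join(idx_to_token[i] for i in outputs)
print(with_spaces)  # the time machine
```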

I think tokenized should be changed like this:

`[line.split() for line in lines] + [[' ']]`

If I'm right, I can make a PR for both the tokenize and predict functions (although for predict I might also have to change the function's inputs so it can tell whether it's a char-level or word-level RNN).
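For context, a simplified sketch of a tokenize function in the style of the one under discussion (the book's actual implementation may differ). Note that char-level tokenization keeps spaces as tokens automatically, which is why the problem only shows up at the word level.

```python
def tokenize(lines, token='word'):
    """Split text lines into word or character tokens (simplified sketch)."""
    if token == 'word':
        # Word-level: whitespace is consumed by split(), so no space tokens.
        return [line.split() for line in lines]
    elif token == 'char':
        # Char-level: every character, including ' ', becomes a token.
        return [list(line) for line in lines]
    raise ValueError(f'Unknown token type: {token}')

lines = ['the time machine']
print(tokenize(lines, 'word'))  # [['the', 'time', 'machine']]
print(tokenize(lines, 'char'))  # includes ' ' tokens between words
```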

armingh2000 avatar Jun 22 '21 07:06 armingh2000

The problem can be fixed by changing only the predict function, so there is no need to change the tokenize function.

armingh2000 avatar Jun 22 '21 09:06 armingh2000

Have you looked at the refactored implementation? We do add space characters to the tokens too. I'm closing this as resolved; if you feel something is missing, feel free to reopen!

AnirudhDagar avatar Dec 16 '22 00:12 AnirudhDagar