d2l-en
Tokenization in 8.2.2
I think that when the tokenize function is tokenizing words, it should add the space character to the tokens as well. Otherwise, the predict function has to assume something like `return ''.join([vocab.idx_to_token[i] + ' ' for i in outputs])` to put spaces back between words.
I think the tokenized output should be changed like this:
`[line.split() for line in lines] + [[' ']]`
If I'm right, I can open a PR for both the tokenize and predict functions (although for predict I might also have to change the function's inputs so it can tell whether it is a char-level or word-level RNN).
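For context, here is a minimal sketch of the proposed tokenize change; this is an illustrative standalone version, not the book's exact function, and the trailing `[[' ']]` is the suggested addition that puts the space character into the vocabulary:

```python
def tokenize(lines, token='word'):
    """Split text lines into word or character tokens (illustrative sketch).

    For word-level tokenization, the proposed change appends an extra
    [' '] "line" so that the space character also becomes a known token.
    """
    if token == 'word':
        # Proposed change: include the space character as a token too.
        return [line.split() for line in lines] + [[' ']]
    elif token == 'char':
        return [list(line) for line in lines]
    else:
        raise ValueError(f'unknown token type: {token}')
```

For example, `tokenize(['the time machine'])` would yield `[['the', 'time', 'machine'], [' ']]`, so a vocabulary built from the result contains `' '`.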
The problem can be fixed by changing only the predict function, so there is no need to change the tokenize function.
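A sketch of that predict-only fix might look like the following; the function name and `token` argument are hypothetical, the point is just that the separator used when joining predicted indices can depend on the token level, leaving tokenize untouched:

```python
def join_tokens(outputs, idx_to_token, token='word'):
    """Turn predicted token indices back into a string (illustrative sketch).

    Char-level tokens are concatenated directly; word-level tokens are
    joined with spaces, so tokenize() itself needs no change.
    """
    sep = ' ' if token == 'word' else ''
    return sep.join(idx_to_token[i] for i in outputs)
```

For example, `join_tokens([0, 1, 2], ['the', 'time', 'machine'], token='word')` gives `'the time machine'`, while char-level output is concatenated without separators.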
Have you looked at the refactored implementation? We do add space characters to the tokens too. I'm closing this as resolved; if you feel something is missing, feel free to re-open!