ochre
All chars assumption
Hi, the train_lstm step writes an “all chars” text file, which assumes that it has encountered every character in the corpus. But this is not necessarily true: training uses limited data, so it may miss rare characters that show up in the correction step. Is this OK, or is it something that needs to be addressed?
Thanks! Omri
Actually, the chars are extracted from all text (train set, test set, and val set).
Whether this is correct (fair) is open for discussion. It is probably more correct to use only the characters in the train set (and maybe validation set) and have an 'unknown' character. It is likely that the 'unknown' character only appears in the input text, and not in the output text. Otherwise incorrect text will be produced.
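A minimal sketch of the train-set-only approach with an 'unknown' character. This is not ochre's actual code: `build_char_vocab`, `encode`, and `UNKNOWN_CHAR` are hypothetical names, and the placeholder symbol is assumed not to occur in the corpus.

```python
# Hypothetical placeholder; assumed to be absent from the real corpus.
UNKNOWN_CHAR = "\u25a1"


def build_char_vocab(train_texts, unknown_char=UNKNOWN_CHAR):
    """Map each character seen in the training texts to an index.

    The unknown character always gets the last index.
    """
    chars = set()
    for text in train_texts:
        chars.update(text)
    chars.discard(unknown_char)
    vocab = sorted(chars) + [unknown_char]
    return {c: i for i, c in enumerate(vocab)}


def encode(text, char_to_idx, unknown_char=UNKNOWN_CHAR):
    """Encode input text, mapping unseen characters to the unknown index."""
    unk = char_to_idx[unknown_char]
    return [char_to_idx.get(c, unk) for c in text]


char_to_idx = build_char_vocab(["abc", "cab"])
ids = encode("abcz", char_to_idx)  # 'z' was never seen in training
```

Only the input side is encoded this way here. Target text would need separate handling so that the model is not trained to emit the unknown character.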
I've solved this issue by adding another param with chars to include.
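Roughly, the idea looks like the sketch below. It is only an illustration, not the actual change: `collect_chars` and `extra_chars` are made-up names.

```python
def collect_chars(texts, extra_chars=""):
    """Return the sorted set of characters in texts plus any extra ones.

    extra_chars is a hypothetical parameter: characters to include even
    if they never occur in the texts used for training.
    """
    chars = set(extra_chars)
    for text in texts:
        chars.update(text)
    return sorted(chars)


# 'é' and 'z' are forced into the character set although the data lacks them.
all_chars = collect_chars(["abc"], extra_chars="éz")
```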
BTW, do you want me to contribute these changes? I feel like it is very specific to my needs, but if you like...