ochre All chars assumption

Open omrishsu opened this issue 6 years ago • 2 comments

Hi, the train_lstm step writes an “all chars” text file, which assumes that every char in the corpus has been encountered. But this is not necessarily true: the training is on limited data, so it may miss rare chars that show up in the correction step. Is that OK, or is this something that needs to be addressed?

Thanks! Omri

Mar 03 '18 07:03 omrishsu

Actually, the chars are extracted from all text (train set, test set, and val set).

Whether this is correct (fair) is open for discussion. It is probably more correct to use only the characters in the train set (and maybe validation set) and have an 'unknown' character. It is likely that the 'unknown' character only appears in the input text, and not in the output text. Otherwise incorrect text will be produced.
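A minimal sketch of the 'unknown' character idea described above, not the code in ochre. The names `UNKNOWN`, `build_char_index`, and `encode` are hypothetical, and the choice of `"\u0000"` as the placeholder is an assumption (any character that cannot occur in the corpus would do). The vocabulary is built from the train (and optionally validation) text only, and any other character is mapped to the unknown index when encoding input.

```python
# Hypothetical placeholder for characters not seen while building the vocabulary.
UNKNOWN = "\u0000"


def build_char_index(train_text, val_text=""):
    """Map each char seen in the train/validation text to an integer index.

    Index 0 is reserved for the unknown character.
    """
    chars = set(train_text) | set(val_text)
    chars.discard(UNKNOWN)  # keep the placeholder out of the regular vocabulary
    char_to_idx = {c: i + 1 for i, c in enumerate(sorted(chars))}
    char_to_idx[UNKNOWN] = 0
    return char_to_idx


def encode(text, char_to_idx):
    """Encode input text, sending unseen characters to the unknown index."""
    unk = char_to_idx[UNKNOWN]
    return [char_to_idx.get(c, unk) for c in text]


char_to_idx = build_char_index("abc", "cd")
encoded = encode("abz", char_to_idx)  # 'z' was never seen, so it maps to 0
```

With this layout the unknown index can appear in encoded input, while the target text used for training would ideally contain only known characters.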

Mar 05 '18 19:03 jvdzwaan

I've solved this issue by adding another param with chars to include.

BTW, do you want me to contribute these changes? I feel like they are very specific to my needs, but if you like...
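The actual change is not shown in this thread. As a rough illustration only, a parameter with chars to include could look like the sketch below, where `collect_chars` and `extra_chars` are hypothetical names rather than part of ochre.

```python
def collect_chars(texts, extra_chars=""):
    """Return the sorted set of chars in `texts` plus any caller-supplied extras.

    `extra_chars` (hypothetical parameter) lets the caller force rare
    characters into the vocabulary even if the training data lacks them.
    """
    chars = set(extra_chars)
    for text in texts:
        chars.update(text)
    return sorted(chars)


all_chars = collect_chars(["ab", "bc"], extra_chars="é")
```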

Mar 09 '18 07:03 omrishsu
