language-learning
MS-ANSI chars break sentence splitter in pre-cleaner
The MS-ANSI (Windows-1252) characters \x91, \x92, \x93, and \x94, present in one Gutenberg Children's file (9255-0.txt), break the sentence splitter, which is part of the pre-cleaner.
Solved by creating an MS-ANSI to UTF-8 converter for those characters in PR https://github.com/singnet/language-learning/pull/28
The new converter breaks other files: it apparently also changes other characters, which the sentence splitter then no longer recognizes.
The MS-ANSI characters that are not recognized as UTF-8 are framed in green in the table at: https://en.wikipedia.org/wiki/Windows-1252#Character_set
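One way to convert stray MS-ANSI bytes without touching text that is already valid UTF-8 might look like the sketch below. It is not the converter from the PR above. It assumes the input files are mostly UTF-8 with occasional Windows-1252 bytes mixed in, and the names `cp1252_fallback` and `to_utf8_text` are hypothetical. The idea is to decode as UTF-8 and fall back to Windows-1252 only for the bytes that fail to decode, so bytes such as \x91 that legitimately occur inside multi-byte UTF-8 sequences are left alone.

```python
import codecs


def cp1252_fallback(err):
    """Error handler: decode only the offending bytes as Windows-1252."""
    if isinstance(err, UnicodeDecodeError):
        bad = err.object[err.start:err.end]
        # Bytes undefined in Windows-1252 (e.g. 0x81) become U+FFFD.
        return bad.decode("cp1252", errors="replace"), err.end
    raise err


# Hypothetical handler name, registered once at import time.
codecs.register_error("cp1252_fallback", cp1252_fallback)


def to_utf8_text(raw: bytes) -> str:
    """Decode as UTF-8, falling back to Windows-1252 for invalid bytes."""
    return raw.decode("utf-8", errors="cp1252_fallback")


if __name__ == "__main__":
    # \x93 and \x94 are the MS-ANSI curly double quotes.
    print(to_utf8_text(b"\x93Hello,\x94 she said."))
```

Whether this avoids the breakage reported for the other files is untested here; it only illustrates limiting the conversion to bytes that are not valid UTF-8.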