
MS-ANSI chars break sentence splitter in pre-cleaner

Open glicerico opened this issue 6 years ago • 3 comments

The MS-ANSI characters \x91, \x92, \x93 and \x94, present in one Gutenberg Children's file (9255-0.txt), break the sentence splitter, which is part of the pre-cleaner.

glicerico avatar Jun 18 '18 13:06 glicerico

Solved by creating an MS-ANSI to UTF-8 converter for those characters in PR https://github.com/singnet/language-learning/pull/28

glicerico avatar Jun 18 '18 13:06 glicerico
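
For illustration, a targeted conversion of just those four characters could look like the sketch below. This is not the code from the PR: the helper name `fix_ms_ansi_quotes` is hypothetical, and it assumes the file has already been read into a Python `str` in which the stray bytes show up as the code points U+0091 to U+0094. The replacements are the characters that Windows-1252 assigns to those byte values.

```python
# Hypothetical helper, not the converter from the PR.
# Assumption: the text is already a str and the MS-ANSI bytes appear
# in it as the C1 control code points U+0091..U+0094.
MS_ANSI_QUOTES = {
    "\x91": "\u2018",  # left single quotation mark
    "\x92": "\u2019",  # right single quotation mark
    "\x93": "\u201c",  # left double quotation mark
    "\x94": "\u201d",  # right double quotation mark
}

# str.translate needs a table keyed by code point; maketrans builds it.
_QUOTE_TABLE = str.maketrans(MS_ANSI_QUOTES)


def fix_ms_ansi_quotes(text: str) -> str:
    """Replace only the four MS-ANSI quote characters, leaving the rest intact."""
    return text.translate(_QUOTE_TABLE)
```

Restricting the table to these four characters leaves every other character in the file untouched.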

The new converter breaks other files: it apparently changes other characters into ones that the sentence splitter no longer recognizes.

glicerico avatar Jun 18 '18 13:06 glicerico

The MS-ANSI chars that UTF-8 does not recognize are framed in green in the table at: https://en.wikipedia.org/wiki/Windows-1252#Character_set

glicerico avatar Jun 18 '18 14:06 glicerico
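
As a rough way to enumerate such characters without copying the table by hand, the sketch below builds a mapping from Python's built-in `cp1252` codec. It assumes the characters in question are those in the 0x80 to 0x9F byte range, where Windows-1252 differs from ISO-8859-1; whether that matches the green-framed cells exactly is not verified here. The function name `build_c1_to_cp1252_map` is hypothetical.

```python
def build_c1_to_cp1252_map() -> dict:
    """Map each C1 code point (U+0080..U+009F) to its Windows-1252 character.

    Assumption: the problematic characters are exactly the bytes 0x80..0x9F.
    Python's cp1252 codec leaves a few of these bytes undefined, so they are
    skipped rather than guessed.
    """
    mapping = {}
    for byte in range(0x80, 0xA0):
        try:
            mapping[chr(byte)] = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            # Byte has no assignment in Python's cp1252 codec.
            continue
    return mapping


if __name__ == "__main__":
    for src, dst in build_c1_to_cp1252_map().items():
        print(f"\\x{ord(src):02x} -> U+{ord(dst):04X}")
```

The resulting dictionary can be passed to `str.maketrans` and `str.translate` to convert all mapped characters in one pass.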