
Character encoding issues in boilerplate processing

tfmorris opened this issue 9 years ago · 2 comments

The output from the boilerplate processor, e.g. /dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Java_Defaults_CleanEvalHTMLTestSubset/105.txt, appears to use a character encoding other than UTF-8. This causes strings such as Epogen® and “A-thal” to be corrupted.

tfmorris commented on Apr 04 '16

After downloading and looking at the original data set, it turns out that the character set decoding is being done wrong on the input side. The output side looks like it is correctly writing UTF-8, but the characters are already corrupted by then.

In the particular case of 105.html the source encoding is windows-1252. Rather than using the file utilities to read the file into a string, the HTML parser should be allowed to parse the byte stream directly and use the encoding that it finds there. The current scheme will corrupt all non-UTF-8 documents.
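For illustration, here is a minimal sketch of the byte-stream approach, assuming Jsoup as the parser (which the project uses) and the 105.html file from this issue; the commented-out lines show the String-based pattern that corrupts non-UTF-8 input:

```java
import java.io.File;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseWithDetectedCharset {
    public static void main(String[] args) throws Exception {
        // Buggy pattern: reading the bytes into a String with a fixed
        // (or platform-default) charset corrupts windows-1252 input,
        // e.g. the 0xAE byte for the (R) sign, before the parser sees it:
        //
        //   String html = FileUtils.readFileToString(new File("105.html"));
        //   Document doc = Jsoup.parse(html);

        // Passing null as the charset name lets Jsoup sniff the encoding
        // itself from the BOM or a <meta> charset declaration, falling
        // back to UTF-8 only if neither is present.
        Document doc = Jsoup.parse(new File("105.html"), null);
        System.out.println(doc.text());
    }
}
```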

tfmorris commented on Apr 08 '16

And to follow up on my last comment, this only affects the standalone program, not the Hadoop processing. Rather than letting Jsoup do the character set determination, I decided to keep the API the same and use the same character encoding detection that the Hadoop processing does in the standalone program.
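Roughly, the detect-then-decode shape looks like the sketch below. The thread does not say which detector the Hadoop code uses, so ICU4J's CharsetDetector is only a stand-in for it here; the point is that the bytes are decoded with a detected charset so the String-based API can stay unchanged:

```java
import java.nio.file.Files;
import java.nio.file.Paths;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class DetectThenDecode {
    public static void main(String[] args) throws Exception {
        byte[] raw = Files.readAllBytes(Paths.get("105.html"));

        // Guess the charset from the raw bytes instead of assuming UTF-8.
        CharsetDetector detector = new CharsetDetector();
        detector.setText(raw);
        CharsetMatch match = detector.detect();
        String charset = match.getName(); // "windows-1252" for 105.html

        // Decode with the detected charset, so the existing String-based
        // boilerplate API keeps working on non-UTF-8 documents.
        String html = new String(raw, charset);
        System.out.println(charset + ": " + html.length() + " chars");
    }
}
```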

I've got a PR with fixes for all the boilerplate problems that I've seen (and some related stuff).

tfmorris commented on Apr 09 '16