HTML entities not decoded
Comparing these two files:
- /dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Java_Defaults_CleanEvalHTMLTestSubset/105.txt
- /dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Python_Defaults_CleanEvalHTMLTestSubset/105.txt
It appears that the Python program is dropping some entities but leaving others, such as &lt;, undecoded. The gold standard doesn't include any HTML entities, naturally. I'd argue that the correct approach is to decode all HTML entities into their equivalent Unicode characters, even though this differs from what the original Python program did.
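To make "decode all HTML entities" concrete, here is a minimal sketch using Python's standard library. The function name `decode_entities` and the sample string are made up for illustration; this is not the code of either JusText implementation.

```python
import html


def decode_entities(text: str) -> str:
    """Replace named and numeric character references with Unicode characters.

    Hypothetical helper for illustration only; it simply wraps html.unescape.
    """
    return html.unescape(text)


# Made-up sample mixing named, decimal, and hexadecimal references.
sample = "Fish &amp; chips &lt; 5&nbsp;&euro; &#169; caf&#xE9;"
decoded = decode_entities(sample)
print(decoded)
```

Note that &nbsp; decodes to U+00A0 (no-break space), not an ordinary space, so a comparison against the gold standard may still need whitespace normalization.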
+1 ;)
I've submitted a fix for this. When the full CleanEval corpus is re-run, I'd suggest having it generate the minimal HTML tags, since the tags are included in the gold standard.
I'm going to revise my opinion about the "correct approach" and turn it into a question. The gold standard doesn't entity-encode less-than (<) or ampersand (&) characters, which means that it's not legal X(HT)ML (it also uses made-up tags like <l> for lists). So there's a tension between doing what is useful for comparison with the gold standard and doing what's most convenient for consumers.
It's pretty clear that the text mode should be fully decoded, but should the minimal HTML mode match the gold standard or produce legal XML? Is a third mode needed?
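As a rough sketch of what the three candidate modes could look like for a single paragraph: the mode names (`text`, `gold_html`, `xml`) and the function `render_paragraph` are hypothetical, invented here to illustrate the question, and the <p> wrapper is only a stand-in for the minimal tags.

```python
import html
from xml.sax.saxutils import escape


def render_paragraph(raw: str, mode: str) -> str:
    """Render one paragraph of extracted content in a hypothetical output mode."""
    # Every mode starts from fully decoded text.
    text = html.unescape(raw)
    if mode == "text":
        # Plain text: fully decoded, no markup.
        return text
    if mode == "gold_html":
        # Gold-standard style: minimal tags, but < and & left unescaped.
        return "<p>" + text + "</p>"
    if mode == "xml":
        # Legal XML: re-escape &, < and > inside the element content.
        return "<p>" + escape(text) + "</p>"
    raise ValueError("unknown mode: " + mode)


raw = "a &lt; b &amp;&amp; c"
for mode in ("text", "gold_html", "xml"):
    print(mode, "->", render_paragraph(raw, mode))
```

In this sketch the gold-style and XML modes differ only in the final escaping step, so a third mode would be cheap to support if both uses matter.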