Boilerplate removal: heading post-processing is incorrect
The conditional here is wrong: https://github.com/dkpro/dkpro-c4corpus/blob/master/dkpro-c4corpus-boilerplate/src/main/java/de/tudarmstadt/ukp/dkpro/c4corpus/boilerplate/impl/JusTextBoilerplateRemoval.java#L350. It causes the algorithm to attempt to reclassify non-headings, not just headings. The inverted conditionals, used just to save a little indentation, make my head hurt and are error-prone, so I'd recommend using straightforward logic that matches the algorithm description. In this case, instead of:
    if (!(paragraph.isHeading() && paragraph.getClassType().equalsIgnoreCase("bad")
            && !paragraph.getContextFreeClass().equalsIgnoreCase("bad"))) {
        continue;
    }
use:
    if (paragraph.isHeading() && paragraph.getClassType().equalsIgnoreCase("bad")
            && !paragraph.getContextFreeClass().equalsIgnoreCase("bad")) {
        // existing heading reclassification logic goes here
    }
The current code goes pathologically wrong on documents with a large number of empty elements (45,000 "paragraphs", many of which were consecutive <br> elements, in the example I looked at). Empty elements add nothing to the character distance, so the 200-character limit is never reached to trigger the loop exit. Each candidate then scans forward to the end of the document, giving roughly O(n²) processing of 45,000 elements.
This suggests a couple of other possible improvements:
- compress runs of more than 2 <br> elements
- introduce a maximum element-count distance limit in addition to the maximum character distance limit
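
To make the two suggestions concrete, here is a minimal sketch of what I have in mind. It is not the project's code: `Para` is a hypothetical stand-in for the real paragraph class, the method names are made up, and `MAX_ELEMENT_DISTANCE = 50` is an arbitrary value picked for illustration. Only the 200-character limit comes from the existing behaviour described above.

```java
import java.util.ArrayList;
import java.util.List;

class HeadingScanSketch {

    /** Hypothetical stand-in for the project's paragraph type. */
    static final class Para {
        final String text;
        final String classType;

        Para(String text, String classType) {
            this.text = text;
            this.classType = classType;
        }
    }

    static final int MAX_CHAR_DISTANCE = 200;   // existing character limit
    static final int MAX_ELEMENT_DISTANCE = 50; // assumed value, illustration only

    /**
     * Returns true if a "good" paragraph follows index i before either the
     * character limit or the element-count limit is exceeded.
     */
    static boolean goodParagraphFollows(List<Para> paragraphs, int i) {
        int charDistance = 0;
        for (int j = i + 1; j < paragraphs.size(); j++) {
            // The element-count check bounds the scan even when every
            // intermediate paragraph is empty and charDistance never grows.
            if (j - i > MAX_ELEMENT_DISTANCE || charDistance > MAX_CHAR_DISTANCE) {
                return false;
            }
            Para p = paragraphs.get(j);
            if ("good".equalsIgnoreCase(p.classType)) {
                return true;
            }
            charDistance += p.text.length();
        }
        return false;
    }

    /** Keeps at most maxRun consecutive empty paragraphs (e.g. runs of <br>). */
    static List<Para> compressEmptyRuns(List<Para> paragraphs, int maxRun) {
        List<Para> result = new ArrayList<>();
        int run = 0;
        for (Para p : paragraphs) {
            run = p.text.trim().isEmpty() ? run + 1 : 0;
            if (run <= maxRun) {
                result.add(p);
            }
        }
        return result;
    }
}
```

With the element-count limit in place, the inner scan does at most `MAX_ELEMENT_DISTANCE` steps per heading, so the worst case becomes linear in the number of paragraphs regardless of how many are empty. Compressing the empty runs first would shrink the input as well, but the limit alone should be enough to avoid the blow-up.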