dkpro-c4corpus

Boilerplate removal header post processing incorrect

Open tfmorris opened this issue 9 years ago • 0 comments

The conditional here is wrong: https://github.com/dkpro/dkpro-c4corpus/blob/master/dkpro-c4corpus-boilerplate/src/main/java/de/tudarmstadt/ukp/dkpro/c4corpus/boilerplate/impl/JusTextBoilerplateRemoval.java#L350. It causes the algorithm to attempt to reclassify non-headings, not just headings. Inverted conditionals written just to save a little indentation whitespace make my head hurt and are error-prone, so I'd recommend using normal logic that matches the algorithm descriptions, i.e. in this case, instead of:

        if (!(paragraph.isHeading() && paragraph.getClassType().equalsIgnoreCase("bad")
                && !paragraph.getContextFreeClass().equalsIgnoreCase("bad"))) {
            continue;
        }

use

        if (paragraph.isHeading() && paragraph.getClassType().equalsIgnoreCase("bad")
                && !paragraph.getContextFreeClass().equalsIgnoreCase("bad")) {

The current code goes pathologically wrong for documents with a large number of empty elements (45,000 "paragraphs", many of which were consecutive <br> elements in the example I looked at). Because empty elements add nothing to the character count, the 200-character distance limit is never reached to trigger the loop exit, so each reclassification attempt scans to the end of the document, giving roughly O(n^2) processing of 45,000 elements.
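To make the intended control flow and the cost concrete, here is a minimal sketch of the heading pass written with the positive conditional. It is not the project's code: `Para`, `HeadingPass`, and `reviseHeadings` are hypothetical names, and promoting the heading to "good" when a good paragraph is found within the limit is an assumption based on the algorithm description. The method returns the number of following paragraphs it inspected so the cost is visible.

```java
import java.util.List;

// Hypothetical stand-in for the project's paragraph type (assumption, not the real class).
final class Para {
    final boolean heading;
    final String contextFreeClass;
    final String text;
    String classType;

    Para(boolean heading, String contextFreeClass, String classType, String text) {
        this.heading = heading;
        this.contextFreeClass = contextFreeClass;
        this.classType = classType;
        this.text = text;
    }
}

final class HeadingPass {
    // The 200-character distance limit mentioned in this issue.
    static final int MAX_HEADING_DISTANCE = 200;

    // Reclassifies "bad" headings that are followed closely by a "good" paragraph.
    // Returns how many following paragraphs were inspected in total.
    static long reviseHeadings(List<Para> paragraphs) {
        long inspected = 0;
        for (int i = 0; i < paragraphs.size(); i++) {
            Para p = paragraphs.get(i);
            // Positive form: only headings that are "bad" in context
            // but not "bad" context-free are candidates.
            if (p.heading && p.classType.equalsIgnoreCase("bad")
                    && !p.contextFreeClass.equalsIgnoreCase("bad")) {
                int distance = 0;
                for (int j = i + 1;
                     j < paragraphs.size() && distance <= MAX_HEADING_DISTANCE;
                     j++) {
                    Para next = paragraphs.get(j);
                    inspected++;
                    if (next.classType.equalsIgnoreCase("good")) {
                        p.classType = "good"; // assumed promotion rule
                        break;
                    }
                    // Empty paragraphs add 0 here, so the limit is never reached.
                    distance += next.text.length();
                }
            }
        }
        return inspected;
    }
}
```

In this sketch, a candidate followed only by empty paragraphs walks to the end of the list, because `distance` stays at 0. If the outer condition also lets non-headings through, that full scan happens for nearly every element.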

This suggests a couple of other possible improvements:

  • compress runs of more than 2 <br> elements
  • introduce a maximum element-count distance limit in addition to the maximum character-count limit
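Both suggestions can be sketched roughly as follows. This is an illustration only: `BoilerplateTweaks`, `compressBrRuns`, `findGoodAfter`, and the value 50 for `MAX_ELEMENT_DISTANCE` are all hypothetical, and the sketch works on plain lists of strings rather than the project's paragraph objects.

```java
import java.util.ArrayList;
import java.util.List;

final class BoilerplateTweaks {
    static final int MAX_CHAR_DISTANCE = 200;    // limit mentioned in this issue
    static final int MAX_ELEMENT_DISTANCE = 50;  // hypothetical value, would need tuning

    // Keeps at most two consecutive "<br>" entries; all other entries pass through.
    static List<String> compressBrRuns(List<String> elements) {
        List<String> out = new ArrayList<>();
        int run = 0;
        for (String e : elements) {
            run = "<br>".equalsIgnoreCase(e) ? run + 1 : 0;
            if (run <= 2) {
                out.add(e);
            }
        }
        return out;
    }

    // Returns the index of the first "good" paragraph after start that lies within
    // BOTH the character limit and the element limit, or -1 if there is none.
    static int findGoodAfter(List<String> classTypes, List<String> texts, int start) {
        int chars = 0;
        int seen = 0;
        for (int j = start + 1;
             j < classTypes.size()
                     && chars <= MAX_CHAR_DISTANCE
                     && seen < MAX_ELEMENT_DISTANCE;
             j++, seen++) {
            if ("good".equalsIgnoreCase(classTypes.get(j))) {
                return j;
            }
            chars += texts.get(j).length();
        }
        return -1;
    }
}
```

With the element limit in place, a scan over empty paragraphs stops after `MAX_ELEMENT_DISTANCE` entries even though the character count never grows, which bounds the work per heading by a constant.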

tfmorris, Apr 10 '16 20:04