dkpro-c4corpus Make Java JusText implementation match Python and/or document differences

The differences between the Java and Python implementations were explained as largely an artifact of different XML parsers in a reply to #23, but I think there's more to it than that. I think the differences in the output of the two implementations should be explainable, and preferably should be improvements.

Some differences that I know of (some have already been reported as bugs) include:

header postprocessing implemented incorrectly (#36)
<select> elements not automatically tagged as boilerplate as described in the algorithm description at http://corpus.tools/wiki/Justext/Algorithm
HTML entity decoding not done (#30)
min/max lengths implemented as doubles instead of integers (probably doesn't affect the output, but it seems an unnecessary deviation)
<textarea> is marked as ignorable rather than a block level separator
block level tags are computed using JSoup's Element.isBlock() method, rather than the list of tags defined by the jusText algorithm resulting in a different tag set being used for paragraph splitting. The sets have substantial overlap and JSoup's may be better, but I'm not sure the difference is anything other than arbitrary.
<br><br> handling is different

Bottom line - what's implemented is not the JusText algorithm as documented.

Apr 10 '16 20:04 tfmorris

Another significant difference is the paragraph detection/segmentation.

The Python implementation uses a very simple algorithm. Any "block level" element starts a new paragraph on both its start tag and its end tag. All text seen until another new paragraph is started gets added. Empty paragraphs are dropped.

The Java implementation uses a complicated backtracking algorithm that attempts to find a common ancestor between two nodes, then inspects all the elements along the paths to the common ancestor. This is O(n!) in tree depth which normally isn't a huge deal (although still wasteful), but in pathological deeply nested cases it can cause mappers to never complete.

I'm not sure what problem, real or imagined, the Java implementation was designed to solve, but the simple Python version seems to perform just fine (and in linear time & space).

Apr 12 '16 18:04 tfmorris

Thanks for these very helpful insights! I agree that the implementation might be far from being optimal. Deviations are there, indeed, but since the results were comparable to the Python implementation, we didn't spend that much time on fine tuning (also given constrained project resources). I'm leaving that issue open to be considered for the next release.

Apr 15 '16 14:04 habernal

We're happy about any contributions :)

Apr 15 '16 14:04 reckart

I'm not sure the current results are comparable to the Python version. In my cursory spot checks, I saw some significant differences.

Did you look, for example, at https://github.com/tfmorris/dkpro-c4corpus/blob/paragraphs/dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Java_Defaults_CleanEvalHTMLTestSubset/105.txt ? It currently preserves line breaks, which makes the diff difficult, but comparing Gold, Python, old Java, & new Java by eye shows some pretty significant differences.

BTW, in reference to my previous comment about ancestor processing for paragraph detection, I think I might have found the motivation for the fancy paragraph processing in the Java version. It looks suspiciously like this description of paragraphs from the HTML5 spec:

https://www.w3.org/TR/html/dom.html#paragraph

Apr 15 '16 14:04 tfmorris