dkpro-c4corpus icon indicating copy to clipboard operation
dkpro-c4corpus copied to clipboard

Make Java JusText implementation match Python and/or document differences

Open tfmorris opened this issue 9 years ago • 4 comments

The differences between the Java and Python implementations were explained as largely an artifact of different XML parsers in a reply to #23, but I think there's more to it than that. I think the differences in the output of the two implementations should be explainable, and preferably should be improvements.

Some differences that I know of (some have already been reported as bugs) include:

  • header postprocessing implemented incorrectly (#36)
  • <select> elements not automatically tagged as boilerplate as described in the algorithm description at http://corpus.tools/wiki/Justext/Algorithm
  • HTML entity decoding not done (#30)
  • min/max lengths implemented as doubles instead of integers (probably doesn't affect the output, but it seems an unnecessary deviation)
  • <textarea> is marked as ignorable rather than a block level separator
  • block level tags are computed using JSoup's Element.isBlock() method, rather than the list of tags defined by the jusText algorithm resulting in a different tag set being used for paragraph splitting. The sets have substantial overlap and JSoup's may be better, but I'm not sure the difference is anything other than arbitrary.
  • <br><br> handling is different

Bottom line - what's implemented is not the JusText algorithm as documented.

tfmorris avatar Apr 10 '16 20:04 tfmorris

Another significant difference is the paragraph detection/segmentation.

The Python implementation uses a very simple algorithm. Any "block level" element starts a new paragraph on both its start tag and its end tag. All text seen until another new paragraph is started gets added. Empty paragraphs are dropped.

The Java implementation uses a complicated backtracking algorithm that attempts to find a common ancestor between two nodes, then inspects all the elements along the paths to the common ancestor. This is O(n!) in tree depth which normally isn't a huge deal (although still wasteful), but in pathological deeply nested cases it can cause mappers to never complete.

I'm not sure what problem, real or imagined, the Java implementation was designed to solve, but the simple Python version seems to perform just fine (and in linear time & space).

tfmorris avatar Apr 12 '16 18:04 tfmorris

Thanks for these very helpful insights! I agree that the implementation might be far from being optimal. Deviations are there, indeed, but since the results were comparable to the Python implementation, we didn't spend that much time on fine tuning (also given constrained project resources). I'm leaving that issue open to be considered for the next release.

habernal avatar Apr 15 '16 14:04 habernal

We're happy about any contributions :)

reckart avatar Apr 15 '16 14:04 reckart

I'm not sure the current results are comparable to the Python version. In my cursory spot checks, I saw some significant differences.

Did you look, for example, at https://github.com/tfmorris/dkpro-c4corpus/blob/paragraphs/dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Java_Defaults_CleanEvalHTMLTestSubset/105.txt ? It currently preserves line breaks, which makes the diff difficult, but comparing Gold, Python, old Java, & new Java by eye shows some pretty significant differences.

BTW, in reference to my previous comment about ancestor processing for paragraph detection, I think I might have found the motivation for the fancy paragraph processing in the Java version. It looks suspiciously like this description of paragraphs from the HTML5 spec:

https://www.w3.org/TR/html/dom.html#paragraph

tfmorris avatar Apr 15 '16 14:04 tfmorris