Tom Morris
I'm going to revise my opinion about the "correct approach" and turn it into a question. The gold standard doesn't entity encode less than (`
Thanks for the quick answers! I'll leave this open to learn the results of the exact duplicates investigation, but I'm happy with everything else. One of the reasons I'm interested...
In case it's helpful, here are some stats from a sample segment that I was testing hashing code on: 154,000 total WARC records; 32,695 HTTP response with non-empty text after...
I've revised this PR to complete the fix for #27 and also fix #29 & #30.
I've added an example of the output from the new version for people to look at: https://github.com/tfmorris/dkpro-c4corpus/blob/paragraphs/dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Java_Defaults_CleanEvalHTMLTestSubset/105.txt Subjectively (and with a sample size of 1), the new version seems substantially...
I added a fix for #36 and fixed some other issues, but this needs to be rebased against the current master and is missing a couple of later commits that...
Oops, ignore the part about word 2 being all zero/one. It'll actually be the same as word 0 because the 32-bit hashcode gets shifted through twice to test "all 64"...
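To illustrate the wrap-around described above, here is a minimal sketch, not the project's actual hashing code. It assumes the 64-bit fingerprint is built by testing "bit i" of a 32-bit hash for i in 0..63 with a Java-style `int` shift (Java uses only the low 5 bits of the shift distance for `int`), and that the fingerprint is then viewed as four 16-bit words. The function names and the 16-bit word size are hypothetical.

```python
def java_int_shift_right(value: int, distance: int) -> int:
    """Emulate Java's `int >>> distance`: only the low 5 bits of the distance count."""
    return (value & 0xFFFFFFFF) >> (distance & 31)


def fingerprint_words(hash32: int) -> list[int]:
    """Test 64 bit positions of a 32-bit hash, then split the result into 16-bit words.

    Assumption: four 16-bit words, numbered from the least significant end.
    """
    fingerprint = 0
    for i in range(64):
        # For i >= 32 the shift distance wraps, so bit i repeats bit i - 32.
        if java_int_shift_right(hash32, i) & 1:
            fingerprint |= 1 << i
    return [(fingerprint >> (16 * w)) & 0xFFFF for w in range(4)]


words = fingerprint_words(0x12345678)
print([hex(w) for w in words])
```

Under these assumptions the upper 32 bits are a copy of the lower 32 bits, so word 2 equals word 0 and word 3 equals word 1.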
Another significant difference is the paragraph detection/segmentation. The Python implementation uses a very simple algorithm. Any "block level" element starts a new paragraph on both its start tag and its...
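As a rough sketch of that kind of segmentation (not the jusText code itself): the splitter below closes the current paragraph whenever a block-level element starts, and, on the assumption that the truncated sentence above ends with "end tag", also whenever one ends. The `BLOCK_TAGS` set is an illustrative subset and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Illustrative subset of block-level tags; the real list is an assumption here.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "ul", "ol", "li", "table", "tr", "td", "blockquote"}


class ParagraphSplitter(HTMLParser):
    """Collect text into paragraphs, breaking at block-level start and end tags."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = []

    def _flush(self):
        # Normalize whitespace and keep only non-empty paragraphs.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        # Assumption: the end tag also starts a new paragraph.
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)


def split_paragraphs(html: str) -> list[str]:
    parser = ParagraphSplitter()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text after the last tag
    return parser.paragraphs


print(split_paragraphs("<div>Intro <b>bold</b> text<p>Second</p>tail</div>"))
```

Inline elements such as `<b>` do not break a paragraph in this sketch, while text that trails a nested block ("tail" here) becomes a paragraph of its own.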
I'm not sure the current results _are_ comparable to the Python version. In my cursory spot checks, I saw some significant differences. Did you look, for example, at https://github.com/tfmorris/dkpro-c4corpus/blob/paragraphs/dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Java_Defaults_CleanEvalHTMLTestSubset/105.txt ?...
The fix needs improvement because, although it fixes the processing time issue, it can still exhaust heap in a constrained environment like a Hadoop cluster. I'm testing a revised version...