Tom Morris comments

Results: 686 comments by Tom Morris

HTML entities not decoded

I'm going to revise my opinion about the "correct approach" and turn it into a question. The gold standard doesn't entity encode less than (`

Questions on statistics

Thanks for the quick answers! I'll leave this open to learn the results of the exact duplicates investigation, but I'm happy with everything else. One of the reasons I'm interested...

Questions on statistics

In case it's helpful, here are some stats from a sample segment that I was testing hashing code on: 154,000 total WARC records; 32,695 HTTP responses with non-empty text after...

Fix O(n!) in tag depth issue

I've revised this PR to complete the fix for #27 and also fix #29 & #30.

I've added an example of the output from the new version for people to look at: https://github.com/tfmorris/dkpro-c4corpus/blob/paragraphs/dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Java_Defaults_CleanEvalHTMLTestSubset/105.txt Subjectively (and with a sample size of 1), the new version seems substantially...

Fix O(n!) in tag depth issue

I added a fix for #36 and fixed some other issues, but this needs to be rebased against the current master and is missing a couple of later commits that...

SimHash returning 32-bit results, not 64-bits

Oops, ignore the part about word 2 being all zero/one. It'll actually be the same as word 0 because the 32-bit hashcode gets shifted through twice to test "all 64"...
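The effect described above can be reproduced with a small sketch. This is not the project's actual SimHash code; the class and method names (`ShiftWrapDemo`, `spreadIntHash`, `spreadLongHash`, `word`) are hypothetical and exist only for illustration. It relies on one documented Java behaviour: shifting an `int` uses only the low 5 bits of the shift distance, so testing bit positions 32 to 63 of a 32-bit hash re-tests positions 0 to 31. Assuming the fingerprint is viewed as four 16-bit words, word 2 then equals word 0 and word 3 equals word 1.

```java
public class ShiftWrapDemo {

    // Hypothetical helper: builds a 64-bit value by testing 64 bit positions
    // of a 32-bit hash. For an int operand, Java masks the shift distance
    // to its low 5 bits, so positions i and i + 32 test the same bit.
    static long spreadIntHash(int hash) {
        long out = 0L;
        for (int i = 0; i < 64; i++) {
            if (((hash >> i) & 1) == 1) {
                out |= (1L << i);
            }
        }
        return out;
    }

    // Same loop with a genuine 64-bit hash: every position is distinct,
    // so the input is reproduced unchanged.
    static long spreadLongHash(long hash) {
        long out = 0L;
        for (int i = 0; i < 64; i++) {
            if (((hash >> i) & 1L) == 1L) {
                out |= (1L << i);
            }
        }
        return out;
    }

    // Extracts the 16-bit word at the given index (0 = least significant).
    static int word(long value, int index) {
        return (int) ((value >>> (16 * index)) & 0xFFFF);
    }

    public static void main(String[] args) {
        int h = "example".hashCode();
        long fp = spreadIntHash(h);
        System.out.println(Long.toHexString(fp));
        System.out.println("word 0 == word 2: " + (word(fp, 0) == word(fp, 2)));
        System.out.println("word 1 == word 3: " + (word(fp, 1) == word(fp, 3)));
    }
}
```

If this is what is happening, the upper half of each fingerprint carries no extra information, and a fix would need a hash function that actually produces 64 bits rather than a wider loop over a 32-bit one.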

Make Java JusText implementation match Python and/or document differences

Another significant difference is the paragraph detection/segmentation. The Python implementation uses a very simple algorithm. Any "block level" element starts a new paragraph on both its start tag and its...

Make Java JusText implementation match Python and/or document differences

I'm not sure the current results _are_ comparable to the Python version. In my cursory spot checks, I saw some significant differences. Did you look, for example, at https://github.com/tfmorris/dkpro-c4corpus/blob/paragraphs/dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Java_Defaults_CleanEvalHTMLTestSubset/105.txt ?...

O(n!) processing in tag name/path for Paragraph in dedupe code

The fix needs improvement because, although it fixes the processing time issue, it can still exhaust heap in a constrained environment like a Hadoop cluster. I'm testing a revised version...