
Fix simhash slicing and add tests. Fixes #19.

Open tfmorris opened this issue 9 years ago • 3 comments

This adds very basic tests for all the static methods in SimHashUtils and fixes the simhash slicing algorithm.

The fixed version keeps the current text representation, but I'd actually suggest switching to Longs instead of text and computing the slices using bitmasks. This would not only make the slice computation faster and easier to understand, but would also speed up sorting and comparisons during the shuffle phase of clustering (although it does require changes elsewhere in the system).
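To illustrate the bitmask idea, here is a minimal sketch, assuming a 64-bit simhash held in a `long` and split into equal-width bands. The class `SimHashSlices` and the method `slice` are hypothetical names made up for this example; they are not part of SimHashUtils or of the code in this PR.

```java
// Hypothetical helper for illustration only; not the project's SimHashUtils.
final class SimHashSlices {

    private SimHashSlices() {
    }

    /**
     * Splits a 64-bit hash into equal-width slices, highest-order slice first.
     * Assumption: the number of slices divides 64 evenly.
     */
    static long[] slice(long hash, int slices) {
        if (slices <= 0 || 64 % slices != 0) {
            throw new IllegalArgumentException("slices must be a positive divisor of 64");
        }
        int width = 64 / slices;
        // 1L << 64 wraps around to 1L << 0 in Java, so handle the full-width case explicitly
        long mask = (width == 64) ? -1L : (1L << width) - 1;
        long[] result = new long[slices];
        for (int i = 0; i < slices; i++) {
            int shift = 64 - width * (i + 1);
            // unsigned shift moves the wanted bits to the bottom, the mask drops the rest
            result[i] = (hash >>> shift) & mask;
        }
        return result;
    }

    public static void main(String[] args) {
        long hash = 0x0123456789ABCDEFL;
        for (long s : slice(hash, 4)) {
            System.out.println(Long.toHexString(s));
        }
    }
}
```

If slices like these were used as grouping keys, each value would presumably need to be paired with its slice index so that slices from different positions are never compared with each other.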

tfmorris avatar Mar 15 '16 18:03 tfmorris

Good catch, Tom! Since your contributions are getting non-trivial, I'd like to ask you to fill out our contributor license agreement. This is what we apply to all DKPro open-source software (after discussions with the legal department at the Darmstadt Technical University). Please consult http://dkpro.github.io/contributing/ (you can send it via e-mail: [email protected] ). Thanks :)

habernal avatar Mar 15 '16 19:03 habernal

I hope it did not scare you off, Tom :)

  • I'm adding this to the 1.0.1 milestone, as we want to keep the functionality of 1.0.0 exactly the same as in the LREC article.

habernal avatar Mar 21 '16 07:03 habernal

You didn't scare me off. :-) I just needed some quiet time to review the agreement.

Plus, as I suggested above, I've gone off on a different tack and implemented a new hashing scheme and built a little benchmarking framework so I can compare. I'll have more PRs in the pipe as soon as I stop playing around (and sign the CLA).

tfmorris avatar Mar 22 '16 00:03 tfmorris