Fix simhash slicing and add tests. Fixes #19.
This adds very basic tests for all the static methods in SimHashUtils and fixes the simhash slicing algorithm.
The fixed version keeps the current text representation, but I'd actually suggest switching to Longs instead of text and computing the slices using bitmasks. This would not only make the slice computation faster and easier to understand, but would also speed up sorting and comparisons during the shuffle phase of clustering (though it does require changes elsewhere in the system).
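To illustrate the Long-based idea, here is a minimal sketch of bitmask slicing. It assumes a 64-bit simhash and a band count that divides 64 evenly; the class and method names are hypothetical and are not part of SimHashUtils.

```java
// Hypothetical helper, not part of SimHashUtils: slices a 64-bit simhash
// held in a long, using shifts and a bitmask instead of substring operations.
final class LongSimHashSlices {

    private LongSimHashSlices() {
    }

    /**
     * Splits a 64-bit hash into equal-width slices, most significant slice first.
     * Assumption: bands divides 64 evenly (1, 2, 4, 8, 16, 32 or 64).
     */
    static long[] slices(long hash, int bands) {
        if (bands <= 0 || 64 % bands != 0) {
            throw new IllegalArgumentException("bands must be a positive divisor of 64");
        }
        int width = 64 / bands;
        // A shift by 64 would wrap around in Java, so handle the full-width case explicitly.
        long mask = (width == 64) ? -1L : (1L << width) - 1L;
        long[] result = new long[bands];
        for (int i = 0; i < bands; i++) {
            int shift = 64 - (i + 1) * width;
            // Unsigned shift so the sign bit is not smeared into the slice.
            result[i] = (hash >>> shift) & mask;
        }
        return result;
    }
}
```

For example, `LongSimHashSlices.slices(0x0123456789ABCDEFL, 4)` yields the four 16-bit slices `0x0123`, `0x4567`, `0x89AB` and `0xCDEF`. Each slice is a plain long, so sorting and equality checks become primitive comparisons rather than string comparisons.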
Good catch, Tom! Since your contributions are getting non-trivial, I'd like to ask you to fill out our contributor license agreement, which we use for all DKPro open-source software (after discussions with the legal department at the Darmstadt Technical University). Please consult http://dkpro.github.io/contributing/ (you can send it via e-mail: [email protected]). Thanks :)
I hope it did not scare you off, Tom :)
I'm adding this to the 1.0.1 milestone, as we want to keep the functionality of 1.0.0 identical to what is described in the LREC article.
You didn't scare me off. :-) I just needed some quiet time to review the agreement.
Plus, as I suggested above, I've gone off on a different tack: I've implemented a new hashing scheme and built a little benchmarking framework so I can compare the two. I'll have more PRs in the pipe as soon as I stop playing around (and sign the CLA).