dkpro-c4corpus

SimHash returning 32-bit results, not 64-bit

Open tfmorris opened this issue 9 years ago • 1 comment

Although the code and paper suggest that 64-bit hashes are being used, Java's Object.hashCode() method returns only 32 bits (an int). The good news is that the bug in #19 has no effect, since the upper 16 bits are always 0 (or perhaps all 1s, depending on sign-extension effects).

The bad news is that because bits 32-47 are all zeros (or perhaps evenly divided between all zeros and all ones), I suspect all (or at least half) of the documents will end up being clustered together, making for a very expensive O(n^2) comparison.

You can probably ignore PR #20 for now. It'll get subsumed into the larger rework necessary.

tfmorris • Mar 15 '16 22:03

Oops, ignore the part about word 2 (bits 32-47) being all zeros/ones. It will actually be the same as word 0 (bits 0-15), because the 32-bit hash code gets shifted through twice to test "all 64" bits, so the upper 32 bits will be duplicates of the lower 32 bits.
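The duplication can be reproduced with a minimal sketch. It assumes the hash is held in an `int` and each of the 64 positions is tested with a shift and mask; the class and method names (`HashWidthDemo`, `fingerprintFromInt`) are hypothetical and not taken from the project's code. For an `int` operand, Java uses only the low 5 bits of the shift distance, so shifting by 32..63 behaves like shifting by 0..31.

```java
public class HashWidthDemo {

    /**
     * Hypothetical helper: builds a 64-bit value by testing bit positions
     * 0..63 of a 32-bit hash that is held in an int.
     */
    static long fingerprintFromInt(int hash) {
        long fingerprint = 0L;
        for (int i = 0; i < 64; i++) {
            // With an int left operand, (hash >>> i) is evaluated as
            // (hash >>> (i & 31)), so positions 32..63 re-read bits 0..31.
            if (((hash >>> i) & 1) == 1) {
                fingerprint |= 1L << i;
            }
        }
        return fingerprint;
    }

    public static void main(String[] args) {
        int hash = "some shingle".hashCode(); // 32-bit result
        long fingerprint = fingerprintFromInt(hash);

        int lower = (int) fingerprint;          // bits 0..31
        int upper = (int) (fingerprint >>> 32); // bits 32..63

        System.out.println("lower 32 bits: " + Integer.toHexString(lower));
        System.out.println("upper 32 bits: " + Integer.toHexString(upper));
        System.out.println("halves equal:  " + (lower == upper));
    }
}
```

Under these assumptions the two halves always match, so a 64-bit fingerprint built this way carries only 32 bits of information.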

tfmorris • Mar 15 '16 23:03