HyperMinHash-java
HyperMinHash-java copied to clipboard
Question about similarity of multiple sketches
First of all, thanks a lot for your useful, nice and interesting Java implementation of HyperMinHash.
My question is about this: https://github.com/LiveRamp/HyperMinHash-java/blob/master/src/main/java/com/liveramp/hyperminhash/BetaMinHashCombiner.java#L54
As I read arXiv:1710.08436
, while mergeability is trivial, I think it's not trivial that Jaccard Index estimation for multiple (> 2) sets works properly.
- Does the estimation still have same accuracy as of 2-set Jaccard Index ?
- If so, is there any proof ?
Sorry for obscure question. Thanks again.
@ocadaruma Thanks for your question! It's certainly a very good one.
Does the estimation still have same accuracy as of 2-set Jaccard Index ?
While we haven't written up a mathematical proof, we've done a lot of testing internally at LiveRamp and empirically observed that Jaccard index estimation for multiple sets works just as well as for pairs of sets.
If so, is there any proof ?
I have some numbers on this that I'd be more than happy to share once I get some time to polish up the findings. I'll close this issue with that accuracy report and corresponding code.
This being said, you might find this blog post and the accompanying whitepaper helpful in that AdRoll also tested the same idea (but using MinHash and HLL separately instead of HyperMinHash) and both seem to imply the same results we found with our internal testing, although they do not explicitly discuss accuracy of multiple sets.
Hi there! I'm the author of arXiv:1710.08436. The short answer is that it does indeed still work, for the same reason that MinHash alone allows you to do multiway Jaccard.
The long answer is that the additional error correction we do (i.e. computing expected collisions) does not work as written for >2 sets. It's not particularly difficult to write up an algorithm for that (though it is nontrivial), but we didn't bother. The reason we didn't bother is that when you have >2 sets, the chance that there will be an additional collision introduced by HyperMinHash compression drops rapidly. Thus, the correction factor becomes less necessary. And, of course, so long as the number of additional collisions from using HyperMinHash is close to zero, HyperMinHash behaves exactly the same as MinHash for doing multiway Jaccard.
-Yun William Yu
Thanks @yunwilliamyu !
This is great, could it be incorporated in some FAQ.md document?
@sherifnada Great to hear that.
empirically observed that Jaccard index estimation for multiple sets works just as well as for pairs of sets.
That would be very helpful. I'm looking forward to that. Thanks.
I'll close this issue with that accuracy report and corresponding code.
And I will check AdRoll's paper as well.
@yunwilliamyu Thank you for detailed explanation ! I understood.