HyperMinHash-java Question about similarity of multiple sketches

First of all, thanks a lot for your useful, nice and interesting Java implementation of HyperMinHash.

My question is about this: https://github.com/LiveRamp/HyperMinHash-java/blob/master/src/main/java/com/liveramp/hyperminhash/BetaMinHashCombiner.java#L54

As I read arXiv:1710.08436, while mergeability is trivial, I think it's not trivial that Jaccard Index estimation for multiple (> 2) sets works properly.

Does the estimation still have same accuracy as of 2-set Jaccard Index ?
If so, is there any proof ?

Sorry for obscure question. Thanks again.

Jul 11 '19 07:07 ocadaruma

@ocadaruma Thanks for your question! It's certainly a very good one.

Does the estimation still have same accuracy as of 2-set Jaccard Index ?

While we haven't written up a mathematical proof, we've done a lot of testing internally at LiveRamp and empirically observed that Jaccard index estimation for multiple sets works just as well as for pairs of sets.

If so, is there any proof ?

I have some numbers on this that I'd be more than happy to share once I get some time to polish up the findings. I'll close this issue with that accuracy report and corresponding code.

This being said, you might find this blog post and the accompanying whitepaper helpful in that AdRoll also tested the same idea (but using MinHash and HLL separately instead of HyperMinHash) and both seem to imply the same results we found with our internal testing, although they do not explicitly discuss accuracy of multiple sets.

Jul 11 '19 18:07 sherifnada

Hi there! I'm the author of arXiv:1710.08436. The short answer is that it does indeed still work, for the same reason that MinHash alone allows you to do multiway Jaccard.

The long answer is that the additional error correction we do (i.e. computing expected collisions) does not work as written for >2 sets. It's not particularly difficult to write up an algorithm for that (though it is nontrivial), but we didn't bother. The reason we didn't bother is that when you have >2 sets, the chance that there will be an additional collision introduced by HyperMinHash compression drops rapidly. Thus, the correction factor becomes less necessary. And, of course, so long as the number of additional collisions from using HyperMinHash is close to zero, HyperMinHash behaves exactly the same as MinHash for doing multiway Jaccard.

-Yun William Yu

Jul 11 '19 19:07 yunwilliamyu

Thanks @yunwilliamyu !

Jul 11 '19 20:07 sherifnada

This is great, could it be incorporated in some FAQ.md document?

Jul 11 '19 21:07 joshk0

@sherifnada Great to hear that.

empirically observed that Jaccard index estimation for multiple sets works just as well as for pairs of sets.

That would be very helpful. I'm looking forward to that. Thanks.

I'll close this issue with that accuracy report and corresponding code.

And I will check AdRoll's paper as well.

@yunwilliamyu Thank you for detailed explanation ! I understood.

Jul 12 '19 15:07 ocadaruma

HyperMinHash-java HyperMinHash-java copied to clipboard

Question about similarity of multiple sketches

HyperMinHash-java
HyperMinHash-java copied to clipboard