gaoya
gaoya copied to clipboard
Optimization Ideas For MinHash
Noticed your comment on Hacker News about this repo. I worked on something similar at a previous job (so I don't have the code the share), but looked pretty deeply into optimizing minhash. Not sure if it's helpful or if you're already aware of these tricks, but I found them to significantly speed up my minhash algorithm (and they are fun to code :P).
- One Permutation Hashing. Basically use a single hash function in clever way instead of the typical 128. Made a pretty big different in speed during my testing.
- Densified One Permutation Hashing. A small tweak for handling the "empty bin" problem of one permutation hashing.
- Reservoir Sampling. A trick for using less memory and avoiding dynamic allocations. This last paper also has a summary of the first two.
Caveat: Seems like you're working on Super Min Hash, which I was never able to fully code myself. I don't know how it compares to the one-hash approach.
Thanks for your suggestions. I may have encountered some of these ideas in papers, but never took a deep look. Currently, I use MinHash for relatively short text documents (usually 50 - 200 characters), and generating min-hash signatures takes a small fraction of the whole pipeline. For bigger documents hashing can become a bottleneck. Memory usage is definitely an issue for me, which I mitigate by using smaller hashes (thanks to Rust generics this is easy). I should definitely look at Reservoir Sampling and One Permutation Hashing, (especially if they are fun to code :) )
Hello All,
Just a followup, the most recently densification is called optimal densification: two ways to perform the densification are optimal (RMSE is the same with the original, much slower, many hash function implementation): http://proceedings.mlr.press/v70/shrivastava17a.html or http://proceedings.mlr.press/v115/mai20a.html. Implementation here: https://github.com/jean-pierreBoth/probminhash/blob/master/src/densminhash.rs
@serega I have some interesting idea to combine minhash with graphs to do very large scale search. Let me know if you are interested.
Jianshu