
Recent updates for hyperloglog and MinHash-like algorithms for your book

Open jianshu93 opened this issue 1 year ago • 4 comments

Dear Andrii,

This is Jianshu, a Ph.D. student in CSE & Bioinformatics, studying probabilistic data structures and their applications in biological sequence analysis. I read your book on probabilistic data structures and algorithms for big data applications, which is a very detailed and thorough introduction to many probabilistic data structures from the very beginning. I love it very much, and I think it is the first book devoted to this topic, despite the fact that many data structures and algorithms books cover some of these structures but not all of them.

Since the book was written around 2019, it covers almost all of the cutting-edge algorithms in cardinality estimation and similarity search (MinHash et al.) up to that point. However, in recent years (including the two or three years just before 2019), there have been many new breakthroughs, especially related to HyperLogLog and MinHash.

  1. In 2017, Ertl proposed new estimators for HyperLogLog (https://arxiv.org/pdf/1706.07290.pdf): an improved estimator and a Maximum Likelihood Estimator (MLE), which is theoretically optimal (see item 2 below). The improved estimator is much better at small cardinalities and needs no bias correction like HyperLogLog++, but it still has large variance at very small and very large cardinalities. The MLE has the smallest relative variance both in theory and in practice; however, it is expensive to compute because it requires solving a three-dimensional optimization problem, which does not translate easily into a fast and robust algorithm. The cardinalities of two sets A and B can be used to estimate the Jaccard index via the inclusion-exclusion rule for similarity search (competing with MinHash, which also approximates the Jaccard index); this requires much less memory than MinHash for similar accuracy, but is slower. The paper also proposes a joint Maximum Likelihood Estimator (JMLE), which estimates the Jaccard index directly from the HyperLogLog sketches of A and B (by first estimating the intersection and the union); it is also theoretically optimal and has smaller variance than the inclusion-exclusion approach just mentioned, but it is even slower for Jaccard estimation while being the most accurate. Ertl's work builds on Daniel Ting's pioneering work here (https://dl.acm.org/doi/abs/10.1145/2939672.2939772) and on Cohen's work here (https://dl.acm.org/doi/abs/10.1145/3097983.3097999).
  2. Seth Pettie's team developed a more general framework called Tau-GRA (generalized remaining area), which includes LogLog, PCSA, and HyperLogLog as special cases. With Tau = 0.88989 it narrows the gap between the HyperLogLog estimator's relative variance and the Cramér-Rao lower bound (1.074/m) from about 0.4% (1.079/m) to about 0.04% (1.075/m), and it needs no correction for small cardinalities. The maximum likelihood estimator mentioned above meets the Cramér-Rao lower bound but is very expensive; Tau-GRA is much easier to compute and update than MLE methods, and although it is not optimal, its variance is very small. The paper is here: https://arxiv.org/abs/2208.10578 . This work builds on Daniel Ting's pioneering work here (https://dl.acm.org/doi/abs/10.1145/2623330.2623669).
  3. New MinHash algorithms: SuperMinHash, which improves the MSE of the original MinHash (https://arxiv.org/abs/1706.05698), and ProbMinHash (which estimates Jp, not exactly the Jaccard index), which takes into account set multiplicity/abundance and set size (e.g., several copies of the same element in a set each have to be considered, or relative frequencies, because it also normalizes by the total set size) (https://ieeexplore.ieee.org/abstract/document/9185081?casa_token=R8ncquggKvsAAAAA:JU4MPQOraTFHfBWCvufavLjEhIN67N269yu9-MVDr29-6kzE0iJ9h9RK-brFvhnRMVNoAvLQ). BagMinHash (https://dl.acm.org/doi/abs/10.1145/3219819.3220089) and DartMinHash (https://arxiv.org/abs/2005.11547) both take into account the weights of elements in a set (but not set size); these two approximate Jw instead of the simple Jaccard index. The latter is the fastest weighted MinHash algorithm.
  4. B-bit one-permutation MinHash with optimal densification, the theoretically fastest MinHash algorithm; it builds on several earlier papers but is summarized as a whole in this paper: http://proceedings.mlr.press/v70/shrivastava17a/shrivastava17a.pdf
  5. A new data structure, SetSketch, which combines the benefits of MinHash and HyperLogLog and is both fast and space efficient. The interesting point of the paper is that MinHash and HLL can actually be merged into a single data structure. Several estimators are also derived; see the paper here: https://vldb.org/pvldb/vol14/p2244-ertl.pdf
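To make the comparison in item 1 concrete, here is a minimal, illustrative Python sketch (my own toy code, not the optimized estimators from the papers above): a tiny HyperLogLog with the classic raw estimator plus the small-range linear-counting correction, whose merged cardinalities give a Jaccard estimate via inclusion-exclusion, next to a classic k-hash MinHash that approximates the Jaccard index directly. All names, parameters, and the salted-hash trick for MinHash are assumptions for illustration only.

```python
import hashlib
import math

def h64(x):
    """Stable 64-bit hash of a string (unlike built-in hash, stable across runs)."""
    return int.from_bytes(hashlib.blake2b(x.encode(), digest_size=8).digest(), "big")

class HyperLogLog:
    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.regs = [0] * self.m

    def add(self, item):
        x = h64(item)
        j = x >> (64 - self.p)                     # first p bits choose a register
        w = x & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        rho = (64 - self.p) - w.bit_length() + 1   # position of the leftmost 1-bit
        self.regs[j] = max(self.regs[j], rho)

    def count(self):
        # Raw estimator with the standard small-range (linear counting) correction.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        e = alpha * self.m * self.m / sum(2.0 ** -r for r in self.regs)
        zeros = self.regs.count(0)
        if e <= 2.5 * self.m and zeros:
            e = self.m * math.log(self.m / zeros)
        return e

    def merge(self, other):
        # Register-wise max gives the sketch of the union of the two sets.
        out = HyperLogLog(self.p)
        out.regs = [max(a, b) for a, b in zip(self.regs, other.regs)]
        return out

def jaccard_incl_excl(hll_a, hll_b):
    """Jaccard via inclusion-exclusion on HLL cardinalities: |A ∩ B| = |A| + |B| - |A ∪ B|."""
    a, b = hll_a.count(), hll_b.count()
    union = hll_a.merge(hll_b).count()
    inter = max(a + b - union, 0.0)
    return inter / union if union else 0.0

def minhash_jaccard(set_a, set_b, k=128):
    """Classic k-hash MinHash: the fraction of matching minima estimates Jaccard."""
    mins_a = [min(h64(f"{i}:{x}") for x in set_a) for i in range(k)]
    mins_b = [min(h64(f"{i}:{x}") for x in set_b) for i in range(k)]
    return sum(a == b for a, b in zip(mins_a, mins_b)) / k

A = {f"kmer{i}" for i in range(2000)}
B = {f"kmer{i}" for i in range(1000, 3000)}  # true Jaccard = 1000/3000 ≈ 0.333
ha, hb = HyperLogLog(), HyperLogLog()
for x in A: ha.add(x)
for x in B: hb.add(x)
print(round(jaccard_incl_excl(ha, hb), 3), round(minhash_jaccard(A, B), 3))
```

Both estimates land near the true Jaccard of about 0.333, and the HLL route uses only the two (mergeable) sketches, which is exactly the memory-versus-speed trade-off described in item 1.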

I am wondering whether there will be a new edition of the book in the coming years. If so, these topics could be very relevant and interesting to many readers, and I would be happy to be part of it.

Looking forward to your reply.

Thanks,

Jianshu

jianshu93 avatar Jun 04 '23 04:06 jianshu93