streaming_algorithms icon indicating copy to clipboard operation
streaming_algorithms copied to clipboard

better HLL estimator

Open jianshu93 opened this issue 1 year ago • 8 comments

Hello Team,

It seems that new cardinality estimator paper has two new estimators, the improved estimator and MLE estimator, better than the one originally proposed, see paper here: https://arxiv.org/pdf/1706.07290.pdf

Just wondering whether it worth trying, C++ version can be found in the paper.

Thank you,

Jianshu

jianshu93 avatar Dec 11 '23 21:12 jianshu93

@jianshu93 FYI, I implemented that estimator in my hyperloglog - it is still slow AF and only statistically better for what entails the union estimation. Maybe with better optimizers, one may get better performance, but when there is a need to compute logarithms in a core loop things can only go that fast.

LucaCappelletti94 avatar Aug 15 '24 13:08 LucaCappelletti94

Hi @LucaCappelletti94,

Thanks for letting me know! Did you benchmarked against the MLE and improved estimator in Ertl's paper, as implemented in SourMash (https://docs.rs/sourmash/0.15.0/sourmash/sketch/hyperloglog/estimators/index.html)? This is a direct import of the C++ version from the original paper. It would be nice to have a benchmark and I would be happy to play with your implementation.

Best,

Jianshu

jianshu93 avatar Aug 15 '24 13:08 jianshu93

I am running several benchmarks, I can add also that one in. Thanks for pointing that one out.

LucaCappelletti94 avatar Aug 15 '24 13:08 LucaCappelletti94

And also this one, hypertwobits(https://github.com/axiomhq/hypertwobits/tree/main), a very new one, paper was published just last month. I am a little bit concerned about it since the benchmarks showed very large variation even than the original hyperloglog. Let me know your thoughts on it also.

Jianshu

jianshu93 avatar Aug 15 '24 14:08 jianshu93

Hi @LucaCappelletti94,

Thanks for letting me know! Did you benchmarked against the MLE and improved estimator in Ertl's paper, as implemented in SourMash (https://docs.rs/sourmash/0.15.0/sourmash/sketch/hyperloglog/estimators/index.html)? This is a direct import of the C++ version from the original paper. It would be nice to have a benchmark and I would be happy to play with your implementation.

Best,

Jianshu

FYI, I have just read their joint estimation algo and it does not match the one from Otmar. I am quite sure of the correctness of my implementation since I bothered Otmar several times to check it together, so the only thing that remains to verify is which one is more accurate and their speeds.

LucaCappelletti94 avatar Aug 15 '24 14:08 LucaCappelletti94

And also this one, hypertwobits(https://github.com/axiomhq/hypertwobits/tree/main), a very new one, paper was published just last month. I am a little bit concerned about it since the benchmarks showed very large variation even than the original hyperloglog. Let me know your thoughts on it also.

Jianshu

Yeah, those results seem sus, especially for a 2-bit register thinghy. We'll see in a bit, currently adding to lineup.

LucaCappelletti94 avatar Aug 15 '24 14:08 LucaCappelletti94

I would also be interested in UltraLogLog (https://dl.acm.org/doi/abs/10.14778/3654621.3654632) and ExaLogLog (https://arxiv.org/abs/2402.13726). But no implementation is available in Rust. All Ertl's implementation is in the hash4j java library, which I do not like. Since you have so many benchmarks there, I see it as a paper describing and available HLL-like implementations.

Jianshu

jianshu93 avatar Aug 15 '24 14:08 jianshu93

I would also be interested in UltraLogLog (https://dl.acm.org/doi/abs/10.14778/3654621.3654632) and ExaLogLog (https://arxiv.org/abs/2402.13726). But no implementation is available in Rust. All Ertl's implementation is in the hash4j java library, which I do not like.

Jianshu

Samesies on Java. A bit hard to do a fair comparison there and I do not have the time to re-implement it, at least, not now.

LucaCappelletti94 avatar Aug 15 '24 14:08 LucaCappelletti94