presto Add KLL Sketch Functions

Description

This change adds KLL Sketch support from the Apache DataSketches library. One of the benefits of KLL sketches is that the implementation supports more datatypes than the existing quantile digest and T-Digest implementations. It also has provably-bounded error compared to the T-Digest

The following datatype is added

kllsketch(T): Basically a wrapper around varbinary, similar to tdigest and qdigest types

The following functions are added

sketch_kll: computes a KLL sketch
sketch_kll_with_k: computes a kll sketch with the given value for k
sketch_kll_quantile: queries the sketch for a particular quantile
sketch_kll_rank: queries the sketch for a particular rank (inverse of the quantile function)

Motivation and Context

Eventual use by the optimizer after #21236. But also good to have since HMS is releasing support for them soon.

Impact

Two new aggregation functions, two new scalar functions

Test Plan

Standard aggregation unit tests and scalar function unit tests

Contributor checklist

[x] Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
[x] PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
[x] Documented new properties (with its default value), SQL syntax, functions, or other functionality.
[x] If release notes are required, they follow the release notes guidelines.
[x] Adequate tests were added if applicable.
[x] CI passed.

Release Notes

== RELEASE NOTES ==

General Changes
* New support for Apache DataSketches KLL sketch with the sketch_kll and related family of functions

Dec 18 '23 23:12 ZacBlanco

Codenotify: Notifying subscribers in CODENOTIFY files for diff 0907cd8374b674783e6ab4415eca096dcbae8737...4df29adc1728d3051ea4cb07ca083c704804a871.

Notify	File(s)
@steveburnett	presto-docs/src/main/sphinx/functions/sketch.rst presto-docs/src/main/sphinx/language/types.rst

Dec 18 '23 23:12 github-actions[bot]

I think we need to document the serialized binary format, similar to T-Digest and QDigest. Otherwise, they're completely unreadable by any other system.

Feb 13 '24 16:02 tdcmeehan

I think we need to document the serialized binary format, similar to T-Digest and QDigest.

The serialized binary format is specified by the Apache Datasketches library. This is already noted in the docs changes.

https://github.com/prestodb/presto/pull/21568/files#diff-33bc3be2c487c146b213ef7b64ddc7c91956aa6e8b5d6207b217061312286b37R47-R48

If this is confusing I can try to update the phrasing

Feb 13 '24 16:02 ZacBlanco

I can't find where they explicitly document the binary format--is there a link for that?

Feb 13 '24 20:02 tdcmeehan

The binary format is not documented unfortunately. However, all versions (cpp, java, go, etc) of the datasketches libraries can serialize and deserialize the sketches. e.g. java can deserialize cpp serialized sketches, and vice versa. AFAIK this is the case for all sketches coming from the datasketches library.

Feb 13 '24 21:02 ZacBlanco

Just for some continuity on the discussion of sketch serialized format compatibility, we brought this to their community in hopes to clarify the guarantees that the maintainers provide. This resulted in some changes to the datasketches documentation in https://github.com/apache/datasketches-website/pull/162 and subsequently https://github.com/apache/datasketches-website/pull/163

Mar 04 '24 18:03 ZacBlanco

Possible follow up: if KLL has a provable error bound, could we use it to replace Q-Digest inside of approx_distinct? The only reason approx_distinct uses Q-Digest under the hood, in spite of its poor performance compared to T-Digest, is it has a provable error bound.

Mar 04 '24 18:03 tdcmeehan

I'm not familiar with the code for approx_distinct, but if the KLL sketch answers the same question as the Q-digest (e.g. what percentile does value X fall? or the inverse), then we should be able to replace it. Another option is you could replace the approx_distinct implementation entirely with a Theta sketch. I can't speak how the accuracy of a Theta sketch compares to the current approx_distinct implementation though.

This could be a good project/experiment for a new Presto contributor

Mar 04 '24 19:03 ZacBlanco

The binary format is defined here: https://github.com/apache/datasketches-java/blob/master/src/main/java/org/apache/datasketches/kll/KllPreambleUtil.java

Obviously the format for items cannot be defined for arbitrary type T, and the exact size of the levels array is not fixed. But the serialized header is defined, as is the order of the non-fixed-size/format data.

Mar 04 '24 19:03 jmalkin

@Yuhta, I don't understand your comment above (Jan 4th) "The serialized form is not very compact and can occupy more memory than it needs."

The KLL sketch always serializes (toByteArray()) into a compact form, which has no wasted space, and grows very gradually based on the number of items fed to and retained by the sketch. You can see the effect of this in the plot copied from our website above. The blue curve starts very small (8 bytes) and grows gradually until it reaches ~2.5K bytes at about 1M items.

It is true that if the sketch image were compressed it would become even smaller, but that can be costly in terms of time. That is not an option we currently offer as part of the sketch, but as long as one is using lossless compression algorithms, it is something you could do yourself.

If you could elaborate you concern a bit more it would help us understand the issue.

Thanks!

Mar 04 '24 21:03 leerho

presto presto copied to clipboard

Add KLL Sketch Functions

Description

Motivation and Context

Impact

Test Plan

Contributor checklist

Release Notes

presto
presto copied to clipboard