presto
presto copied to clipboard
Add KLL Sketch Functions
Description
This change adds KLL Sketch support from the Apache DataSketches library. One of the benefits of KLL sketches is that the implementation supports more datatypes than the existing quantile digest and T-Digest implementations. It also has provably-bounded error compared to the T-Digest
The following datatype is added
-
kllsketch(T)
: Basically a wrapper aroundvarbinary
, similar totdigest
andqdigest
types
The following functions are added
-
sketch_kll
: computes a KLL sketch -
sketch_kll_with_k
: computes a kll sketch with the given value for k -
sketch_kll_quantile
: queries the sketch for a particular quantile -
sketch_kll_rank
: queries the sketch for a particular rank (inverse of the quantile function)
Motivation and Context
Eventual use by the optimizer after #21236. But also good to have since HMS is releasing support for them soon.
Impact
Two new aggregation functions, two new scalar functions
Test Plan
Standard aggregation unit tests and scalar function unit tests
Contributor checklist
- [x] Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
- [x] PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
- [x] Documented new properties (with its default value), SQL syntax, functions, or other functionality.
- [x] If release notes are required, they follow the release notes guidelines.
- [x] Adequate tests were added if applicable.
- [x] CI passed.
Release Notes
== RELEASE NOTES ==
General Changes
* New support for Apache DataSketches KLL sketch with the sketch_kll and related family of functions
Codenotify: Notifying subscribers in CODENOTIFY files for diff 0907cd8374b674783e6ab4415eca096dcbae8737...4df29adc1728d3051ea4cb07ca083c704804a871.
Notify | File(s) |
---|---|
@steveburnett | presto-docs/src/main/sphinx/functions/sketch.rst presto-docs/src/main/sphinx/language/types.rst |
I think we need to document the serialized binary format, similar to T-Digest and QDigest. Otherwise, they're completely unreadable by any other system.
I think we need to document the serialized binary format, similar to T-Digest and QDigest.
The serialized binary format is specified by the Apache Datasketches library. This is already noted in the docs changes.
https://github.com/prestodb/presto/pull/21568/files#diff-33bc3be2c487c146b213ef7b64ddc7c91956aa6e8b5d6207b217061312286b37R47-R48
If this is confusing I can try to update the phrasing
I can't find where they explicitly document the binary format--is there a link for that?
The binary format is not documented unfortunately. However, all versions (cpp, java, go, etc) of the datasketches libraries can serialize and deserialize the sketches. e.g. java can deserialize cpp serialized sketches, and vice versa. AFAIK this is the case for all sketches coming from the datasketches library.
Just for some continuity on the discussion of sketch serialized format compatibility, we brought this to their community in hopes to clarify the guarantees that the maintainers provide. This resulted in some changes to the datasketches documentation in https://github.com/apache/datasketches-website/pull/162 and subsequently https://github.com/apache/datasketches-website/pull/163
Possible follow up: if KLL has a provable error bound, could we use it to replace Q-Digest inside of approx_distinct? The only reason approx_distinct uses Q-Digest under the hood, in spite of its poor performance compared to T-Digest, is it has a provable error bound.
I'm not familiar with the code for approx_distinct
, but if the KLL sketch answers the same question as the Q-digest (e.g. what percentile does value X fall? or the inverse), then we should be able to replace it. Another option is you could replace the approx_distinct
implementation entirely with a Theta sketch. I can't speak how the accuracy of a Theta sketch compares to the current approx_distinct
implementation though.
This could be a good project/experiment for a new Presto contributor
The binary format is defined here: https://github.com/apache/datasketches-java/blob/master/src/main/java/org/apache/datasketches/kll/KllPreambleUtil.java
Obviously the format for items cannot be defined for arbitrary type T, and the exact size of the levels array is not fixed. But the serialized header is defined, as is the order of the non-fixed-size/format data.
@Yuhta, I don't understand your comment above (Jan 4th) "The serialized form is not very compact and can occupy more memory than it needs."
The KLL sketch always serializes (toByteArray()
) into a compact form, which has no wasted space, and grows very gradually based on the number of items fed to and retained by the sketch. You can see the effect of this in the plot copied from our website above. The blue curve starts very small (8 bytes) and grows gradually until it reaches ~2.5K bytes at about 1M items.
It is true that if the sketch image were compressed it would become even smaller, but that can be costly in terms of time. That is not an option we currently offer as part of the sketch, but as long as one is using lossless compression algorithms, it is something you could do yourself.
If you could elaborate you concern a bit more it would help us understand the issue.
Thanks!