presto icon indicating copy to clipboard operation
presto copied to clipboard

Add KLL Sketch Functions

Open ZacBlanco opened this issue 1 year ago • 5 comments

Description

This change adds KLL Sketch support from the Apache DataSketches library. One of the benefits of KLL sketches is that the implementation supports more datatypes than the existing quantile digest and T-Digest implementations. It also has provably-bounded error compared to the T-Digest

The following datatype is added

  • kllsketch(T): Basically a wrapper around varbinary, similar to tdigest and qdigest types

The following functions are added

  • sketch_kll: computes a KLL sketch
  • sketch_kll_with_k: computes a kll sketch with the given value for k
  • sketch_kll_quantile: queries the sketch for a particular quantile
  • sketch_kll_rank: queries the sketch for a particular rank (inverse of the quantile function)

Motivation and Context

Eventual use by the optimizer after #21236. But also good to have since HMS is releasing support for them soon.

Impact

Two new aggregation functions, two new scalar functions

Test Plan

Standard aggregation unit tests and scalar function unit tests

Contributor checklist

  • [x] Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • [x] PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • [x] Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • [x] If release notes are required, they follow the release notes guidelines.
  • [x] Adequate tests were added if applicable.
  • [x] CI passed.

Release Notes

== RELEASE NOTES ==

General Changes
* New support for Apache DataSketches KLL sketch with the sketch_kll and related family of functions

ZacBlanco avatar Dec 18 '23 23:12 ZacBlanco

Codenotify: Notifying subscribers in CODENOTIFY files for diff 0907cd8374b674783e6ab4415eca096dcbae8737...4df29adc1728d3051ea4cb07ca083c704804a871.

Notify File(s)
@steveburnett presto-docs/src/main/sphinx/functions/sketch.rst
presto-docs/src/main/sphinx/language/types.rst

github-actions[bot] avatar Dec 18 '23 23:12 github-actions[bot]

I think we need to document the serialized binary format, similar to T-Digest and QDigest. Otherwise, they're completely unreadable by any other system.

tdcmeehan avatar Feb 13 '24 16:02 tdcmeehan

I think we need to document the serialized binary format, similar to T-Digest and QDigest.

The serialized binary format is specified by the Apache Datasketches library. This is already noted in the docs changes.

https://github.com/prestodb/presto/pull/21568/files#diff-33bc3be2c487c146b213ef7b64ddc7c91956aa6e8b5d6207b217061312286b37R47-R48

If this is confusing I can try to update the phrasing

ZacBlanco avatar Feb 13 '24 16:02 ZacBlanco

I can't find where they explicitly document the binary format--is there a link for that?

tdcmeehan avatar Feb 13 '24 20:02 tdcmeehan

The binary format is not documented unfortunately. However, all versions (cpp, java, go, etc) of the datasketches libraries can serialize and deserialize the sketches. e.g. java can deserialize cpp serialized sketches, and vice versa. AFAIK this is the case for all sketches coming from the datasketches library.

ZacBlanco avatar Feb 13 '24 21:02 ZacBlanco

Just for some continuity on the discussion of sketch serialized format compatibility, we brought this to their community in hopes to clarify the guarantees that the maintainers provide. This resulted in some changes to the datasketches documentation in https://github.com/apache/datasketches-website/pull/162 and subsequently https://github.com/apache/datasketches-website/pull/163

ZacBlanco avatar Mar 04 '24 18:03 ZacBlanco

Possible follow up: if KLL has a provable error bound, could we use it to replace Q-Digest inside of approx_distinct? The only reason approx_distinct uses Q-Digest under the hood, in spite of its poor performance compared to T-Digest, is it has a provable error bound.

tdcmeehan avatar Mar 04 '24 18:03 tdcmeehan

I'm not familiar with the code for approx_distinct, but if the KLL sketch answers the same question as the Q-digest (e.g. what percentile does value X fall? or the inverse), then we should be able to replace it. Another option is you could replace the approx_distinct implementation entirely with a Theta sketch. I can't speak how the accuracy of a Theta sketch compares to the current approx_distinct implementation though.

This could be a good project/experiment for a new Presto contributor

ZacBlanco avatar Mar 04 '24 19:03 ZacBlanco

The binary format is defined here: https://github.com/apache/datasketches-java/blob/master/src/main/java/org/apache/datasketches/kll/KllPreambleUtil.java

Obviously the format for items cannot be defined for arbitrary type T, and the exact size of the levels array is not fixed. But the serialized header is defined, as is the order of the non-fixed-size/format data.

jmalkin avatar Mar 04 '24 19:03 jmalkin

@Yuhta, I don't understand your comment above (Jan 4th) "The serialized form is not very compact and can occupy more memory than it needs."

The KLL sketch always serializes (toByteArray()) into a compact form, which has no wasted space, and grows very gradually based on the number of items fed to and retained by the sketch. You can see the effect of this in the plot copied from our website above. The blue curve starts very small (8 bytes) and grows gradually until it reaches ~2.5K bytes at about 1M items.

It is true that if the sketch image were compressed it would become even smaller, but that can be costly in terms of time. That is not an option we currently offer as part of the sketch, but as long as one is using lossless compression algorithms, it is something you could do yourself.

If you could elaborate you concern a bit more it would help us understand the issue.

Thanks!

leerho avatar Mar 04 '24 21:03 leerho