trino Support arbitrary aggregation functions during ANALYZE (v2)

(an alternative to https://github.com/trinodb/trino/pull/14220, maintaining backwards compatibility)

A connector may ask engine to collect anything defined by ColumnStatisticType SPI enum. This is convenient, but sometimes a connector needs to provide its own way of calculating statistics.

For example, Iceberg statistics include

apache-datasketches-theta-v1 blob type

A serialized form of a "compact" Theta sketch produced by the Apache DataSketches library. The sketch is obtained by constructing Alpha family sketch with default seed, and feeding it with individual distinct values converted to bytes using Iceberg's single-value serialization.

This has two components which are not supported today

a new data sketch with a specific configuration (so that results can be shared with different query engines)
a well-defined input pre-processing, which relies on existing Iceberg concepts, which are alien to Trino engine

This PR addresses the first limitation. It allows the connector to pick an aggregation function of its choice for statistics collection.

Sep 21 '22 12:09 findepi

I think you don't need to, currently.

Sep 21 '22 15:09 findepi

CI https://github.com/trinodb/trino/issues/14239

Sep 21 '22 15:09 findepi

Can all the existing StatsiticsTypes be defined as aggregations? I'm just thinking if the statisticsType field of ColumnStatisticMetadata can be broadened to an aggregation instead of needing to support both separately.

Sep 21 '22 19:09 alexjo2144

@alexjo2144 this is exactly what i did initially, i.e. in https://github.com/trinodb/trino/pull/14220

It's nice, because it allows to decomission the old statistics definition mechanism quickly.
it also makes sure the new mechanism is very well exercised

however

it is non-backwards compatible change
it inflates PR scope (https://github.com/trinodb/trino/pull/14220 modifies 42 files while this modifies only 10)
it requires "publishing" internal aggregations like $internal$max_data_size_for_stats and $internal$sum_data_size_for_stats
- and these should eventually be removed, but this requires
  - being able to compose aggregation like sum + datasize, requires eg https://github.com/trinodb/trino/pull/14222
  - implementing something like estimated_data_size scalar function, equivalent of io.trino.spi.block.Block#getEstimatedDataSizeForStats
    - i started doing this but this is a piece of work; a scalar cannot just call Block#getEstimatedDataSizeForStats method

Thus

i think we should eventually remove `ColumnStatisticType
i think we should eventually remove/decompose $internal$max_data_size_for_stats and $internal$sum_data_size_for_stats as described above
I think these two things should be follow-ups, not to block the main objective -- being able to calculate Theta sketches for Iceberg NDV stats

Sep 21 '22 19:09 findepi

trino
trino copied to clipboard

Support arbitrary aggregation functions during ANALYZE (v2)

`apache-datasketches-theta-v1` blob type

trino trino copied to clipboard

Support arbitrary aggregation functions during ANALYZE (v2)

apache-datasketches-theta-v1 blob type

trino
trino copied to clipboard

`apache-datasketches-theta-v1` blob type