trino
trino copied to clipboard
Support arbitrary aggregation functions during ANALYZE (v2)
(an alternative to https://github.com/trinodb/trino/pull/14220, maintaining backwards compatibility)
A connector may ask engine to collect anything defined by ColumnStatisticType SPI enum. This is convenient, but sometimes a connector needs to provide its own way of calculating statistics.
For example, Iceberg statistics include
apache-datasketches-theta-v1blob typeA serialized form of a "compact" Theta sketch produced by the Apache DataSketches library. The sketch is obtained by constructing Alpha family sketch with default seed, and feeding it with individual distinct values converted to bytes using Iceberg's single-value serialization.
This has two components which are not supported today
- a new data sketch with a specific configuration (so that results can be shared with different query engines)
- a well-defined input pre-processing, which relies on existing Iceberg concepts, which are alien to Trino engine
This PR addresses the first limitation. It allows the connector to pick an aggregation function of its choice for statistics collection.
I think you don't need to, currently.
CI https://github.com/trinodb/trino/issues/14239
Can all the existing StatsiticsTypes be defined as aggregations? I'm just thinking if the statisticsType field of ColumnStatisticMetadata can be broadened to an aggregation instead of needing to support both separately.
@alexjo2144 this is exactly what i did initially, i.e. in https://github.com/trinodb/trino/pull/14220
- It's nice, because it allows to decomission the old statistics definition mechanism quickly.
- it also makes sure the new mechanism is very well exercised
however
- it is non-backwards compatible change
- it inflates PR scope (https://github.com/trinodb/trino/pull/14220 modifies 42 files while this modifies only 10)
- it requires "publishing" internal aggregations like
$internal$max_data_size_for_statsand$internal$sum_data_size_for_stats- and these should eventually be removed, but this requires
- being able to compose aggregation like sum + datasize, requires eg https://github.com/trinodb/trino/pull/14222
- implementing something like
estimated_data_sizescalar function, equivalent ofio.trino.spi.block.Block#getEstimatedDataSizeForStats- i started doing this but this is a piece of work; a scalar cannot just call
Block#getEstimatedDataSizeForStatsmethod
- i started doing this but this is a piece of work; a scalar cannot just call
- and these should eventually be removed, but this requires
Thus
- i think we should eventually remove `ColumnStatisticType
- i think we should eventually remove/decompose
$internal$max_data_size_for_statsand$internal$sum_data_size_for_statsas described above - I think these two things should be follow-ups, not to block the main objective -- being able to calculate Theta sketches for Iceberg NDV stats