iceberg icon indicating copy to clipboard operation
iceberg copied to clipboard

Push down group by for partition columns

Open huaxingao opened this issue 2 years ago • 3 comments

Push down min/max/count with group by if group by is on partition columns

For example:

CREATE TABLE test (id LONG, ts TIMESTAMP, data INT) USING iceberg PARTITIONED BY (id, ts);
SELECT MIN(data), MAX(data), COUNT(data), COUNT(*) FROM test GROUP BY id, ts;

I have the following changes:

  1. Added Expression GroupBy, UnboundGroupBy, BoundGroupBy

  2. In Aggregator, I will take consideration of GroupBy. Use MaxAggregator as an example:

    • Add List<BoundGroupBy> in the constructor
    • Add private Map<StructLike, T> maxP which contains max value for each of the partition.
    • The key of the above Map is AggregateEvaluator.ArrayStructLike, which contains the partition values.
    • Add update(R value, StructLike partitionKey) to update value for specific partitions
    • Add Map<StructLike, R> currentPartition() to get current value for each partitions
  3. Build the Scan schema to be: group by columns + aggregates (this is what Spark is expecting) Use SELECT MIN(data), MAX(data), COUNT(data), COUNT(*) FROM test GROUP BY id, ts as an example: the schema is LONG, TIMESTAMP, INT, INT, LONG, LONG, which corresponds to id, ts, MIN(data), MAX(data), COUNT(data), COUNT(*)

huaxingao avatar Mar 02 '23 04:03 huaxingao

cc @rdblue @aokolnychyi

huaxingao avatar Mar 02 '23 17:03 huaxingao

Oops, I didn't mean to close this! I want to work on getting it in next

rdblue avatar Feb 02 '24 16:02 rdblue

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot] avatar Aug 28 '24 00:08 github-actions[bot]

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions[bot] avatar Sep 04 '24 00:09 github-actions[bot]