Push down group by for partition columns
Push down MIN, MAX, and COUNT with GROUP BY when the GROUP BY is on partition columns.
For example:
```sql
CREATE TABLE test (id LONG, ts TIMESTAMP, data INT) USING iceberg PARTITIONED BY (id, ts);

SELECT MIN(data), MAX(data), COUNT(data), COUNT(*) FROM test GROUP BY id, ts;
```
I have the following changes:
- Added the `GroupBy`, `UnboundGroupBy`, and `BoundGroupBy` Expressions.
- In the Aggregators, take `GroupBy` into consideration. Using `MaxAggregator` as an example (see the first sketch after this list):
  - Add `List<BoundGroupBy>` to the constructor.
  - Add `private Map<StructLike, T> maxP`, which holds the max value for each partition.
  - The key of the above Map is `AggregateEvaluator.ArrayStructLike`, which contains the partition values.
  - Add `update(R value, StructLike partitionKey)` to update the value for a specific partition.
  - Add `Map<StructLike, R> currentPartition()` to get the current value for each partition.
- Build the Scan schema as: group by columns + aggregates (this is what Spark is expecting). Using `SELECT MIN(data), MAX(data), COUNT(data), COUNT(*) FROM test GROUP BY id, ts` as an example, the schema is `LONG, TIMESTAMP, INT, INT, LONG, LONG`, which corresponds to `id, ts, MIN(data), MAX(data), COUNT(data), COUNT(*)` (see the second sketch below).
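To illustrate the per-partition aggregator bookkeeping, here is a minimal, self-contained sketch. The method names `update` and `currentPartition` follow the description above, but the body is a simplified illustration, not the actual implementation: it is not generic over Iceberg types, and a plain `List<Object>` stands in for `AggregateEvaluator.ArrayStructLike` as the partition key.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch of the per-partition MAX bookkeeping described above.
class PartitionedMaxAggregator<T extends Comparable<T>> {
  // One running max per partition, keyed by the partition values.
  private final Map<List<Object>, T> maxP = new HashMap<>();

  // Corresponds to update(R value, StructLike partitionKey):
  // fold a new value into the running max for its partition.
  void update(T value, List<Object> partitionKey) {
    maxP.merge(partitionKey, value, (current, v) -> current.compareTo(v) >= 0 ? current : v);
  }

  // Corresponds to currentPartition(): expose the per-partition results
  // so the scan can emit one row per group.
  Map<List<Object>, T> currentPartition() {
    return maxP;
  }
}
```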
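And a hypothetical sketch of assembling the scan output rows in the group-by-columns-then-aggregates layout Spark expects. The `min`, `max`, `countData`, and `countStar` maps are assumed to be the `currentPartition()` results of per-partition aggregators like the one above; none of these names come from the actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class GroupByRowAssembler {
  // For SELECT MIN(data), MAX(data), COUNT(data), COUNT(*) ... GROUP BY id, ts,
  // each output row is laid out as: id, ts, MIN(data), MAX(data), COUNT(data), COUNT(*),
  // matching the scan schema LONG, TIMESTAMP, INT, INT, LONG, LONG.
  static List<List<Object>> assemble(
      Map<List<Object>, Integer> min,
      Map<List<Object>, Integer> max,
      Map<List<Object>, Long> countData,
      Map<List<Object>, Long> countStar) {
    List<List<Object>> rows = new ArrayList<>();
    for (Map.Entry<List<Object>, Integer> e : max.entrySet()) {
      List<Object> row = new ArrayList<>(e.getKey()); // group by columns: id, ts
      row.add(min.get(e.getKey()));                   // MIN(data)
      row.add(e.getValue());                          // MAX(data)
      row.add(countData.get(e.getKey()));             // COUNT(data)
      row.add(countStar.get(e.getKey()));             // COUNT(*)
      rows.add(row);
    }
    return rows;
  }
}
```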
cc @rdblue @aokolnychyi
Oops, I didn't mean to close this! I want to work on getting it in next