feat: optimize and unparse grouping
Which issue does this PR close?
Rationale for this change
first, it seems that grouping udaf document is not correct.
and for aggregation with grouping,
e.g.
CREATE TABLE test (c1 VARCHAR,c2 VARCHAR,c3 INT) as values
('a','A',1), ('b','B',2);
EXPLAIN FORMAT INDENT select
c1,
c2,
CASE WHEN grouping(c1) = 1 THEN sum(c3) ELSE NULL END as gx,
grouping(c1) as g0,
grouping(c2) as g1,
grouping(c1, c2) as g2,
grouping(c2, c1) as g3
from
test
group by
grouping sets (
(c1, c2),
(c1),
(c2),
()
);
current logical plan is:
| logical_plan | Projection: test.c1, test.c2, CASE WHEN grouping(test.c1) = Int32(1) THEN sum(test.c3) ELSE Int64(NULL) END AS gx, grouping(test.c1) AS g0, grouping(test.c2) AS g1, grouping(test.c1,test.c2) AS g2, grouping(test.c2,test.c1) AS g3 |
| | Projection: test.c1, test.c2, CAST(__common_expr_1 AS Int32) AS grouping(test.c1), sum(test.c3), CAST(__common_expr_2 AS Int32) AS grouping(test.c2), CAST(__grouping_id AS Int32) AS grouping(test.c1,test.c2), CAST(__common_expr_1 | __common_expr_2 << UInt8(1) AS Int32) AS grouping(test.c2,test.c1) |
| | Projection: __grouping_id & UInt8(2) >> UInt8(1) AS __common_expr_1, __grouping_id & UInt8(1) AS __common_expr_2, test.c1, test.c2, __grouping_id, sum(test.c3) |
| | Aggregate: groupBy=[[GROUPING SETS ((test.c1, test.c2), (test.c1), (test.c2), ())]], aggr=[[sum(CAST(test.c3 AS Int64))]] |
| | TableScan: test projection=[c1, c2, c3]
the problems are:
-
there are three projections. for bitwise operation, there's no benifit for extra projection.
-
it makes grouping level optimization very hard. for example,
select case when grouping(c1) = 1 then sum(c2) else null end from test group by grouping sets (...)we only need to calculate sum(c2) in when grouping(c1) = 1, this is a very useful optimization trick. I'm also using this trick to optimizing sql with multiple count distinct.
-
unparsing is not supported. Internal("Tried to unproject column referring to internal grouping id") will be thrown.
What changes are included in this PR?
- create a grouping udf.
- modify analyzer, replace grouping_udaf with grouping udf
- change it back, when unparsing.
now, the logical plan is:
| logical_plan | Projection: test.c1, test.c2, CASE WHEN grouping(test.c1) = Int32(1) THEN sum(test.c3) ELSE Int64(NULL) END AS gx, grouping(test.c1) AS g0, grouping(test.c2) AS g1, grouping(test.c1,test.c2) AS g2, grouping(test.c2,test.c1) AS g3 |
| | Projection: test.c1, test.c2, grouping(__grouping_id, List([1])) AS grouping(test.c1), sum(test.c3), grouping(__grouping_id, List([0])) AS grouping(test.c2), grouping(__grouping_id) AS grouping(test.c1,test.c2), grouping(__grouping_id, List([0, 1])) AS grouping(test.c2,test.c1) |
| | Aggregate: groupBy=[[GROUPING SETS ((test.c1, test.c2), (test.c1), (test.c2), ())]], aggr=[[sum(CAST(test.c3 AS Int64))]] |
| | TableScan: test projection=[c1, c2, c3]
there are only two projections. and unparsing is supported.
Are these changes tested?
UT
Are there any user-facing changes?
No
Thanks @chenkovsky -- can. you find the original PR that added this GROUPING function and perhaps @ mention the author to see if they have any feedback / could help with review?
@eejbyfeldt @comphead could you please help me review this PR?
it's related to #12704
I did not look closely at yet since I have not really contributed here in months.
there are three projections. for bitwise operation, there's no benifit for extra projection.
This extra projection is introduced by CommonSubexprEliminate. I understand that the changes make it so that they do not interact in the presented query, but is this really a general fix of that behavior?
But the goal of leaving more structure in logical plan after resolving the grouping expr so that optimization and unparsing sounds reasonable to me.
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
keep it open
Thanks @chenkovsky
@eejbyfeldt it looks like you contributed the initial implementation of grouping in https://github.com/apache/datafusion/pull/12704
Could you help review this PR?
Hi @alamb and @chenkovsky! I am running into the same issue @lorenarosati brought up a couple months ago (https://github.com/apache/datafusion/issues/16590) and have not found a workaround solution yet. Just wanted to check in, what's the status of this PR?
Hi @alamb and @chenkovsky! I am running into the same issue @lorenarosati brought up a couple months ago (#16590) and have not found a workaround solution yet. Just wanted to check in, what's the status of this PR?
I am waiting for someone to help review the PR. Can you help here @Slimsammylim ? Perhaps verify that this solution fixes your issue?
Hi, I ran my code using this branch and unfortunately it did not solve my issue (https://github.com/apache/datafusion/issues/16590).
Hi, I ran my code using this branch and unfortunately it did not solve my issue (https://github.com/apache/datafusion/issues/16590).
hi,could you please push your code, then i can see what's your issue. currently i havent touch substrait, some extra codes should be modified to solve issue 16590.
I added substrait support. @Slimsammylim please try again
Hi! I tried again and but I keep getting the same error message.
Hi! I am not running SQL. The type of plan that is failing for me is the same as the one Victor mentions in https://github.com/apache/datafusion/pull/17909. Lorena mentions a similar plan in https://github.com/apache/datafusion/issues/16590.
Hi! I am not running SQL. The type of plan that is failing for me is the same as the one Victor mentions in #17909. Lorena mentions a similar plan in #16590.
@Slimsammylim I see, I only tested roundtrip, you only consume substrait plan. I think the simplest workaround is moving __grouping_id to the end of schema, then index will be correct. but It's too tricky. also there's another issue caused by __grouping_id #16983. It's better to remove __grouping_id. I will work on it
@Slimsammylim I created another PR. #17961 , please take a snapshot. it will not touch grouping udaf in logical plan level. so It will support substrait by design.
In favour of #17961, close this