datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

feat: optimize and unparse grouping

Open chenkovsky opened this issue 7 months ago • 4 comments

Which issue does this PR close?

Rationale for this change

first, it seems that grouping udaf document is not correct.

and for aggregation with grouping,

e.g.

CREATE TABLE test (c1 VARCHAR,c2 VARCHAR,c3 INT) as values
('a','A',1), ('b','B',2);

EXPLAIN FORMAT INDENT select
  c1,
  c2,
  CASE WHEN grouping(c1) = 1 THEN sum(c3) ELSE NULL END as gx,
  grouping(c1) as g0,
  grouping(c2) as g1,
  grouping(c1, c2) as g2,
  grouping(c2, c1) as g3
from
  test
group by
  grouping sets (
    (c1, c2),
    (c1),
    (c2),
    ()
  );

current logical plan is:

| logical_plan  | Projection: test.c1, test.c2, CASE WHEN grouping(test.c1) = Int32(1) THEN sum(test.c3) ELSE Int64(NULL) END AS gx, grouping(test.c1) AS g0, grouping(test.c2) AS g1, grouping(test.c1,test.c2) AS g2, grouping(test.c2,test.c1) AS g3                                                                                                              |
|               |   Projection: test.c1, test.c2, CAST(__common_expr_1 AS Int32) AS grouping(test.c1), sum(test.c3), CAST(__common_expr_2 AS Int32) AS grouping(test.c2), CAST(__grouping_id AS Int32) AS grouping(test.c1,test.c2), CAST(__common_expr_1 | __common_expr_2 << UInt8(1) AS Int32) AS grouping(test.c2,test.c1)                                       |
|               |     Projection: __grouping_id & UInt8(2) >> UInt8(1) AS __common_expr_1, __grouping_id & UInt8(1) AS __common_expr_2, test.c1, test.c2, __grouping_id, sum(test.c3)                                                                                                                                                                                |
|               |       Aggregate: groupBy=[[GROUPING SETS ((test.c1, test.c2), (test.c1), (test.c2), ())]], aggr=[[sum(CAST(test.c3 AS Int64))]]                                                                                                                                                                                                                    |
|               |         TableScan: test projection=[c1, c2, c3]  

the problems are:

  1. there are three projections. for bitwise operation, there's no benifit for extra projection.

  2. it makes grouping level optimization very hard. for example,

     select  case when grouping(c1) = 1 then sum(c2) else null end from test group by grouping sets (...)
    

    we only need to calculate sum(c2) in when grouping(c1) = 1, this is a very useful optimization trick. I'm also using this trick to optimizing sql with multiple count distinct.

  3. unparsing is not supported. Internal("Tried to unproject column referring to internal grouping id") will be thrown.

What changes are included in this PR?

  1. create a grouping udf.
  2. modify analyzer, replace grouping_udaf with grouping udf
  3. change it back, when unparsing.

now, the logical plan is:

| logical_plan  | Projection: test.c1, test.c2, CASE WHEN grouping(test.c1) = Int32(1) THEN sum(test.c3) ELSE Int64(NULL) END AS gx, grouping(test.c1) AS g0, grouping(test.c2) AS g1, grouping(test.c1,test.c2) AS g2, grouping(test.c2,test.c1) AS g3                                                                             |
|               |   Projection: test.c1, test.c2, grouping(__grouping_id, List([1])) AS grouping(test.c1), sum(test.c3), grouping(__grouping_id, List([0])) AS grouping(test.c2), grouping(__grouping_id) AS grouping(test.c1,test.c2), grouping(__grouping_id, List([0, 1])) AS grouping(test.c2,test.c1)                          |
|               |     Aggregate: groupBy=[[GROUPING SETS ((test.c1, test.c2), (test.c1), (test.c2), ())]], aggr=[[sum(CAST(test.c3 AS Int64))]]                                                                                                                                                                                     |
|               |       TableScan: test projection=[c1, c2, c3]                                                                                                                                                                             

there are only two projections. and unparsing is supported.

Are these changes tested?

UT

Are there any user-facing changes?

No

chenkovsky avatar May 23 '25 07:05 chenkovsky

Thanks @chenkovsky -- can. you find the original PR that added this GROUPING function and perhaps @ mention the author to see if they have any feedback / could help with review?

alamb avatar May 24 '25 11:05 alamb

@eejbyfeldt @comphead could you please help me review this PR?

chenkovsky avatar May 24 '25 12:05 chenkovsky

it's related to #12704

chenkovsky avatar May 24 '25 12:05 chenkovsky

I did not look closely at yet since I have not really contributed here in months.

there are three projections. for bitwise operation, there's no benifit for extra projection.

This extra projection is introduced by CommonSubexprEliminate. I understand that the changes make it so that they do not interact in the presented query, but is this really a general fix of that behavior?

But the goal of leaving more structure in logical plan after resolving the grouping expr so that optimization and unparsing sounds reasonable to me.

eejbyfeldt avatar Jun 17 '25 18:06 eejbyfeldt

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Sep 14 '25 02:09 github-actions[bot]

keep it open

chenkovsky avatar Sep 14 '25 08:09 chenkovsky

Thanks @chenkovsky

@eejbyfeldt it looks like you contributed the initial implementation of grouping in https://github.com/apache/datafusion/pull/12704

Could you help review this PR?

alamb avatar Sep 16 '25 11:09 alamb

Hi @alamb and @chenkovsky! I am running into the same issue @lorenarosati brought up a couple months ago (https://github.com/apache/datafusion/issues/16590) and have not found a workaround solution yet. Just wanted to check in, what's the status of this PR?

Slimsammylim avatar Oct 02 '25 14:10 Slimsammylim

Hi @alamb and @chenkovsky! I am running into the same issue @lorenarosati brought up a couple months ago (#16590) and have not found a workaround solution yet. Just wanted to check in, what's the status of this PR?

I am waiting for someone to help review the PR. Can you help here @Slimsammylim ? Perhaps verify that this solution fixes your issue?

alamb avatar Oct 02 '25 21:10 alamb

Hi, I ran my code using this branch and unfortunately it did not solve my issue (https://github.com/apache/datafusion/issues/16590).

Slimsammylim avatar Oct 03 '25 14:10 Slimsammylim

Hi, I ran my code using this branch and unfortunately it did not solve my issue (https://github.com/apache/datafusion/issues/16590).

hi,could you please push your code, then i can see what's your issue. currently i havent touch substrait, some extra codes should be modified to solve issue 16590.

chenkovsky avatar Oct 03 '25 23:10 chenkovsky

I added substrait support. @Slimsammylim please try again

chenkovsky avatar Oct 04 '25 11:10 chenkovsky

Hi! I tried again and but I keep getting the same error message.

Slimsammylim avatar Oct 06 '25 13:10 Slimsammylim

Hi! I am not running SQL. The type of plan that is failing for me is the same as the one Victor mentions in https://github.com/apache/datafusion/pull/17909. Lorena mentions a similar plan in https://github.com/apache/datafusion/issues/16590.

Slimsammylim avatar Oct 06 '25 15:10 Slimsammylim

Hi! I am not running SQL. The type of plan that is failing for me is the same as the one Victor mentions in #17909. Lorena mentions a similar plan in #16590.

@Slimsammylim I see, I only tested roundtrip, you only consume substrait plan. I think the simplest workaround is moving __grouping_id to the end of schema, then index will be correct. but It's too tricky. also there's another issue caused by __grouping_id #16983. It's better to remove __grouping_id. I will work on it

chenkovsky avatar Oct 07 '25 01:10 chenkovsky

@Slimsammylim I created another PR. #17961 , please take a snapshot. it will not touch grouping udaf in logical plan level. so It will support substrait by design.

chenkovsky avatar Oct 08 '25 10:10 chenkovsky

In favour of #17961, close this

chenkovsky avatar Oct 19 '25 14:10 chenkovsky