datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

the speed of `group by cube` is slow, the equivalent `union all` even slower

Open l1t1 opened this issue 1 year ago • 0 comments

Is your feature request related to a problem or challenge?

it took about 280s to group by 4 columns on a table of 1e8 rows

> create table ren as select
((random()*1000)::int%2+1)::int c1,
((random()*1000)::int%100+1)::int c2,
((random()*1000)::int%30+1)::int c3,
((random()*1000)::int%60+1)::int c4,
((random()*1000)::int%5+1)::int c5
from unnest(generate_series(1,1e8::int));
0 row(s) fetched.
Elapsed 317.856 seconds.

> create table rc1c2c3c4 as select c1,c2,c3,c4,sum(1)cnt from ren group by cube(c1,c2,c3,c4);
0 row(s) fetched.
Elapsed 281.501 seconds.

> create table t as select c1,c2,c3,c4,sum(1)cnt from ren group by c1,c2,c3,c4;
0 row(s) fetched.
Elapsed 38.093 seconds.

> create table rc1c2c3c4m as
select c1,c2,c3,c4,sum(1)cnt from ren group by c1,c2,c3,c4 union all
select c1,null c2,null c3,null c4,sum(1)cnt from ren group by c1 union all
select null c1,c2,null c3,null c4,sum(1)cnt from ren group by c2 union all
select null c1,null c2,c3,null c4,sum(1)cnt from ren group by c3 union all
select null c1,null c2,null c3,c4,sum(1)cnt from ren group by c4 union all
select c1,c2,null c3,null c4,sum(1)cnt from ren group by c1,c2 union all
select c1,null c2,c3,null c4,sum(1)cnt from ren group by c1,c3 union all
select c1,null c2,null c3,c4,sum(1)cnt from ren group by c1,c4 union all
select null c1,c2,c3,null c4,sum(1)cnt from ren group by c2,c3 union all
select null c1,c2,null c3,c4,sum(1)cnt from ren group by c2,c4 union all
select null c1,null c2,c3,c4,sum(1)cnt from ren group by c3,c4 union all
select c1,c2,c3,null c4,sum(1)cnt from ren group by c1,c2,c3 union all
select c1,c2,null c3,c4,sum(1)cnt from ren group by c1,c2,c4 union all
select c1,null c2,c3,c4,sum(1)cnt from ren group by c1,c3,c4 union all
select null c1,c2,c3,c4,sum(1)cnt from ren group by c2,c3,c4 union all
select null c1,null c2,null c3,null c4,sum(1)cnt from ren  ;
0 row(s) fetched.
Elapsed 1077.624 seconds.

the test of duckdb is relatively fast, though its group by cube is slower than the equivalent union all

https://github.com/duckdb/duckdb/issues/10450

Describe the solution you'd like

the result of cube should use the results of one group by statement as temp table and query it subsequently. such as https://github.com/duckdb/duckdb/issues/10451 said

create table t as select c1,c2,c3,c4,sum(1)cnt from ren group by c1,c2,c3,c4;
--Run Time (s): real 6.829 user 70.730853 sys 10.280466
create table rc1c2c3c4m3 as 
select * from t union all
select c1,null,null,null,sum(cnt)cnt from t group by c1 union all
select null,c2,null,null,sum(cnt)cnt from t group by c2 union all
select null,null,c3,null,sum(cnt)cnt from t group by c3 union all
select null,null,null,c4,sum(cnt)cnt from t group by c4 union all
select c1,c2,null,null,sum(cnt)cnt from t group by c1,c2 union all
select c1,null,c3,null,sum(cnt)cnt from t group by c1,c3 union all
select c1,null,null,c4,sum(cnt)cnt from t group by c1,c4 union all
select null,c2,c3,null,sum(cnt)cnt from t group by c2,c3 union all
select null,c2,null,c4,sum(cnt)cnt from t group by c2,c4 union all
select null,null,c3,c4,sum(cnt)cnt from t group by c3,c4 union all
select c1,c2,c3,null,sum(cnt)cnt from t group by c1,c2,c3 union all
select c1,c2,null,c4,sum(cnt)cnt from t group by c1,c2,c4 union all
select c1,null,c3,c4,sum(cnt)cnt from t group by c1,c3,c4 union all
select null,c2,c3,c4,sum(cnt)cnt from t group by c2,c3,c4 union all
select null,null,null,null,sum(cnt)cnt from t    ;

Describe alternatives you've considered

No response

Additional context

No response

l1t1 avatar Apr 26 '24 05:04 l1t1