gpdb
gpdb copied to clipboard
Get streaming hash aggregate back
trafficstars
In multi-phase aggregate, this feature streams entries of the bottom phase when out of memory instead of spilling to disk, it avoids the disk I/O and roughly deduplicates to save the network I/O.
The planner always supports streaming aggregate and there are lots of cases, but the executor part was lost while merging PostgreSQL 12, this commit gets it back.
The planner always supports streaming aggregate and there are lots of cases, but the executor just did nothing special.
create table dqa_t1 (d int, i int, c char, dt date);
insert into dqa_t1 select i%23, i%12, (i%10) || '', '2009-06-10'::date + ( (i%34) || ' days')::interval
from generate_series(0, 99999999) i;
SET statement_mem='1000kB';
explain analyze select count(distinct i) from dqa_t1 group by c;
gpadmin=# explain select count(distinct i) from dqa_t1 group by c;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Gather Motion 3:1 (slice1; segments: 3) (cost=0.00..9710.67 rows=10 width=8)
-> HashAggregate (cost=0.00..9710.67 rows=4 width=8)
Group Key: c
-> Redistribute Motion 3:3 (slice2; segments: 3) (cost=0.00..9710.67 rows=23 width=6)
Hash Key: c
-> GroupAggregate (cost=0.00..9710.67 rows=23 width=6)
Group Key: c, i
-> Sort (cost=0.00..9710.67 rows=23 width=6)
Sort Key: c, i
-> Redistribute Motion 3:3 (slice3; segments: 3) (cost=0.00..9710.66 rows=23 width=6)
Hash Key: c, i
-> Streaming HashAggregate (cost=0.00..9710.66 rows=23 width=6)
Group Key: c, i
-> Seq Scan on dqa_t1 (cost=0.00..1237.68 rows=33333680 width=6)
Optimizer: Pivotal Optimizer (GPORCA)
(15 rows)
The above case generates a streaming plan, but has no big duration difference between streaming execution and not.