gpdb Get streaming hash aggregate back

trafficstars

In multi-phase aggregate, this feature streams entries of the bottom phase when out of memory instead of spilling to disk, it avoids the disk I/O and roughly deduplicates to save the network I/O.

The planner always supports streaming aggregate and there are lots of cases, but the executor part was lost while merging PostgreSQL 12, this commit gets it back.

Aug 17 '22 08:08 adam8157

The planner always supports streaming aggregate and there are lots of cases, but the executor just did nothing special.

Aug 17 '22 09:08 adam8157

create table dqa_t1 (d int, i int, c char, dt date);
insert into dqa_t1 select i%23, i%12, (i%10) || '', '2009-06-10'::date + ( (i%34) || ' days')::interval
from generate_series(0, 99999999) i;

SET statement_mem='1000kB';
explain analyze select count(distinct i) from dqa_t1 group by c;

gpadmin=# explain select count(distinct i) from dqa_t1 group by c;
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..9710.67 rows=10 width=8)
   ->  HashAggregate  (cost=0.00..9710.67 rows=4 width=8)
         Group Key: c
         ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=0.00..9710.67 rows=23 width=6)
               Hash Key: c
               ->  GroupAggregate  (cost=0.00..9710.67 rows=23 width=6)
                     Group Key: c, i
                     ->  Sort  (cost=0.00..9710.67 rows=23 width=6)
                           Sort Key: c, i
                           ->  Redistribute Motion 3:3  (slice3; segments: 3)  (cost=0.00..9710.66 rows=23 width=6)
                                 Hash Key: c, i
                                 ->  Streaming HashAggregate  (cost=0.00..9710.66 rows=23 width=6)
                                       Group Key: c, i
                                       ->  Seq Scan on dqa_t1  (cost=0.00..1237.68 rows=33333680 width=6)
 Optimizer: Pivotal Optimizer (GPORCA)
(15 rows)

The above case generates a streaming plan, but has no big duration difference between streaming execution and not.

Aug 29 '22 07:08 adam8157

gpdb gpdb copied to clipboard

Get streaming hash aggregate back

gpdb
gpdb copied to clipboard