gpdb icon indicating copy to clipboard operation
gpdb copied to clipboard

Get streaming hash aggregate back

Open adam8157 opened this issue 3 years ago • 2 comments
trafficstars

In multi-phase aggregate, this feature streams entries of the bottom phase when out of memory instead of spilling to disk, it avoids the disk I/O and roughly deduplicates to save the network I/O.

The planner always supports streaming aggregate and there are lots of cases, but the executor part was lost while merging PostgreSQL 12, this commit gets it back.

adam8157 avatar Aug 17 '22 08:08 adam8157

The planner always supports streaming aggregate and there are lots of cases, but the executor just did nothing special.

adam8157 avatar Aug 17 '22 09:08 adam8157

create table dqa_t1 (d int, i int, c char, dt date);
insert into dqa_t1 select i%23, i%12, (i%10) || '', '2009-06-10'::date + ( (i%34) || ' days')::interval
from generate_series(0, 99999999) i;

SET statement_mem='1000kB';
explain analyze select count(distinct i) from dqa_t1 group by c;
gpadmin=# explain select count(distinct i) from dqa_t1 group by c;
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..9710.67 rows=10 width=8)
   ->  HashAggregate  (cost=0.00..9710.67 rows=4 width=8)
         Group Key: c
         ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=0.00..9710.67 rows=23 width=6)
               Hash Key: c
               ->  GroupAggregate  (cost=0.00..9710.67 rows=23 width=6)
                     Group Key: c, i
                     ->  Sort  (cost=0.00..9710.67 rows=23 width=6)
                           Sort Key: c, i
                           ->  Redistribute Motion 3:3  (slice3; segments: 3)  (cost=0.00..9710.66 rows=23 width=6)
                                 Hash Key: c, i
                                 ->  Streaming HashAggregate  (cost=0.00..9710.66 rows=23 width=6)
                                       Group Key: c, i
                                       ->  Seq Scan on dqa_t1  (cost=0.00..1237.68 rows=33333680 width=6)
 Optimizer: Pivotal Optimizer (GPORCA)
(15 rows)

The above case generates a streaming plan, but has no big duration difference between streaming execution and not.

adam8157 avatar Aug 29 '22 07:08 adam8157