gpdb icon indicating copy to clipboard operation
gpdb copied to clipboard

DISTINCT gives incorrect result in array_agg

Open orhankislal opened this issue 3 years ago • 2 comments
trafficstars

Bug Report

Greenplum version or build

PostgreSQL 12beta2 (Greenplum Database 7.0.0-alpha.0+dev.15588.gfad65d7 build commit:fad65d796f7d2b7c17884d67ed7b79b216f11a71) on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5), 64-bit compiled on Jul 30 2022 08:03:10

OS version and uname -a

CentOS Linux release 7.9.2009 Linux 9175ca1f-1cd7-47df-4867-6ab7cbcce4b4 4.15.0-142-generic #146~16.04.1-Ubuntu SMP Tue Apr 13 09:27:15 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

gcr.io/data-gpdb-public-images/gpdb7-centos7-build

autoconf options used ( config.status --config )

Concourse bin_gpdb: gs://pivotal-gpdb-concourse-resources-prod/server/published/master/server-rc-7.0.0-alpha.0+dev.15588.gfad65d7-rhel7_x86_64.tar.gz

Step to reproduce the behavior

DROP TABLE IF EXISTS abalone_special_char2;
CREATE TABLE abalone_special_char2 (
    id serial,
    sex character varying,
    length double precision,
    diameter double precision,
    height double precision,
    class bigint
);
COPY abalone_special_char2 (sex, length, diameter, height, class) FROM stdin WITH DELIMITER '|' NULL as '@';
FFa|0.475|0.37|0.125|2
FFs|0.55|0.44|0.15|0
FFd|0.565|0.44|0.155|2
MMa|0.44|0.365|0.125|0
MMs|0.475|0.37|0.125|2
MMd|0.47|0.355|0.100|1
F|0.55|0.44|0.15|0
\.

SELECT array_agg(DISTINCT (class)::TEXT) AS class, array_agg(DISTINCT (id)::TEXT) AS id,
array_agg(DISTINCT (sex)::TEXT) AS sex
  FROM abalone_special_char2;

Expected behavior

class should have 0,1,2 in some order

Actual behavior

class has duplicate values

   class   |       id        |             sex             
-----------+-----------------+-----------------------------
 {2,1,0,2} | {4,2,1,3,5,6,7} | {F,MMd,FFa,FFs,MMa,FFd,MMs}

Similar queries:

Not casting to TEXT:

SELECT array_agg(DISTINCT (class)) AS class, array_agg(DISTINCT (id)::TEXT) AS id,
array_agg(DISTINCT (sex)::TEXT) AS sex
  FROM abalone_special_char2;
    class    |       id        |             sex             
-------------+-----------------+-----------------------------
 {0,1,2,2,2} | {1,3,5,6,7,4,2} | {FFd,MMs,F,MMd,FFa,FFs,MMa}

Remove sex array_agg (correct result):

SELECT array_agg(DISTINCT (class)) AS class, array_agg(DISTINCT (id)::TEXT) AS id 
 FROM abalone_special_char2;         
 class  |       id        
---------+-----------------
{2,0,1} | {7,1,2,3,4,6,5}

Remove id array_agg:

SELECT array_agg(DISTINCT (class)) AS class,                                      
array_agg(DISTINCT (sex)::TEXT) AS sex
  FROM abalone_special_char2;
    class    |             sex             
-------------+-----------------------------
 {1,2,0,2,2} | {FFa,FFd,F,FFs,MMa,MMd,MMs}

If the table is distributed on class, it works correctly. I was not able to repro this locally on Mac but it consistently happens on concourse.

orhankislal avatar Sep 09 '22 18:09 orhankislal

Sees that there's a bug with our planner? As a work around, we can disable gp_enable_multiphase_agg temporarily.

postgres=# show gp_enable_multiphase_agg;
 gp_enable_multiphase_agg
--------------------------
 on
(1 row)

postgres=# explain SELECT array_agg(DISTINCT class) AS class, array_agg(DISTINCT sex::TEXT) as sex FROM abalone_special_char2;
                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=298.92..298.93 rows=1 width=64)
   ->  Gather Motion 3:1  (slice1; segments: 3)  (cost=298.84..298.89 rows=3 width=64)
         ->  Partial Aggregate  (cost=298.84..298.85 rows=1 width=64)
               ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=228.83..288.83 rows=2000 width=40)
                     Hash Key: class, sex, (AggExprId)
                     ->  Streaming HashAggregate  (cost=228.83..248.83 rows=2000 width=40)
                           Group Key: AggExprId, class, ((sex)::text)
                           ->  TupleSplit  (cost=145.33..173.17 rows=22267 width=40)
                                 Split by Col: (class), (((sex)::text))
                                 ->  Seq Scan on abalone_special_char2  (cost=0.00..145.33 rows=11133 width=40)
 Optimizer: Postgres query optimizer
(11 rows)

postgres=# SELECT array_agg(DISTINCT class) AS class, array_agg(DISTINCT sex::TEXT) as sex FROM abalone_special_char2;
    class    |             sex
-------------+-----------------------------
 {0,2,0,1,2} | {FFa,FFd,F,FFs,MMa,MMd,MMs}
(1 row)

postgres=# set gp_enable_multiphase_agg='off';
SET
postgres=# explain SELECT array_agg(DISTINCT class) AS class, array_agg(DISTINCT sex::TEXT) as sex FROM abalone_special_char2;
                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Aggregate  (cost=757.67..757.68 rows=1 width=64)
   ->  Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..590.67 rows=33400 width=40)
         ->  Seq Scan on abalone_special_char2  (cost=0.00..145.33 rows=11133 width=40)
 Optimizer: Postgres query optimizer
(4 rows)

postgres=# SELECT array_agg(DISTINCT class) AS class, array_agg(DISTINCT sex::TEXT) as sex FROM abalone_special_char2;
  class  |             sex
---------+-----------------------------
 {0,1,2} | {F,FFa,FFd,FFs,MMa,MMd,MMs}
(1 row)

Update: Postgres doesn't support parallel aggreation when there's DISTINCT or ORDER BY clause in aggregate function[1]. The plan generated by our planner doesn't seem quite right. Maybe we should disable multi-phase agg for such queries as well?

[1] https://www.postgresql.org/docs/current/parallel-plans.html#:~:text=PostgreSQL%20supports%20parallel%20aggregation%20by,as%20a%20Partial%20Aggregate%20node.

higuoxing avatar Sep 11 '22 09:09 higuoxing

The hash key of motion is suspicious.

# explain SELECT array_agg(DISTINCT (class)::TEXT) AS class, array_agg(DISTINCT (id)::TEXT) AS id,
array_agg(DISTINCT (sex)::TEXT) AS sex
  FROM abalone_special_char2;
                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=399.28..399.29 rows=1 width=96)
   ->  Gather Motion 3:1  (slice1; segments: 3)  (cost=399.17..399.22 rows=3 width=96)
         ->  Partial Aggregate  (cost=399.17..399.18 rows=1 width=96)
               ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=256.67..346.67 rows=3000 width=44)
                     Hash Key: ((class)::text), ((id)::text), sex, (AggExprId)    <----------------------------------    suspicious
                     ->  Streaming HashAggregate  (cost=256.67..286.67 rows=3000 width=44)
                           Group Key: AggExprId, ((class)::text), ((id)::text), ((sex)::text)
                           ->  TupleSplit  (cost=145.33..173.17 rows=33400 width=44)
                                 Split by Col: (((class)::text)), (((id)::text)), (((sex)::text))
                                 ->  Seq Scan on abalone_special_char2  (cost=0.00..145.33 rows=11133 width=44)
 Optimizer: Postgres query optimizer
(11 rows)

adam8157 avatar Sep 13 '22 08:09 adam8157

Another easy to reproduce test case:

postgres=# create table dqa_f3(a character varying, b bigint) distributed by (a);
insert into dqa_f3 values ('123', 2), ('213', 0), ('231', 2), ('312', 0), ('321', 2), ('132', 1), ('4', 0);
CREATE TABLE
INSERT 0 7
postgres=# select count(distinct (b)::text) as b, count(distinct (a)::text) as a from dqa_f3;
 b | a
---+---
 4 | 7
(1 row)

postgres=# explain (verbose, costs off) select count(distinct (b)::text) as b, count(distinct (a)::text) as a from dqa_f3;
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Finalize Aggregate
   Output: count(DISTINCT ((b)::text)), count(DISTINCT ((a)::text))
   ->  Gather Motion 3:1  (slice1; segments: 3)
         Output: (PARTIAL count(DISTINCT ((b)::text))), (PARTIAL count(DISTINCT ((a)::text)))
         ->  Partial Aggregate
               Output: PARTIAL count(DISTINCT ((b)::text)), PARTIAL count(DISTINCT ((a)::text))
               ->  Redistribute Motion 3:3  (slice2; segments: 3)
                     Output: b, a, ((b)::text), ((a)::text), (AggExprId)
                     Hash Key: ((b)::text), a, (AggExprId)
                     ->  Streaming HashAggregate
                           Output: b, a, ((b)::text), ((a)::text), (AggExprId)
                           Group Key: AggExprId, ((dqa_f3.b)::text), ((dqa_f3.a)::text)
                           ->  TupleSplit
                                 Output: b, a, ((b)::text), ((a)::text), AggExprId
                                 Split by Col: (((dqa_f3.b)::text)), (((dqa_f3.a)::text))
                                 ->  Seq Scan on public.dqa_f3
                                       Output: b, a, (b)::text, (a)::text
 Optimizer: Postgres query optimizer
(18 rows)

higuoxing avatar Sep 22 '22 00:09 higuoxing