[Bug] orca: Inner Hashjoin failed with Partition selector
Apache Cloudberry version
No response
What happened
The original discussion about this issue is https://github.com/apache/cloudberry/pull/807#issuecomment-2563235169
CREATE TABLE t_clientinstrumentind2 (
tradingday text,
client_id INT,
instrumentid text,
PRIMARY KEY (tradingday, client_id, instrumentid)
)
DISTRIBUTED BY (tradingday, client_id, instrumentid)
PARTITION BY RANGE (tradingday)
(
PARTITION p2019 START ('20190101'::character varying(8)) END ('20200101'::character varying(8)) WITH (tablename='t_clientinstrumentind_2_prt_p2019', appendonly=false),
PARTITION p2020 START ('20200101'::character varying(8)) END ('20210101'::character varying(8)) WITH (tablename='t_clientinstrumentind_2_prt_p2020', appendonly=false)
);
CREATE TABLE t_clientproductind2 (
tradingday character varying(8),
productid TEXT,
clientid INT,
exchangegroup TEXT,
customertype INT ,
PRIMARY KEY (tradingday, productid, clientid, exchangegroup, customertype)
)
DISTRIBUTED BY (tradingday, productid, clientid, exchangegroup, customertype)
PARTITION BY RANGE (tradingday)
(
PARTITION p2019 START ('20190101'::character varying(8)) END ('20200101'::character varying(8)) WITH (tablename='t_clientproductind_2_prt_p2019', appendonly=false),
PARTITION p2020 START ('20200101'::character varying(8)) END ('20210101'::character varying(8)) WITH (tablename='t_clientproductind_2_prt_p2020', appendonly=false)
);
INSERT INTO t_clientinstrumentind2 (tradingday, client_id, instrumentid) VALUES
('20190715', 54982370, 'al1908'),
('20190715', 54982370, 'rb2001'),
('20190715', 54982370, 'cu1909'),
('20190715', 54982370, 'cu1908'),
('20190715', 54982370, 'zn1908'),
('20190715', 54982370, 'pb1908');
INSERT INTO t_clientproductind2 (tradingday, productid, clientid, exchangegroup, customertype) VALUES
('20190715', 'cu_f', 54982370, 'SHFE', 1),
('20190715', 'rb_f', 54982370, 'SHFE', 1),
('20190715', 'al_f', 54982370, 'SHFE', 1),
('20190715', 'zn_f', 54982370, 'SHFE', 1),
('20190715', 'pb_f', 54982370, 'SHFE', 1);
explain SELECT
*
FROM(
SELECT
tradingday,
1 AS ins_SpanInsArbitrageRatio FROM
t_clientinstrumentind2 t WHERE
t.tradingday BETWEEN '20190715'AND'20190715' GROUP BY
t.tradingday
)t1
INNER JOIN
(
SELECT
t.tradingday,
0.9233716475 AS prod_SpanInsArbitrageRatio FROM
t_clientproductind2 t WHERE
t.tradingday BETWEEN'20190715'AND '20190715' GROUP BY
t.tradingday
)t2 ON t1.tradingday = t2.tradingday;
orca cannot generate a plan for it.
What you think should happen instead
No response
How to reproduce
run the sql above
Operating System
centos7
Anything else
No response
Are you willing to submit PR?
- [X] Yes, I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct.
Step into this issue and found the root cause why the orca cannot generate a plan for it:
ROOT
Group 28 (#GExprs: 5):
0: CLogicalNAryJoin [ 13 26 27 ]
1: CLogicalInnerJoin [ 13 26 27 ]
2: CLogicalInnerJoin [ 26 13 27 ]
3: CPhysicalInnerHashJoin (High) [ 26 13 27 ]
Cost Ctxts:
4: CPhysicalInnerHashJoin (High) [ 13 26 27 ]
Cost Ctxts:
Grp OptCtxts:
0 (stage 0): (req CTEs: [], req order: [<empty> match: satisfy ], req dist: [SINGLETON (master) match: satisfy], req rewind: [], req rewind: [NONE NO-MOTION match: satisfy], req partition propagation: [<empty> match: satisfy ]) => Best Expr:
From the root group, we can see it uses inner hashjoin.
Group 26 (#GExprs: 5):
0: CLogicalProject [ 22 25 ]
1: CPhysicalComputeScalar [ 22 25 ]
Cost Ctxts:
main ctxt (stage 0)6.0, child ctxts:[1], rows:1.000000 (group), cost: 6.000413
main ctxt (stage 0)6.1, child ctxts:[3], rows:1.000000 (group), cost: 6.000413
main ctxt (stage 0)2.1, child ctxts:[0], rows:1.000000 (group), cost: 6.000502
main ctxt (stage 0)4.1, child ctxts:[2], rows:1.000000 (group), cost: 6.000842
main ctxt (stage 0)1.1, child ctxts:[1], rows:1.000000 (group), cost: 6.000413
2: CPhysicalPartitionSelector, Id: 0, Scan Id: 1, Part Table: (6.17018.1.0) [ 26 ]
Cost Ctxts:
main ctxt (stage 0)5.0, child ctxts:[6], rows:1.000000 (group), cost: 6.000429
main ctxt (stage 0)0.0, child ctxts:[2], rows:1.000000 (group), cost: 6.000518
main ctxt (stage 0)3.0, child ctxts:[4], rows:1.000000 (group), cost: 6.000858
3: CPhysicalMotionGather(master) [ 26 ]
Cost Ctxts:
main ctxt (stage 0)2.0, child ctxts:[1], rows:1.000000 (group), cost: 6.000591
4: CPhysicalMotionBroadcast [ 26 ]
Cost Ctxts:
main ctxt (stage 0)4.0, child ctxts:[1], rows:1.000000 (group), cost: 6.001272
Grp OptCtxts:
0 (stage 0): (req CTEs: [], req order: [<empty> match: satisfy ], req dist: [SINGLETON (master) match: satisfy], req rewind: [], req rewind: [NONE NO-MOTION match: satisfy], req partition propagation: [propagator<1>({}) match: satisfy ]) => Best Expr:2
1 (stage 0): (req CTEs: [], req order: [<empty> match: satisfy ], req dist: [ANY EOperatorId: 128 match: satisfy], req rewind: [], req rewind: [NONE NO-MOTION match: satisfy], req partition propagation: [<empty> match: satisfy ]) => Best Expr:1
2 (stage 0): (req CTEs: [], req order: [<empty> match: satisfy ], req dist: [SINGLETON (master) match: satisfy], req rewind: [], req rewind: [NONE NO-MOTION match: satisfy], req partition propagation: [<empty> match: satisfy ]) => Best Expr:1
3 (stage 0): (req CTEs: [], req order: [<empty> match: satisfy ], req dist: [REPLICATED match: satisfy], req rewind: [], req rewind: [NONE NO-MOTION match: satisfy], req partition propagation: [propagator<1>({}) match: satisfy ]) => Best Expr:2
4 (stage 0): (req CTEs: [], req order: [<empty> match: satisfy ], req dist: [REPLICATED match: satisfy], req rewind: [], req rewind: [NONE NO-MOTION match: satisfy], req partition propagation: [<empty> match: satisfy ]) => Best Expr:1
5 (stage 0): (req CTEs: [], req order: [<empty> match: satisfy ], req dist: [HASHED: [ CScalarIdent "tradingday" (12), nulls colocated ], opfamilies: (0.1995.1.0), match: satisfy], req rewind: [], req rewind: [NONE NO-MOTION match: satisfy], req partition propagation: [propagator<1>({}) match: satisfy ]) => Best Expr:2
6 (stage 0): (req CTEs: [], req order: [<empty> match: satisfy ], req dist: [HASHED: [ CScalarIdent "tradingday" (12), nulls colocated ], opfamilies: (0.1995.1.0), match: satisfy], req rewind: [], req rewind: [NONE NO-MOTION match: satisfy], req partition propagation: [<empty> match: satisfy ]) => Best Expr:1
7 (stage 0): (req CTEs: [], req order: [<empty> match: satisfy ], req dist: [SINGLETON (master) match: exact], req rewind: [], req rewind: [NONE NO-MOTION match: satisfy], req partition propagation: [consumer<2>({10}) match: satisfy ]) => Best Expr:
8 (stage 0): (req CTEs: [], req order: [<empty> match: satisfy ], req dist: [NON-SINGLETON match: satisfy], req rewind: [], req rewind: [NONE NO-MOTION match: satisfy], req partition propagation: [consumer<2>({10}) match: satisfy ]) => Best Expr:
9 (stage 0): (req CTEs: [], req order: [<empty> match: satisfy ], req dist: [HASHED: [ CScalarIdent "tradingday" (12), nulls colocated ], opfamilies: (0.1995.1.0), match: exact], req rewind: [], req rewind: [NONE NO-MOTION match: satisfy], req partition propagation: [consumer<2>({10}) match: satisfy ]) => Best Expr:
From the group 26 we can see that the 'partition selector' is added, but in the cost context 7,8 and 9 , no best expr is found. And in the 3 cost contexts , the partition propagation: consumer are not empty.
Group 21 (#GExprs: 7):
0: CLogicalSelect [ 14 20 ]
1: CLogicalSelect [ 29 20 ]
2: CPhysicalFilter [ 29 20 ]
Cost Ctxts:
main ctxt (stage 0)1.1, child ctxts:[0], rows:1.000000 (group), cost: 6.000371
3: CPhysicalFilter [ 14 20 ]
Cost Ctxts:
main ctxt (stage 0)8.1, child ctxts:[2], rows:1.000000 (group), cost: 431.000028
main ctxt (stage 0)1.1, child ctxts:[0], rows:1.000000 (group), cost: 431.000028
4: CPhysicalMotionHashDistribute HASHED: [ CScalarIdent "tradingday" (12), nulls colocated ], opfamilies: (0.1995.1.0), [ 21 ]
Cost Ctxts:
main ctxt (stage 0)0.0, child ctxts:[1], rows:1.000000 (group), cost: 6.000384
5: CPhysicalMotionRandom [ 21 ]
Cost Ctxts:
main ctxt (stage 0)2.0, child ctxts:[1], rows:1.000000 (group), cost: 6.000380
6: CPhysicalSort ( (0.664.1.0), "tradingday" (12), NULLsLast ) [ 21 ]
Cost Ctxts:
main ctxt (stage 0)5.0, child ctxts:[0], rows:1.000000 (group), cost: 6.000384
main ctxt (stage 0)3.0, child ctxts:[2], rows:1.000000 (group), cost: 6.000380
main ctxt (stage 0)10.0, child ctxts:[8], rows:1.000000 (group), cost: 431.000028
main ctxt (stage 0)4.0, child ctxts:[1], rows:1.000000 (group), cost: 6.000371
In group 21, motions are added. (group 21 is the child group of group 26)
in function CPhysicalInnerHashJoin::PppsRequired , it firstly derives the partition properties from child and then check if the join predicate containing the partition properties, if yes, then add partition selector operator to its children.
But in the below code
CPartitionPropagationSpec *
CPhysicalMotion::PppsRequired(CMemoryPool *mp, CExpressionHandle &,
CPartitionPropagationSpec *, ULONG,
CDrvdPropArray *, ULONG) const
{
// A motion is a hard barrier for partition propagation since it executes in a
// different slice; and thus it cannot require this property from its child
return GPOS_NEW(mp) CPartitionPropagationSpec(mp);
}
the motion node will not propagate any partition info to its parent nodes. Then the partition selector cannot work
when function CPhysicalInnerHashJoin::PppsRequired called, the enforce node are not added yet, it can derive the partition properties from its current children sucessfully, but when any motion node is added below the partition selector, the plan cannot be generated.
Work around:
set optimizer_enable_partition_selection=off;
this can make the orca work for the above query.
But better to resolve it without setting the GUC manually. If we can know there will be any motion node added below the inner hashjoin, we can do it.