InList is rewritten into = ANY(ARRAY) and leads to bad duckdb plan
What happens?
In List in postgres is represented as = ANY(ARRAY). Duckdb will then turn it into a semi join, which disables many optimizations and performs much worse than a simple inlist, especially when used with joins.
To Reproduce
create TEMP table t(a int)using duckdb;
insert into t values(1);
explain select * from t where a in (1,2);
vscode (pid: 1323) =# explain select * from t where a in (1,2);
QUERY PLAN
------------------------------------------------------------
Custom Scan (DuckDBScan) (cost=0.00..0.00 rows=0 width=0)
DuckDB Execution Plan:
┌───────────────────────────┐
│ HASH_JOIN │
│ ──────────────────── │
│ Join Type: SEMI │
│ Conditions: a = #0 ├──────────────┐
│ │ │
│ ~0 Rows │ │
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ SEQ_SCAN ││ PROJECTION │
│ ──────────────────── ││ ──────────────────── │
│ Table: t ││ UNNEST(list_value(1, 2)) │
│ Type: Sequential Scan ││ │
│ Projections: a ││ │
│ ││ │
│ ~1 Rows ││ ~1 Rows │
└───────────────────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ UNNEST │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ DUMMY_SCAN │
└───────────────────────────┘
(28 rows)
OS:
Mac.OS
pg_duckdb Version (if built from source use commit hash):
latest
Postgres Version (if built from source use commit hash):
17
Hardware:
No response
Full Name:
Zhou Sun
Affiliation:
Mooncake Labs
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Not applicable - the reproduction does not require a data set
Did you include all code required to reproduce the issue?
- [x] Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Linux distribution) to reproduce the issue?
- [x] Yes, I have
I think this is hard/impossible to change in pg_duckdb. Sounds like an easier approach would be to have duckdb optimize this query plan better.
I believe it should be doable in pg_duckdb, since pg_duckdb controls the whole to_sql logic via pg_ruleutils
If the node is a ScalarArrayOpExpr and the RHS is a literal array, to_sql it as:
<lhs> IN (<elem1>, <elem2>, ...)
instead of
<lhs> = ANY('{...}'::<type>[])
I'm trying to limit the amount of changes from the upstream pg_ruleutils. So I'd much rather have this optimization added to duckdb, instead of working around it by generating different sql. That way not only pg_duckdb benefits from it as well. So I'd say open a PR to upstream DuckDB that adds this optimization. Or at least report the issue there so that the know an optimization is missing.