
IN list is rewritten into = ANY(ARRAY) and leads to a bad DuckDB plan

Open zhousun opened this issue 3 months ago • 3 comments

What happens?

An IN list in Postgres is represented as = ANY(ARRAY). DuckDB then turns it into a semi join, which disables many optimizations and performs much worse than a simple IN list, especially when combined with joins.

To Reproduce

create TEMP table t(a int) using duckdb;
insert into t values (1);
explain select * from t where a in (1, 2);

vscode (pid: 1323) =# explain select * from t where a in (1,2);
                         QUERY PLAN                         
------------------------------------------------------------
 Custom Scan (DuckDBScan)  (cost=0.00..0.00 rows=0 width=0)
   DuckDB Execution Plan: 
 
 ┌───────────────────────────┐
 │         HASH_JOIN         │
 │    ────────────────────   │
 │      Join Type: SEMI      │
 │     Conditions: a = #0    ├──────────────┐
 │                           │              │
 │          ~0 Rows          │              │
 └─────────────┬─────────────┘              │
 ┌─────────────┴─────────────┐┌─────────────┴─────────────┐
 │         SEQ_SCAN          ││         PROJECTION        │
 │    ────────────────────   ││    ────────────────────   │
 │          Table: t         ││  UNNEST(list_value(1, 2)) │
 │   Type: Sequential Scan   ││                           │
 │       Projections: a      ││                           │
 │                           ││                           │
 │          ~1 Rows          ││          ~1 Rows          │
 └───────────────────────────┘└─────────────┬─────────────┘
                              ┌─────────────┴─────────────┐
                              │           UNNEST          │
                              └─────────────┬─────────────┘
                              ┌─────────────┴─────────────┐
                              │         DUMMY_SCAN        │
                              └───────────────────────────┘
 
 
(28 rows)

OS:

macOS

pg_duckdb Version (if built from source use commit hash):

latest

Postgres Version (if built from source use commit hash):

17

Hardware:

No response

Full Name:

Zhou Sun

Affiliation:

Mooncake Labs

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Not applicable - the reproduction does not require a data set

Did you include all code required to reproduce the issue?

  • [x] Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Linux distribution) to reproduce the issue?

  • [x] Yes, I have

zhousun avatar Aug 27 '25 21:08 zhousun

I think this is hard/impossible to change in pg_duckdb. It sounds like an easier approach would be to have DuckDB optimize this query plan better.

JelteF avatar Aug 28 '25 07:08 JelteF

I believe it should be doable in pg_duckdb, since pg_duckdb controls the whole to_sql logic via pg_ruleutils. If the node is a ScalarArrayOpExpr and the RHS is a literal array, to_sql it as <lhs> IN (<elem1>, <elem2>, ...) instead of <lhs> = ANY('{...}'::<type>[]).
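The proposed deparse rewrite can be sketched like this. This is only an illustration in Python (pg_duckdb's actual deparsing lives in its C copy of pg_ruleutils), and the function name deparse_scalar_array_op is hypothetical:

```python
def deparse_scalar_array_op(lhs_sql, op, array_literals):
    """Hypothetical sketch: deparse a ScalarArrayOpExpr whose RHS is a
    constant array literal as an IN list, so DuckDB sees a plain filter
    instead of = ANY(ARRAY), which it plans as a semi join."""
    elems = ", ".join(array_literals)
    if op == "=":
        return f"{lhs_sql} IN ({elems})"
    if op == "<>":
        return f"{lhs_sql} NOT IN ({elems})"
    # Other operators (e.g. >, <) keep the generic ANY form.
    return f"{lhs_sql} {op} ANY(ARRAY[{elems}])"
```

For the repro above, this would emit "a IN (1, 2)" rather than "a = ANY('{1,2}'::int4[])".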

zhousun avatar Aug 28 '25 17:08 zhousun

I'm trying to limit the amount of changes from the upstream pg_ruleutils. So I'd much rather have this optimization added to DuckDB, instead of working around it by generating different SQL. That way projects other than pg_duckdb benefit from it as well. So I'd say open a PR to upstream DuckDB that adds this optimization, or at least report the issue there so that they know an optimization is missing.

JelteF avatar Sep 24 '25 08:09 JelteF