
Support parallel DuckDB threads for Postgres table scan

Open YuweiXiao opened this issue 7 months ago • 1 comments

Currently, we use a single DuckDB thread for the Postgres table scan, even though multiple Postgres workers are initialized. This creates a performance bottleneck when scanning large amounts of data.

This PR parallelizes the conversion from Postgres tuples to DuckDB data chunks. Below are benchmark results on a 5 GB TPC-H lineitem table.

  • Benchmark query: `select * from lineitem order by 1 limit 1`
  • Other GUC settings: `duckdb.max_workers_per_postgres_scan = 2`

| Threads (`duckdb.threads_for_postgres_scan`) | Cost (seconds) |
|---|---|
| 1 | 15.8 |
| 2 | 8.7 |
| 4 | 5.8 |

YuweiXiao avatar May 07 '25 07:05 YuweiXiao

@JelteF Thanks for the review! Yes, targeting 1.1.0 is reasonable.

YuweiXiao avatar May 07 '25 12:05 YuweiXiao

Do you plan on addressing the review feedback? I'm considering merging this for 1.0 anyway if it's in a good state.

JelteF avatar May 30 '25 09:05 JelteF

Yeah, that would be nice! Let me resolve the conflict first.

YuweiXiao avatar May 30 '25 09:05 YuweiXiao

@JelteF Hey, the restrictions on unsafe types like JSON/LIST have been removed by converting Postgres slots into DuckDB data chunks in a columnar fashion. If any other unsafe type is supported in the future, one only needs to add it to `IsThreadSafeTypeForPostgresToDuckDB`.

By the way, the columnar conversion could be optimized further by eliminating the per-value if-else branches (and switch statements), though that may require a large amount of refactoring.

YuweiXiao avatar Jun 09 '25 07:06 YuweiXiao