postgres_scanner
How to speed up queries with DuckDB connected to PostgreSQL DB using postgres_scanner?
I took inspiration from https://duckdb.org/2022/09/30/postgres-scanner.html and set up a connection between DuckDB and our Supabase project. Simple queries, like the one listed below, take an average of 40s to complete. I believe this is quite slow compared to the performance benchmarks in the post.
import time

import duckdb
from duckdb import DuckDBPyConnection

DB_URL = "postgresql://postgres:<password>@<host>:<port>/<database_name>"

def connect_duckdb_supabase() -> DuckDBPyConnection:
    con = duckdb.connect()
    con.execute("INSTALL postgres_scanner")
    con.execute("LOAD postgres_scanner")
    con.execute(f"CALL postgres_attach('{DB_URL}')")
    return con

if __name__ == "__main__":
    con = connect_duckdb_supabase()
    for _ in range(5):
        start = time.time()
        res = con.execute("SELECT * FROM index WHERE label='fine'").fetchall()
        print(f"duration: {time.time() - start}")
I'm running DuckDB (v0.9.1) on a VM with 1TB RAM and 172 CPUs. Our Supabase PostgreSQL DB (v15.1.0.129) is backed by 1GB RAM, 2 vCPUs, and 8GB file storage. Both the VM and the DB are in the same region (us-east-1), and my VM has 10Gbps internet bandwidth.
The index table has 15M rows (4.3GB in size) with the following schema: id (primary key, text), label (text), classifier_softmax (float[]).
My understanding is that DuckDB will automatically choose the maximum number of threads based on os.cpu_count.
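In case it matters, this is how I've been checking and pinning the thread count (a minimal sketch; threads is a standard DuckDB setting and current_setting is a built-in DuckDB function):

import duckdb

con = duckdb.connect()
# Show how many threads DuckDB picked by default
print(con.execute("SELECT current_setting('threads')").fetchone()[0])
# Pin it explicitly if needed
con.execute("SET threads TO 32")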
Any help on how to speed up and optimise my analytic queries will be greatly appreciated!
For point queries you might want to enable filter pushdown in the attach, e.g.:
CALL postgres_attach('{DB_URL}', filter_pushdown=true);
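A minimal sketch of how that looks in the Python setup from the question (the connection string is a placeholder):

import duckdb

DB_URL = "postgresql://postgres:<password>@<host>:<port>/<database_name>"

con = duckdb.connect()
con.execute("INSTALL postgres_scanner")
con.execute("LOAD postgres_scanner")
# filter_pushdown=true lets WHERE clauses be evaluated inside Postgres
# instead of transferring the whole table over the network first
con.execute(f"CALL postgres_attach('{DB_URL}', filter_pushdown=true)")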
Also note that the benchmarks run in the blog post have both DuckDB and Postgres running on the same machine, which means the data transfer is significantly faster than it would be if they were connected over a network connection.
I tried the filter_pushdown=true flag, but there was no noticeable difference.
Also note that the benchmarks run in the blog post have both DuckDB and Postgres running on the same machine
I see. Besides this feature flag, are there any other optimisations I can make?
@sidroopdaska Have you found a solution or a reason for the performance problem? I also have performance issues with a large table (118 million rows), where even simple queries run indefinitely:
SELECT * FROM large_table WHERE id = 1234;
Have you already tried enabling SET pg_experimental_filter_pushdown=true?
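For reference, a minimal sketch of how to compare the query with and without the flag from Python (the connection string and table name are placeholders):

import duckdb
import time

con = duckdb.connect()
con.execute("INSTALL postgres_scanner")
con.execute("LOAD postgres_scanner")
con.execute("CALL postgres_attach('postgresql://...')")  # fill in your connection string

for flag in ("false", "true"):
    # Toggle experimental filter pushdown and time the same point query
    con.execute(f"SET pg_experimental_filter_pushdown={flag}")
    start = time.time()
    con.execute("SELECT * FROM large_table WHERE id = 1234").fetchall()
    print(f"pushdown={flag}: {time.time() - start:.2f}s")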
Yes, just did that. Thank you for mentioning this. It was not clear to me that
SELECT * FROM large_table WHERE id = 1234;
is converted into
COPY (
    SELECT "id" FROM "public"."large_table"
    WHERE ctid BETWEEN '(0,0)'::tid AND '(1000,0)'::tid
) TO STDOUT (FORMAT binary);
In my case,
SET pg_experimental_filter_pushdown=TRUE;
led to
COPY (
SELECT "id"
FROM "public"."large_table"
WHERE ctid BETWEEN '(0,0)'::tid AND '(1000,0)'::tid AND ("id" = '1234' AND "id" IS NOT NULL)
) TO STDOUT (FORMAT binary);
Also, increasing
SET pg_pages_per_task = 100000;
sped up the query a lot.
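To make the combination concrete, this is roughly what the session looks like with both settings applied (a sketch; the connection string and table name are placeholders):

import duckdb

con = duckdb.connect()
con.execute("INSTALL postgres_scanner")
con.execute("LOAD postgres_scanner")
con.execute("CALL postgres_attach('postgresql://...')")  # fill in your connection string
# Push WHERE filters into the generated COPY queries
con.execute("SET pg_experimental_filter_pushdown=true")
# Larger ctid ranges per task mean fewer COPY round trips
con.execute("SET pg_pages_per_task=100000")
res = con.execute("SELECT * FROM large_table WHERE id = 1234").fetchall()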
Similar issue here. I think my ctids are pretty fragmented due to heavy writes to the table. We're using RDS with Postgres 15.
I'm fetching about 5 million rows in 12s when I use a single batch (setting pg_pages_per_task above 5 million), but as soon as it starts paging, the process takes much longer: setting it to 3 million (for 2 batches) takes about 3x as long.
It would be nice to be able to turn off paging altogether, or to specify a different numerical partition key (connector-x allows this).
Yup, you can add a fourth user reporting this problem.
And a fifth. I was not aware of the experimental filter pushdown flag, but the extra overhead from paging made the connector far too slow to use for day-to-day queries on large tables.
I think that this new option might resolve this issue:
- https://github.com/duckdb/postgres_scanner/pull/195
SET pg_use_ctid_scan=false;
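If that option is available in your build, usage from Python would presumably look like this (a sketch; the connection string and table name are placeholders):

import duckdb

con = duckdb.connect()
con.execute("INSTALL postgres_scanner")
con.execute("LOAD postgres_scanner")
con.execute("CALL postgres_attach('postgresql://...')")  # fill in your connection string
# Skip the ctid-range paging entirely (the option from PR #195)
con.execute("SET pg_use_ctid_scan=false")
con.execute("SET pg_experimental_filter_pushdown=true")
res = con.execute("SELECT * FROM large_table WHERE id = 1234").fetchall()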
FWIW we have been using this workaround in prod for a while:
SET pg_pages_per_task=9876543;
-- silly huge number that effectively disables paging
but we'll move to the new option as soon as it's available (not sure if it's landed in a release yet, and we're not on the latest 0.10.x).
I tested this in a sandbox, and SET pg_use_ctid_scan=false along with SET pg_experimental_filter_pushdown=true worked well for WHERE clauses. Aggregates didn't push down, but I wasn't expecting them to.