postgres_scanner
How to speed up queries with DuckDB connected to PostgreSQL DB using postgres_scanner?
I took inspiration from https://duckdb.org/2022/09/30/postgres-scanner.html and set up a connection between DuckDB and our Supabase project. Simple queries, like the one listed below, take an average of 40s to complete. I believe this is quite slow compared to the performance benchmarks in the post.
import time

import duckdb
from duckdb import DuckDBPyConnection

DB_URL = "postgresql://postgres:<password>@<host>:<port>/<database_name>"

def connect_duckdb_supabase() -> DuckDBPyConnection:
    con = duckdb.connect()
    con.execute("INSTALL postgres_scanner")
    con.execute("LOAD postgres_scanner")
    con.execute(f"CALL postgres_attach('{DB_URL}')")
    return con

if __name__ == "__main__":
    con = connect_duckdb_supabase()
    for _ in range(5):
        start = time.time()
        res = con.execute("SELECT * FROM index WHERE label='fine'").fetchall()
        print(f"duration: {time.time() - start}")
I'm running DuckDB (v0.9.1) on a VM with 1TB RAM and 172 CPUs. Our Supabase PostgreSQL DB (v15.1.0.129) is backed by 1GB RAM, 2 vCPUs, and 8GB file storage. Both the VM and the DB are in the same region (us-east-1), and my VM has 10Gbps internet bandwidth.
The index table has 15M rows (4.3GB in size) with the following schema: id (primary key, text), label (text), classifier_softmax (float[]).
My understanding is that DuckDB will automatically choose the maximum number of threads based on os.cpu_count.
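In case it matters, this is how I've been checking and pinning the thread count (a minimal sketch; threads is a standard DuckDB setting and current_setting is a built-in DuckDB function):

import duckdb

con = duckdb.connect()
# Show how many threads DuckDB picked by default
print(con.execute("SELECT current_setting('threads')").fetchone()[0])
# Pin it explicitly if needed
con.execute("SET threads TO 32")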
Any help on how to speed up and optimise my analytic queries will be greatly appreciated!
For point queries you might want to enable filter pushdown in the attach, e.g.:
CALL postgres_attach('{DB_URL}', filter_pushdown=true);
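A minimal sketch of how that looks in the Python setup from the question (the connection string is a placeholder):

import duckdb

DB_URL = "postgresql://postgres:<password>@<host>:<port>/<database_name>"

con = duckdb.connect()
con.execute("INSTALL postgres_scanner")
con.execute("LOAD postgres_scanner")
# filter_pushdown=true lets WHERE clauses be evaluated inside Postgres
# instead of transferring the whole table over the network first
con.execute(f"CALL postgres_attach('{DB_URL}', filter_pushdown=true)")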
Also note that the benchmarks run in the blog post have both DuckDB and Postgres running on the same machine, which means the data transfer is significantly faster than it would be if they were connected over a network connection.
I tried the filter_pushdown=true flag, but there was no noticeable difference.
Also note that the benchmarks run in the blog post have both DuckDB and Postgres running on the same machine
I see. Besides this feature flag, are there any other optimisations I can make?
@sidroopdaska Have you found a solution or a reason for the performance problem? I also have performance issues with a large table (118 million rows), where even simple queries run indefinitely:
SELECT * FROM large_table WHERE id = 1234;
Have you already tried enabling SET pg_experimental_filter_pushdown=true?
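For reference, a minimal sketch of how to compare the query with and without the flag from Python (the connection string and table name are placeholders):

import duckdb
import time

con = duckdb.connect()
con.execute("INSTALL postgres_scanner")
con.execute("LOAD postgres_scanner")
con.execute("CALL postgres_attach('postgresql://...')")  # fill in your connection string

for flag in ("false", "true"):
    # Toggle experimental filter pushdown and time the same point query
    con.execute(f"SET pg_experimental_filter_pushdown={flag}")
    start = time.time()
    con.execute("SELECT * FROM large_table WHERE id = 1234").fetchall()
    print(f"pushdown={flag}: {time.time() - start:.2f}s")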
Yes, just did that. Thank you for mentioning this. It was not clear to me that
SELECT * FROM large_table WHERE id = 1234;
is converted into
COPY (
    SELECT "id" FROM "public"."large_table"
    WHERE ctid BETWEEN '(0,0)'::tid AND '(1000,0)'::tid
) TO STDOUT (FORMAT binary);
In my case,
SET pg_experimental_filter_pushdown=TRUE;
led to
COPY (
SELECT "id"
FROM "public"."large_table"
WHERE ctid BETWEEN '(0,0)'::tid AND '(1000,0)'::tid AND ("id" = '1234' AND "id" IS NOT NULL)
) TO STDOUT (FORMAT binary);
Also, increasing
SET pg_pages_per_task = 100000;
sped up the query a lot.
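To make the combination concrete, this is roughly what the session looks like with both settings applied (a sketch; the connection string and table name are placeholders):

import duckdb

con = duckdb.connect()
con.execute("INSTALL postgres_scanner")
con.execute("LOAD postgres_scanner")
con.execute("CALL postgres_attach('postgresql://...')")  # fill in your connection string
# Push WHERE filters into the generated COPY queries
con.execute("SET pg_experimental_filter_pushdown=true")
# Larger ctid ranges per task mean fewer COPY round trips
con.execute("SET pg_pages_per_task=100000")
res = con.execute("SELECT * FROM large_table WHERE id = 1234").fetchall()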
Similar issue here. I think my ctids are pretty fragmented due to heavy writes to the table. We're using RDS with Postgres 15.
I'm fetching about 5 million rows in 12s when I use a single batch (setting pg_pages_per_task above 5 million), but as soon as it starts paging, the process takes much longer: setting it to 3 million (for 2 batches) takes about 3x as long.
It would be nice to be able to turn off paging altogether, or to specify a different numerical partition key (connector-x allows this).
Yup, you can add a fourth user reporting this problem.
And a fifth. I was not aware of the experimental filter pushdown flag, but the extra overhead from paging made the connector far too slow to use for day-to-day queries on large tables.
I think that this new option might resolve this issue:
- https://github.com/duckdb/postgres_scanner/pull/195
SET pg_use_ctid_scan=false;
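If that option is available in your build, usage from Python would presumably look like this (a sketch; the connection string and table name are placeholders):

import duckdb

con = duckdb.connect()
con.execute("INSTALL postgres_scanner")
con.execute("LOAD postgres_scanner")
con.execute("CALL postgres_attach('postgresql://...')")  # fill in your connection string
# Skip the ctid-range paging entirely (the option from PR #195)
con.execute("SET pg_use_ctid_scan=false")
con.execute("SET pg_experimental_filter_pushdown=true")
res = con.execute("SELECT * FROM large_table WHERE id = 1234").fetchall()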
FWIW we have been using this workaround in prod for a while:
SET pg_pages_per_task=9876543;
-- silly huge number that effectively disables paging
but we'll move to the new option as soon as it's available (not sure if it's landed in a release yet, and we're not on the latest 0.10.x).
I tested this in a sandbox, and SET pg_use_ctid_scan=false along with SET pg_experimental_filter_pushdown=true worked well for WHERE clauses. Aggregates didn't push down, but I wasn't expecting them to.