postgres_scanner
postgres_scanner copied to clipboard
New ATTACH very slow compared to postgres_attach
What happens?
When connecting to our production database, which has hundreds of schemas, each with over a hundred tables, using the new ATTACH command takes much longer than the now deprecated postgres_attach
function.
We've even tried passing in --search_path=my_schema
as an option to the connection string, but didn't see any real improvement.
Having the pg_debug_show_queries=true
and running the initial queries on a separate connection, we see that this query
SELECT pg_namespace.oid AS namespace_id, relname, relpages, attname,
pg_type.typname type_name, atttypmod type_modifier, pg_attribute.attndims ndim,
attnum, pg_attribute.attnotnull AS notnull, NULL constraint_id,
NULL constraint_type, NULL constraint_key
FROM pg_class
JOIN pg_namespace ON relnamespace = pg_namespace.oid
JOIN pg_attribute ON pg_class.oid=pg_attribute.attrelid
JOIN pg_type ON atttypid=pg_type.oid
WHERE attnum > 0 AND relkind IN ('r', 'v', 'm', 'f', 'p')
UNION ALL
SELECT pg_namespace.oid AS namespace_id, relname, NULL relpages, NULL attname, NULL type_name,
NULL type_modifier, NULL ndim, NULL attnum, NULL AS notnull,
pg_constraint.oid AS constraint_id, contype AS constraint_type,
conkey AS constraint_key
FROM pg_class
JOIN pg_namespace ON relnamespace = pg_namespace.oid
JOIN pg_constraint ON (pg_class.oid=pg_constraint.conrelid)
WHERE relkind IN ('r', 'v', 'm', 'f', 'p') AND contype IN ('p', 'u')
ORDER BY namespace_id, relname, attnum, constraint_id;
in particular takes the longest to run when attaching the database. Is there any way to optimize this query? Perhaps allowing a parameter to specify only some schemas to load?
To Reproduce
INSTALL postgres;
LOAD postgres;
SET pg_debug_show_queries = true;
ATTACH '{self._psql_string}' AS postgres_db (TYPE POSTGRES, READ_ONLY);
The ATTACH statement takes about 6 minutes on average every time, compared to the previous method
INSTALL postgres;
LOAD postgres;
CALL postgres_attach(
'{self._psql_string}',
source_schema='{schema}',
filter_pushdown=1,
overwrite=1
);
which takes about 1 minute on average
OS:
macOS
PostgreSQL Version:
15.5
DuckDB Version:
0.10.2
DuckDB Client:
Python
Full Name:
Mauricio Novelo
Affiliation:
CampusESP
Have you tried this on the latest main
branch?
- [X] I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- [X] I agree
Not disputing your report at all, but I've had exactly the opposite experience. The new query takes ~1sec compared to multiple minutes for the old approach.
I think my db is much smaller. Tough to find a one-size solution for all the different setups out there.
I hear ya @davidgovea
My current hypothesis is that the bottleneck for us is this query
SELECT pg_namespace.oid AS namespace_id, relname, relpages, attname,
pg_type.typname type_name, atttypmod type_modifier, pg_attribute.attndims ndim,
attnum, pg_attribute.attnotnull AS notnull, NULL constraint_id,
NULL constraint_type, NULL constraint_key
FROM pg_class
JOIN pg_namespace ON relnamespace = pg_namespace.oid
JOIN pg_attribute ON pg_class.oid=pg_attribute.attrelid
JOIN pg_type ON atttypid=pg_type.oid
WHERE attnum > 0 AND relkind IN ('r', 'v', 'm', 'f', 'p')
UNION ALL
SELECT pg_namespace.oid AS namespace_id, relname, NULL relpages, NULL attname, NULL type_name,
NULL type_modifier, NULL ndim, NULL attnum, NULL AS notnull,
pg_constraint.oid AS constraint_id, contype AS constraint_type,
conkey AS constraint_key
FROM pg_class
JOIN pg_namespace ON relnamespace = pg_namespace.oid
JOIN pg_constraint ON (pg_class.oid=pg_constraint.conrelid)
WHERE relkind IN ('r', 'v', 'm', 'f', 'p') AND contype IN ('p', 'u')
ORDER BY namespace_id, relname, attnum, constraint_id;
I suspect that it executes in < 1 second for you, while for us it loads about a million records, so having a way to specify specific schemas and tables would be really helpful for us. I'm inclined to try to make a PR to that effect if this is reasonable.
Thanks for reporting! That makes sense - perhaps adding a SCHEMA
option to ATTACH
would make sense? Happy to review a PR.
+1 I've encountered the same issue.