postgres_scanner icon indicating copy to clipboard operation
postgres_scanner copied to clipboard

New ATTACH very slow compared to postgres_attach

Open mnovelo opened this issue 10 months ago • 4 comments

What happens?

When connecting to our production database, which has hundreds of schemas, each with over a hundred tables, using the new ATTACH command takes much longer than the now deprecated postgres_attach function.

We've even tried passing in --search_path=my_schema as an option to the connection string, but didn't see any real improvement.

Having the pg_debug_show_queries=true and running the initial queries on a separate connection, we see that this query

SELECT pg_namespace.oid AS namespace_id, relname, relpages, attname,
    pg_type.typname type_name, atttypmod type_modifier, pg_attribute.attndims ndim,
    attnum, pg_attribute.attnotnull AS notnull, NULL constraint_id,
    NULL constraint_type, NULL constraint_key
FROM pg_class
JOIN pg_namespace ON relnamespace = pg_namespace.oid
JOIN pg_attribute ON pg_class.oid=pg_attribute.attrelid
JOIN pg_type ON atttypid=pg_type.oid
WHERE attnum > 0 AND relkind IN ('r', 'v', 'm', 'f', 'p')
UNION ALL
SELECT pg_namespace.oid AS namespace_id, relname, NULL relpages, NULL attname, NULL type_name,
    NULL type_modifier, NULL ndim, NULL attnum, NULL AS notnull,
    pg_constraint.oid AS constraint_id, contype AS constraint_type,
    conkey AS constraint_key
FROM pg_class
JOIN pg_namespace ON relnamespace = pg_namespace.oid
JOIN pg_constraint ON (pg_class.oid=pg_constraint.conrelid)
WHERE relkind IN ('r', 'v', 'm', 'f', 'p') AND contype IN ('p', 'u')
ORDER BY namespace_id, relname, attnum, constraint_id;

in particular takes the longest to run when attaching the database. Is there any way to optimize this query? Perhaps allowing a parameter to specify only some schemas to load?

To Reproduce

INSTALL postgres;
LOAD postgres;
SET pg_debug_show_queries = true;
ATTACH '{self._psql_string}' AS postgres_db (TYPE POSTGRES, READ_ONLY);

The ATTACH statement takes about 6 minutes on average every time, compared to the previous method

INSTALL postgres;
LOAD postgres;
CALL postgres_attach(
    '{self._psql_string}',
    source_schema='{schema}',
    filter_pushdown=1,
    overwrite=1
);

which takes about 1 minute on average

OS:

macOS

PostgreSQL Version:

15.5

DuckDB Version:

0.10.2

DuckDB Client:

Python

Full Name:

Mauricio Novelo

Affiliation:

CampusESP

Have you tried this on the latest main branch?

  • [X] I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • [X] I agree

mnovelo avatar Apr 21 '24 20:04 mnovelo

Not disputing your report at all, but I've had exactly the opposite experience. The new query takes ~1sec compared to multiple minutes for the old approach.

I think my db is much smaller. Tough to find a one-size solution for all the different setups out there.

davidgovea avatar Apr 22 '24 15:04 davidgovea

I hear ya @davidgovea

My current hypothesis is that the bottleneck for us is this query

SELECT pg_namespace.oid AS namespace_id, relname, relpages, attname,
    pg_type.typname type_name, atttypmod type_modifier, pg_attribute.attndims ndim,
    attnum, pg_attribute.attnotnull AS notnull, NULL constraint_id,
    NULL constraint_type, NULL constraint_key
FROM pg_class
JOIN pg_namespace ON relnamespace = pg_namespace.oid
JOIN pg_attribute ON pg_class.oid=pg_attribute.attrelid
JOIN pg_type ON atttypid=pg_type.oid
WHERE attnum > 0 AND relkind IN ('r', 'v', 'm', 'f', 'p')
UNION ALL
SELECT pg_namespace.oid AS namespace_id, relname, NULL relpages, NULL attname, NULL type_name,
    NULL type_modifier, NULL ndim, NULL attnum, NULL AS notnull,
    pg_constraint.oid AS constraint_id, contype AS constraint_type,
    conkey AS constraint_key
FROM pg_class
JOIN pg_namespace ON relnamespace = pg_namespace.oid
JOIN pg_constraint ON (pg_class.oid=pg_constraint.conrelid)
WHERE relkind IN ('r', 'v', 'm', 'f', 'p') AND contype IN ('p', 'u')
ORDER BY namespace_id, relname, attnum, constraint_id;

I suspect that it executes in < 1 second for you, while for us it loads about a million records, so having a way to specify specific schemas and tables would be really helpful for us. I'm inclined to try to make a PR to that effect if this is reasonable.

mnovelo avatar Apr 22 '24 18:04 mnovelo

Thanks for reporting! That makes sense - perhaps adding a SCHEMA option to ATTACH would make sense? Happy to review a PR.

Mytherin avatar Apr 22 '24 18:04 Mytherin

+1 I've encountered the same issue.

adriangb avatar May 31 '24 07:05 adriangb