postgres_scanner icon indicating copy to clipboard operation
postgres_scanner copied to clipboard

How to run ONLY select queries against Postgres databases?

Open AlphaJack opened this issue 5 months ago • 6 comments

Services like Cube and libraries like pgwire implement the PostgreSQL wire protocol as servers, however they only support a subset of PostgreSQL functionalities. In particular, they don't expose catalog tables, and have limited parsing capabilities for incoming queries.

I could successfully run simple SELECT queries against these servers with a variety of clients (psql, pgbench, psycopg, superset, ...), however I could not make DuckDB simplify its queries and skip helper queries against the catalog.

How can I make DuckDB only run the specified SQL query, and nothing else?

Example 1:

INSTALL postgres_scanner;
LOAD postgres_scanner;

DETACH DATABASE IF EXISTS postgres_db;

SET pg_connection_cache = FALSE;
SET pg_debug_show_queries = TRUE;
SET pg_experimental_filter_pushdown = TRUE;
SET pg_use_binary_copy = FALSE;
SET pg_use_ctid_scan = FALSE;

ATTACH 'host=localhost port=15432 password=XXXXXX' AS postgres_db (TYPE postgres);

SELECT * FROM postgres_query('postgres_db', 'SELECT orders.status FROM orders');

This results in a parse error on the server, as the query is transformed to:

COPY (SELECT "status" FROM (SELECT orders.status FROM orders) AS __unnamed_subquery  LIMIT 50001) TO STDOUT (FORMAT "binary");

Example 2:

SELECT orders.status FROM postgres_db.public.orders
Invalid Error: Failed to execute query "BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;

SELECT oid, nspname
FROM pg_namespace

ORDER BY oid;

SELECT pg_namespace.oid AS namespace_id, relname, relpages, attname,
    pg_type.typname type_name, atttypmod type_modifier, pg_attribute.attndims ndim,
    attnum, pg_attribute.attnotnull AS notnull, NULL constraint_id,
    NULL constraint_type, NULL constraint_key
FROM pg_class
JOIN pg_namespace ON relnamespace = pg_namespace.oid
JOIN pg_attribute ON pg_class.oid=pg_attribute.attrelid
JOIN pg_type ON atttypid=pg_type.oid
WHERE attnum > 0 AND relkind IN ('r', 'v', 'm', 'f', 'p') 
UNION ALL
SELECT pg_namespace.oid AS namespace_id, relname, NULL relpages, NULL attname, NULL type_name,
    NULL type_modifier, NULL ndim, NULL attnum, NULL AS notnull,
    pg_constraint.oid AS constraint_id, contype AS constraint_type,
    conkey AS constraint_key
FROM pg_class
JOIN pg_namespace ON relnamespace = pg_namespace.oid
JOIN pg_constraint ON (pg_class.oid=pg_constraint.conrelid)
WHERE relkind IN ('r', 'v', 'm', 'f', 'p') AND contype IN ('p', 'u') 
ORDER BY namespace_id, relname, attnum, constraint_id;

SELECT 0 AS oid, 0 AS enumtypid, '' AS typname, '' AS enumlabel
LIMIT 0;

SELECT n.oid, t.typrelid AS id, t.typname as type, pg_attribute.attname, sub_type.typname
FROM pg_type t
JOIN pg_catalog.pg_namespace n ON n.oid = t.typnamespace
JOIN pg_class ON pg_class.oid = t.typrelid
JOIN pg_attribute ON attrelid=t.typrelid
JOIN pg_type sub_type ON (pg_attribute.atttypid=sub_type.oid)
WHERE pg_class.relkind = 'c'
AND t.typtype='c'

ORDER BY n.oid, t.oid, attrelid, attnum;

SELECT pg_namespace.oid, tablename, indexname
FROM pg_indexes
JOIN pg_namespace ON (schemaname=nspname)

ORDER BY pg_namespace.oid;
": ERROR:  Initial planning error: Error during planning: Table or CTE with name 'pg_indexes' not found
QUERY: SELECT pg_namespace.oid, tablename, indexname FROM pg_indexes JOIN pg_namespace ON (schemaname = nspname) ORDER BY pg_namespace.oid

AlphaJack avatar Jul 02 '25 18:07 AlphaJack

This should be possible since https://github.com/duckdb/duckdb-postgres/pull/346 by specifying the text protocol to be used instead of the binary protocol:

SET pg_use_text_protocol=true;

Mytherin avatar Jul 17 '25 10:07 Mytherin

Thank you, when will this be released and available with INSTALL postgres_scanner?

AlphaJack avatar Jul 18 '25 11:07 AlphaJack

Next release - but you can try the nightly (FORCE INSTALL postgres FROM core_nightly).

Mytherin avatar Jul 18 '25 14:07 Mytherin

I'm experiencing a similar issue to the one described here.

The PostgreSQL schema I'm trying to connect to has a large number of tables (in the thousands), which causes the catalog table access in Example 2 to take several seconds. However, the actual SQL query I want to execute takes less than a second.

I would like to execute only my intended SQL query without accessing the catalog tables. I tried using SET pg_use_text_protocol=true;, but in Example 2, catalog tables were still being accessed.

Is there a way to access only the intended SQL query in Example 2, bypassing the catalog table access entirely?

ktny avatar Jul 21 '25 04:07 ktny

You can use postgres_scan to execute only a query

Mytherin avatar Jul 21 '25 07:07 Mytherin

Thanks. I tried it and postgres_scan avoids access to catalog table.

ktny avatar Jul 21 '25 09:07 ktny