sqls
perf: optimize pg describe tables query
Summary
The describe-tables query for Postgres joined several information_schema views, which is inefficient and especially notorious on large databases. Instead, use the catalog tables directly.
Set up the environment
We will create a PostgreSQL 12 database and populate it with 200 tables, each with a primary key and a foreign key, using the following init.sql script:
-- init.sql
CREATE TABLE IF NOT EXISTS public.client_types (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255)
);

-- create multiple tables with foreign keys in public
DO $$
DECLARE
    table_prefix TEXT := 'client_catalog_';
    number_of_tables_to_create INT := 200;
BEGIN
    -- create tables
    FOR i IN 1..number_of_tables_to_create LOOP
        EXECUTE format('
            CREATE TABLE IF NOT EXISTS public.%I (
                id SERIAL PRIMARY KEY,
                name VARCHAR(255),
                type_id INTEGER NOT NULL,
                FOREIGN KEY(type_id) REFERENCES client_types(id)
            )', table_prefix || i::text);
    END LOOP;
END $$;
First, start a Postgres server that runs init.sql at initialization:
$ docker run --rm -p 15432:5432 -e POSTGRES_USER=example_user -e POSTGRES_PASSWORD=example_password -e POSTGRES_DB=example_db -v $PWD/init.sql:/docker-entrypoint-initdb.d/init.sql:ro postgres:12-alpine
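Optionally, once the server is up, we can sanity-check that init.sql ran by counting the ordinary tables in public (this check is just a suggestion, not part of the proposal); the count should be 201: client_types plus the 200 generated tables.

```sql
-- optional sanity check: expect 201 (client_types + 200 generated tables)
SELECT count(*)
FROM pg_catalog.pg_class
WHERE relnamespace = 'public'::regnamespace
  AND relkind = 'r';  -- ordinary tables only
```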
Proposal
Let's first test that both the current query and the new proposal produce the same result.
The current query joins several information_schema views, and the given filters don't allow trimming results before the join. This is the current query:
-- current_query.sql
SELECT
    c.table_schema,
    c.table_name,
    c.column_name,
    c.data_type,
    c.is_nullable,
    CASE t.constraint_type
        WHEN 'PRIMARY KEY' THEN 'YES'
        ELSE 'NO'
    END,
    c.column_default,
    ''
FROM
    information_schema.columns c
    LEFT JOIN (
        SELECT
            ccu.table_schema AS table_schema,
            ccu.table_name AS table_name,
            ccu.column_name AS column_name,
            tc.constraint_type AS constraint_type
        FROM information_schema.constraint_column_usage ccu
        LEFT JOIN information_schema.table_constraints tc ON
            tc.table_schema = ccu.table_schema
            AND tc.table_name = ccu.table_name
            AND tc.constraint_name = ccu.constraint_name
        WHERE
            ccu.table_schema = 'public'
            AND tc.constraint_type = 'PRIMARY KEY'
    ) AS t
        ON c.table_schema = t.table_schema
        AND c.table_name = t.table_name
        AND c.column_name = t.column_name
WHERE
    c.table_schema = 'public'
ORDER BY
    c.table_name,
    c.ordinal_position
The optimized query uses the catalog tables directly (it was built from the information_schema view definitions), so results are filtered before the join:
-- new_query.sql
SELECT DISTINCT ON (table_schema, table_name, a.attnum)
    c1.relnamespace::regnamespace AS table_schema,
    c1.relname AS table_name,
    a.attname AS column_name,
    CASE
        WHEN t.typtype = 'd'::"char" THEN
            CASE
                WHEN bt.typelem <> 0::oid AND bt.typlen = '-1'::integer THEN 'ARRAY'::text
                WHEN nbt.nspname = 'pg_catalog'::name THEN format_type(t.typbasetype, NULL::integer)
                ELSE 'USER-DEFINED'::text
            END
        ELSE
            CASE
                WHEN t.typelem <> 0::oid AND t.typlen = '-1'::integer THEN 'ARRAY'::text
                WHEN nt.nspname = 'pg_catalog'::name THEN format_type(a.atttypid, NULL::integer)
                ELSE 'USER-DEFINED'::text
            END
    END::information_schema.character_data AS data_type,
    CASE
        WHEN a.attnotnull OR t.typtype = 'd'::"char" AND t.typnotnull THEN 'NO'::text
        ELSE 'YES'::text
    END::information_schema.yes_or_no AS is_nullable,
    CASE
        WHEN conn.contype = 'p' THEN 'YES'
        ELSE 'NO'
    END AS is_primary_key,
    CASE
        WHEN a.attgenerated = ''::"char" THEN pg_get_expr(ad.adbin, ad.adrelid)
        ELSE NULL::text
    END::information_schema.character_data AS column_default,
    ''
FROM pg_catalog.pg_class c1
    JOIN pg_catalog.pg_attribute a ON a.attrelid = c1.oid
    JOIN (pg_type t
        JOIN pg_namespace nt ON t.typnamespace = nt.oid) ON a.atttypid = t.oid
    LEFT JOIN (pg_type bt
        JOIN pg_namespace nbt ON bt.typnamespace = nbt.oid) ON t.typtype = 'd'::char
        AND t.typbasetype = bt.oid
    LEFT JOIN pg_catalog.pg_constraint conn ON conn.conrelid = c1.oid
        AND a.attnum = ANY (conn.conkey)
        AND conn.contype = 'p'
    LEFT JOIN pg_attrdef ad ON a.attrelid = ad.adrelid
        AND a.attnum = ad.adnum
WHERE c1.relnamespace = 'public'::regnamespace::oid
    AND c1.relkind = ANY (ARRAY['p', 'r', 'v'])
    AND a.attnum > 0
    AND (pg_has_role(c1.relowner, 'USAGE'::text)
        OR has_column_privilege(c1.oid, a.attnum, 'SELECT, INSERT, UPDATE, REFERENCES'::text))
ORDER BY table_name, a.attnum
We can compare the results via the following commands:
$ psql postgres://example_user:example_password@localhost:15432/example_db -t -A < current_query.sql > current_query_results
$ psql postgres://example_user:example_password@localhost:15432/example_db -t -A < new_query.sql > new_query_results
$ diff -s current_query_results new_query_results
Files current_query_results and new_query_results are identical
Now that we know that both produce the same result, let's check the performance.
Performance analysis
First, the performance for the current query:
$ hyperfine 'psql postgres://example_user:example_password@localhost:15432/example_db -t -A < current_query.sql'
Benchmark 1: psql postgres://example_user:example_password@localhost:15432/example_db -t -A < current_query.sql
Time (mean ± σ): 502.3 ms ± 17.4 ms [User: 39.3 ms, System: 9.8 ms]
Range (min … max): 472.7 ms … 527.5 ms 10 runs
And the result for the new query:
$ hyperfine 'psql postgres://example_user:example_password@localhost:15432/example_db -t -A < new_query.sql'
Benchmark 1: psql postgres://example_user:example_password@localhost:15432/example_db -t -A < new_query.sql
Time (mean ± σ): 74.5 ms ± 16.2 ms [User: 37.1 ms, System: 5.7 ms]
Range (min … max): 54.4 ms … 115.9 ms 43 runs
Do note that if we create the database with 400 tables instead of 200 (by setting number_of_tables_to_create to 400 in init.sql), the results change greatly:
$ hyperfine 'psql postgres://example_user:example_password@localhost:15432/example_db -t -A < current_query.sql'
Benchmark 1: psql postgres://example_user:example_password@localhost:15432/example_db -t -A < current_query.sql
Time (mean ± σ): 2.258 s ± 0.457 s [User: 0.038 s, System: 0.008 s]
Range (min … max): 1.683 s … 2.679 s 10 runs
$ hyperfine 'psql postgres://example_user:example_password@localhost:15432/example_db -t -A < new_query.sql'
Benchmark 1: psql postgres://example_user:example_password@localhost:15432/example_db -t -A < new_query.sql
Time (mean ± σ): 95.6 ms ± 21.2 ms [User: 40.4 ms, System: 9.6 ms]
Range (min … max): 67.5 ms … 130.4 ms 22 runs
There is a big jump in the current query's runtime, but not for the new proposal.
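For reference, dividing the hyperfine means reported above gives roughly a 6.7× speedup at 200 tables and about 23× at 400:

```shell
# Speedup ratios computed from the hyperfine means above (milliseconds).
awk 'BEGIN { printf "%.1f\n", 502.3 / 74.5 }'   # 200 tables: prints 6.7
awk 'BEGIN { printf "%.1f\n", 2258 / 95.6 }'    # 400 tables: prints 23.6
```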