sqls
perf: optimize pg describe tables query
Summary
The describe-tables query for Postgres joined several information_schema views, which is inefficient and especially notorious on large databases. Instead, use the catalog tables directly.
Set up the environment
We will create a PostgreSQL 12 database and populate it with 200 tables, each with a primary key and a foreign key, using the following init.sql script:
-- init.sql
CREATE TABLE IF NOT EXISTS public.client_types (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255)
);

-- create multiple tables with foreign keys in public
DO $$
DECLARE
    table_prefix TEXT := 'client_catalog_';
    number_of_tables_to_create INT := 200;
BEGIN
    -- create tables
    FOR i IN 1..number_of_tables_to_create LOOP
        EXECUTE format('
            CREATE TABLE IF NOT EXISTS public.%I (
                id SERIAL PRIMARY KEY,
                name VARCHAR(255),
                type_id INTEGER NOT NULL,
                FOREIGN KEY(type_id) REFERENCES client_types(id)
            )', table_prefix || i::text);
    END LOOP;
END $$;
First, start a Postgres server that runs init.sql at initialization:
$ docker run --rm -p 15432:5432 -e POSTGRES_USER=example_user -e POSTGRES_PASSWORD=example_password -e POSTGRES_DB=example_db -v $PWD/init.sql:/docker-entrypoint-initdb.d/init.sql:ro postgres:12-alpine
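Optionally, once the server is up, we can sanity-check that init.sql ran by counting the ordinary tables in public (this check is just a suggestion, not part of the proposal); the count should be 201: client_types plus the 200 generated tables.

```sql
-- optional sanity check: expect 201 (client_types + 200 generated tables)
SELECT count(*)
FROM pg_catalog.pg_class
WHERE relnamespace = 'public'::regnamespace
  AND relkind = 'r';  -- ordinary tables only
```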
Proposal
Let's first test that both the current query and the new proposal produce the same result.
The current query joins several information_schema views, and the given filters don't allow trimming results before the join. This is the current query:
-- current_query.sql
SELECT
    c.table_schema,
    c.table_name,
    c.column_name,
    c.data_type,
    c.is_nullable,
    CASE t.constraint_type
        WHEN 'PRIMARY KEY' THEN 'YES'
        ELSE 'NO'
    END,
    c.column_default,
    ''
FROM
    information_schema.columns c
    LEFT JOIN (
        SELECT
            ccu.table_schema AS table_schema,
            ccu.table_name AS table_name,
            ccu.column_name AS column_name,
            tc.constraint_type AS constraint_type
        FROM information_schema.constraint_column_usage ccu
        LEFT JOIN information_schema.table_constraints tc ON
            tc.table_schema = ccu.table_schema
            AND tc.table_name = ccu.table_name
            AND tc.constraint_name = ccu.constraint_name
        WHERE
            ccu.table_schema = 'public'
            AND tc.constraint_type = 'PRIMARY KEY'
    ) AS t
        ON c.table_schema = t.table_schema
        AND c.table_name = t.table_name
        AND c.column_name = t.column_name
WHERE
    c.table_schema = 'public'
ORDER BY
    c.table_name,
    c.ordinal_position
The optimized query uses the catalog tables directly (it was built from the information_schema view definitions), so results are filtered before the join:
-- new_query.sql
SELECT DISTINCT ON (table_schema, table_name, a.attnum)
    c1.relnamespace::regnamespace AS table_schema,
    c1.relname AS table_name,
    a.attname AS column_name,
    CASE
        WHEN t.typtype = 'd'::"char" THEN
            CASE
                WHEN bt.typelem <> 0::oid AND bt.typlen = '-1'::integer THEN 'ARRAY'::text
                WHEN nbt.nspname = 'pg_catalog'::name THEN format_type(t.typbasetype, NULL::integer)
                ELSE 'USER-DEFINED'::text
            END
        ELSE
            CASE
                WHEN t.typelem <> 0::oid AND t.typlen = '-1'::integer THEN 'ARRAY'::text
                WHEN nt.nspname = 'pg_catalog'::name THEN format_type(a.atttypid, NULL::integer)
                ELSE 'USER-DEFINED'::text
            END
    END::information_schema.character_data AS data_type,
    CASE
        WHEN a.attnotnull OR t.typtype = 'd'::"char" AND t.typnotnull THEN 'NO'::text
        ELSE 'YES'::text
    END::information_schema.yes_or_no AS is_nullable,
    CASE
        WHEN conn.contype = 'p' THEN 'YES'
        ELSE 'NO'
    END AS is_primary_key,
    CASE
        WHEN a.attgenerated = ''::"char" THEN pg_get_expr(ad.adbin, ad.adrelid)
        ELSE NULL::text
    END::information_schema.character_data AS column_default,
    ''
FROM pg_catalog.pg_class c1
    JOIN pg_catalog.pg_attribute a ON a.attrelid = c1.oid
    JOIN (pg_type t
        JOIN pg_namespace nt ON t.typnamespace = nt.oid) ON a.atttypid = t.oid
    LEFT JOIN (pg_type bt
        JOIN pg_namespace nbt ON bt.typnamespace = nbt.oid) ON t.typtype = 'd'::char
        AND t.typbasetype = bt.oid
    LEFT JOIN pg_catalog.pg_constraint conn ON conn.conrelid = c1.oid
        AND a.attnum = ANY (conn.conkey)
        AND conn.contype = 'p'
    LEFT JOIN pg_attrdef ad ON a.attrelid = ad.adrelid
        AND a.attnum = ad.adnum
WHERE c1.relnamespace = 'public'::regnamespace::oid
    AND c1.relkind = ANY (ARRAY['p', 'r', 'v'])
    AND a.attnum > 0
    AND (pg_has_role(c1.relowner, 'USAGE'::text)
        OR has_column_privilege(c1.oid, a.attnum, 'SELECT, INSERT, UPDATE, REFERENCES'::text))
ORDER BY table_name, a.attnum
We can compare the results via the following commands:
$ psql postgres://example_user:example_password@localhost:15432/example_db -t -A < current_query.sql > current_query_results
$ psql postgres://example_user:example_password@localhost:15432/example_db -t -A < new_query.sql > new_query_results
$ diff -s current_query_results new_query_results
Files current_query_results and new_query_results are identical
Now that we know that both produce the same result, let's check the performance.
Performance analysis
First, the performance for the current query:
$ hyperfine 'psql postgres://example_user:example_password@localhost:15432/example_db -t -A < current_query.sql'
Benchmark 1: psql postgres://example_user:example_password@localhost:15432/example_db -t -A < current_query.sql
Time (mean ± σ): 502.3 ms ± 17.4 ms [User: 39.3 ms, System: 9.8 ms]
Range (min … max): 472.7 ms … 527.5 ms 10 runs
And the result for the new query:
$ hyperfine 'psql postgres://example_user:example_password@localhost:15432/example_db -t -A < new_query.sql'
Benchmark 1: psql postgres://example_user:example_password@localhost:15432/example_db -t -A < new_query.sql
Time (mean ± σ): 74.5 ms ± 16.2 ms [User: 37.1 ms, System: 5.7 ms]
Range (min … max): 54.4 ms … 115.9 ms 43 runs
Do note that if we create the database with 400 tables instead of 200 (by setting number_of_tables_to_create to 400 in init.sql), the results change greatly:
$ hyperfine 'psql postgres://example_user:example_password@localhost:15432/example_db -t -A < current_query.sql'
Benchmark 1: psql postgres://example_user:example_password@localhost:15432/example_db -t -A < current_query.sql
Time (mean ± σ): 2.258 s ± 0.457 s [User: 0.038 s, System: 0.008 s]
Range (min … max): 1.683 s … 2.679 s 10 runs
$ hyperfine 'psql postgres://example_user:example_password@localhost:15432/example_db -t -A < new_query.sql'
Benchmark 1: psql postgres://example_user:example_password@localhost:15432/example_db -t -A < new_query.sql
Time (mean ± σ): 95.6 ms ± 21.2 ms [User: 40.4 ms, System: 9.6 ms]
Range (min … max): 67.5 ms … 130.4 ms 22 runs
There is a big jump in the current query's runtime, but not for the new proposal.
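For reference, dividing the hyperfine means reported above gives roughly a 6.7× speedup at 200 tables and about 23× at 400:

```shell
# Speedup ratios computed from the hyperfine means above (milliseconds).
awk 'BEGIN { printf "%.1f\n", 502.3 / 74.5 }'   # 200 tables: prints 6.7
awk 'BEGIN { printf "%.1f\n", 2258 / 95.6 }'    # 400 tables: prints 23.6
```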