citus icon indicating copy to clipboard operation
citus copied to clipboard

Improve performance of creating a distributed table

Open thanodnl opened this issue 5 years ago • 1 comments

Creating distributed tables have an inefficiency that causes degradation of performance when shards counts increase. The performance seems to have O(n^2) characteristics.

Shard creation is easily assumed a non performance critical workload. In a lot of cases the creation of shards is done only once with a limited amount of shards, often having data in them which will dominate the time spent in distributing a table.

However with bigger installations we see higher shard counts (n from above) causing the problem to become more noticeable. If this were non-empty tables requiring data transfer the shard creation time would stay negligible, however more and more often we see postgres partition to be used. This increases the problems in two directions;

  • Tables created are mostly empty
  • Tables are created during normal operation of the database (eg. pg_partman)

Postgres made a big improvement over the years in the area of partitioned tables. In PG11 a heavy lock (ACCESS EXCLUSIVE) is acquired on the parent table. Conflicting with any production workload on the table. In PG12 this has been reduced to a SHARE UPDATE EXCLUSIVE which allows reads and write to continue and only blocks DDL. Unfortunately not all customers have the ability to upgrade (eg. Hyperscale is still on pg11).

It is a common pattern to have hourly partitions in a distributed table if data sizes become big. There have been observed workloads where the shard creation cause a 3 minute read/write block every hour to allow attachment of new partitions. Most of that time is spent in acquiring the lock.

In such cases a solution could be to create shards less often and more shards at a time. This could reduce the lock contention from hourly to weekly, but now shard creation inefficiencies become a measurable thing.

A local experiment running 10 workers with 400 shards could take 4-5 minutes to create a week worth of postgres partitions. Example below (requires pg_partman)

CREATE SCHEMA partman;
CREATE EXTENSION pg_partman SCHEMA partman;

SET citus.shard_count TO 400;

CREATE TABLE github_events
(
    event_id bigint,
    event_type text,
    event_public boolean,
    repo_id bigint,
    payload jsonb,
    repo jsonb,
    user_id bigint,
    org jsonb,
    created_at timestamp
) PARTITION BY RANGE (created_at);

SELECT partman.create_parent('public.github_events', 'created_at', 'native', 'hourly');

SELECT create_distributed_table('github_events', 'user_id');

DO $$
DECLARE
	partition_time timestamptz := '2020-05-13 12:00:00'::timestamptz;
	i int;
BEGIN
    FOR i IN SELECT generate_series(1, 24*7)
    LOOP
    	partition_time := partition_time + '1hour'::interval;
        PERFORM partman.create_partition_time('public.github_events', array[partition_time]);
    END LOOP;
END$$;

Runtime of the do block on my system:

Time: 295109.593 ms (04:55.110)

The inefficiencies are believed to be caused by repeatedly calling InsertShardPlacementRow and LoadShardPlacement. The first one inserts 1 row to describe a new call and issue a cache invalidation. The second one reads cache object for the newly inserted row. Due to the previous cache invalidation the cache needs to be rebuild every time, and with every insert the cache to rebuild becomes bigger.

A prototype that batches the inserts before flushing the cache showed great improvement for tables that create 60k+ shards https://github.com/citusdata/citus/compare/moonshot/60k-shards . Before the changes it was simply impossible to create that many shards.

With this patch the runtime above got reduced to Time: 256347.272 ms (04:16.347). In a bigger environment where not all nodes contend for the same resources the improvement might be more substantial.

thanodnl avatar May 15 '20 12:05 thanodnl

A huge memory usage with high shards:

MetadataCacheMemoryContext: 5697978576 total in 16380 blocks; 3140512 free (399 chunks); 5694838064 used

SaitTalhaNisanci avatar Mar 16 '21 14:03 SaitTalhaNisanci