
citus_move_shard_placement() uses lots of memory on the coordinator

[Open] pingw33n opened this issue 4 months ago · 2 comments

On an otherwise idle Citus cluster, a single citus_move_shard_placement() call moving a shard from one worker to another (neither being the coordinator) peaks at 70 GB of memory usage on the coordinator. On a non-idle cluster we were unable to get a single citus_move_shard_placement() call to complete at all due to OOM kills. Below is the memory usage reported by Kubernetes while citus_move_shard_placement() was running.

[Image: coordinator memory usage graph while citus_move_shard_placement() was running]

This is somewhat surprising and looks like a leak since in this case the coordinator is not expected to take part in shard movement beyond orchestrating it.
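
For context, the call shape was roughly the following; the shard id here is a placeholder, and the node names/ports match the topology listed below:

SELECT citus_move_shard_placement(
    102008,           -- placeholder shard id
    'worker-1', 5432, -- source node
    'worker-2', 5432, -- target node
    shard_transfer_mode := 'auto');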

Postgres version:

PostgreSQL 15.13 (Debian 15.13-1.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit

Citus version:

Citus 13.1.0 on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit

Settings:

|name                                   |setting|unit|
|---------------------------------------|-------|----|
|autovacuum_work_mem                    |1048576|kB  |
|citus.background_task_queue_interval   |5000   |ms  |
|citus.defer_shard_delete_interval      |15000  |ms  |
|citus.max_cached_connection_lifetime   |600000 |ms  |
|citus.max_intermediate_result_size     |-1     |kB  |
|citus.max_matview_size_to_auto_recreate|1024   |MB  |
|citus.node_connection_timeout          |30000  |ms  |
|citus.recover_2pc_interval             |60000  |ms  |
|citus.remote_task_check_interval       |10     |ms  |
|effective_cache_size                   |9437184|8kB |
|logical_decoding_work_mem              |65536  |kB  |
|maintenance_work_mem                   |2097152|kB  |
|min_dynamic_shared_memory              |0      |MB  |
|shared_buffers                         |1048576|8kB |
|shared_memory_size                     |8543   |MB  |
|work_mem                               |524288 |kB  |
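
For reference, the listing above can be reproduced with a query along these lines (a sketch; the exact set of non-Citus GUCs included in the filter is an assumption):

SELECT name, setting, unit
FROM pg_settings
WHERE name LIKE 'citus.%'
   OR name IN ('work_mem', 'maintenance_work_mem', 'shared_buffers',
               'effective_cache_size', 'autovacuum_work_mem',
               'logical_decoding_work_mem', 'min_dynamic_shared_memory',
               'shared_memory_size')
ORDER BY name;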

Cluster topology:

|nodeid|groupid|nodename   |nodeport|noderack|hasmetadata|isactive|noderole|nodecluster|metadatasynced|shouldhaveshards|
|------|-------|-----------|--------|--------|-----------|--------|--------|-----------|--------------|----------------|
|1     |0      |coordinator|5432    |default |true       |true    |primary |default    |true          |false           |
|2     |1      |worker-1   |5432    |default |true       |true    |primary |default    |true          |true            |
|3     |2      |worker-2   |5432    |default |true       |true    |primary |default    |true          |true            |
|4     |3      |worker-3   |5432    |default |true       |true    |primary |default    |true          |true            |
|5     |4      |worker-4   |5432    |default |true       |true    |primary |default    |true          |true            |
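
This listing presumably comes from the Citus node metadata table, i.e. something like:

SELECT * FROM pg_dist_node ORDER BY nodeid;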

The Citus cluster runs on bare-metal Kubernetes. The coordinator's memory request/limit is 128 GB/136 GB respectively, and the Kubernetes node running the coordinator has 1 TB of physical RAM.

The database is about 80 TB, distributed across 16 shards per table, and almost all of the tables are colocated. The tables are themselves partitioned, so there are about 120,000 shards overall.
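
For scale, those counts can be cross-checked against the Citus metadata (a sketch, assuming the standard catalog tables):

SELECT count(*) AS shard_count     FROM pg_dist_shard;     -- ~120,000 expected here
SELECT count(*) AS placement_count FROM pg_dist_placement;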

pingw33n · Aug 22 '25

Could you share the outputs of the following:

SHOW citus.max_background_task_executors_per_node;
SHOW citus.metadata_sync_mode;

SELECT logicalrelid::regclass AS "table",
       partmethod,   -- 'n' = reference/local, 'h' = hash-distributed
       colocationid, -- same id ⇒ same shard map
       repmodel
FROM pg_dist_partition
ORDER BY 1;

SELECT * FROM get_rebalance_table_shards_plan('<tbl_name>');

Also check whether rebalance_table_shards() behaves differently (see the sketch below).
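
A minimal sketch of that check, with a hypothetical table name:

SELECT * FROM get_rebalance_table_shards_plan('my_table');                -- dry run: planned moves only
SELECT rebalance_table_shards('my_table', shard_transfer_mode := 'auto'); -- actually performs the moves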

Could you also share the exact command and any optional parameters you used when calling citus_move_shard_placement()?

ihalatci · Oct 02 '25

Unfortunately, this environment is long gone, so I can't provide many details.

SHOW citus.max_background_task_executors_per_node;
SHOW citus.metadata_sync_mode;

I believe these had default values.

Could you also share the exact command and any optional parameters you used when calling citus_move_shard_placement()?

There is one optional parameter, shard_transfer_mode, which was set to auto; it used logical replication for the transfer.
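
For reference, with shard_transfer_mode := 'auto' Citus picks logical replication when the table supports it. A hypothetical variant of the same call that avoids logical replication entirely (placeholder shard id and nodes):

SELECT citus_move_shard_placement(
    102008,           -- placeholder shard id
    'worker-1', 5432,
    'worker-2', 5432,
    shard_transfer_mode := 'block_writes');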

pingw33n · Oct 06 '25