citus icon indicating copy to clipboard operation
citus copied to clipboard

The citus cluster has collapsed

Open YMikheev opened this issue 1 year ago • 9 comments

For an unknown reason, the cluster stopped working. His condition is: Coordinator: nodeid groupid nodename hasmetadata isactive metadatasynced shouldhaveshards 54 11 worker21 TRUE TRUE FALSE TRUE 51 12 worker23 TRUE TRUE TRUE TRUE

pg_dist_node is locked by ExclusiveLock with process without pid (because of that it is not possible to do anything with metadata: meta sync for example ) It is advisory lock - I do not now how to kill that lock ?

Worker21: nodeid groupid nodename isactive metadatasynced shouldhaveshards 54 11 worker21 TRUE TRUE TRUE 51 12 worker22 TRUE TRUE TRUE

the worker22 is not active (existing) worker. postgres log: WARNING: connection to the remote node worker22:**** failed with the following error: server closed the connection unexpectedly

Worker23: nodeid groupid nodename isactive metadatasynced shouldhaveshards 54 11 worker21 TRUE FALSE TRUE 51 12 worker23 TRUE TRUE TRUE

It is possible build the cluster?

YMikheev avatar Aug 02 '24 16:08 YMikheev

Can you see the locks following https://wiki.postgresql.org/wiki/Lock_Monitoring instructions?

emelsimsek avatar Aug 06 '24 06:08 emelsimsek

if you try to synchronize metadata: SELECT start_metadata_sync_to_all_nodes();

select relation::regclass, * from pg_locks where not granted;

relation locktype database relation-2 page tuple virtualxid transactionid classid objid objsubid virtualtransaction pid mode granted fastpath waitstart pg_dist_node relation 16385 16546 NULL NULL NULL NULL NULL NULL NULL 9/4571 27664 RowShareLock FALSE FALSE 2024-08-06 16:54:35.652722+00

because of advisory lock is granted: Screenshot 2024-08-06 at 19 51 16

YMikheev avatar Aug 06 '24 17:08 YMikheev

You can try to use this query https://wiki.postgresql.org/wiki/Lock_Monitoring#:~:text=information%20in%20pg_stat_activity-,%D0%A1ombination%20of%20blocked%20and%20blocking%20activity,-The%20following%20query

to find the blocking pid.

emelsimsek avatar Aug 12 '24 09:08 emelsimsek

I killed the prepared transaction and now see: coordinator: the process 9420 "Citus Metadata Sync Daemon" is blocking pg_dist_node workers 0 and 3 (they have metadatasynced = false), the process has started: Picture1

after some days nothing happens: pg_dist_node is locked (on coordinator), worker 0 and 3 wait event "SyncRep"

YMikheev avatar Aug 26 '24 15:08 YMikheev

I fixed worker 0 and 3.

YMikheev avatar Aug 27 '24 18:08 YMikheev

there is still a problem with the coordinator. the process "Citus Metadata Sync Daemon" is blocking pg_dist_node pg_dist_node has metadatasynced = false Screenshot 2024-08-27 at 21 48 55

when I try to kill a process "Citus Metadata Sync Daemon", it restarts and continues to block pg_dist_node.

It is possible to fix this problem?

YMikheev avatar Aug 27 '24 18:08 YMikheev

Hello @emelsimsek.

Can you help me? It is very critical for us.

YMikheev avatar Sep 02 '24 12:09 YMikheev

the worker22 is not active (existing) worker. postgres log: WARNING: connection to the remote node worker22:**** failed with the following error: server closed the connection unexpectedly

You probably want to fix that so citus can proceed...

c2main avatar Sep 12 '24 18:09 c2main

Yes. Now we have one problem: the "Citus Metadata Sync Daemon" is blocking pg_dist_node.

YMikheev avatar Sep 13 '24 16:09 YMikheev