The citus cluster has collapsed
For an unknown reason, the cluster stopped working. His condition is: Coordinator: nodeid groupid nodename hasmetadata isactive metadatasynced shouldhaveshards 54 11 worker21 TRUE TRUE FALSE TRUE 51 12 worker23 TRUE TRUE TRUE TRUE
pg_dist_node is locked by ExclusiveLock with process without pid (because of that it is not possible to do anything with metadata: meta sync for example ) It is advisory lock - I do not now how to kill that lock ?
Worker21: nodeid groupid nodename isactive metadatasynced shouldhaveshards 54 11 worker21 TRUE TRUE TRUE 51 12 worker22 TRUE TRUE TRUE
the worker22 is not active (existing) worker. postgres log: WARNING: connection to the remote node worker22:**** failed with the following error: server closed the connection unexpectedly
Worker23: nodeid groupid nodename isactive metadatasynced shouldhaveshards 54 11 worker21 TRUE FALSE TRUE 51 12 worker23 TRUE TRUE TRUE
It is possible build the cluster?
Can you see the locks following https://wiki.postgresql.org/wiki/Lock_Monitoring instructions?
if you try to synchronize metadata: SELECT start_metadata_sync_to_all_nodes();
select relation::regclass, * from pg_locks where not granted;
relation locktype database relation-2 page tuple virtualxid transactionid classid objid objsubid virtualtransaction pid mode granted fastpath waitstart pg_dist_node relation 16385 16546 NULL NULL NULL NULL NULL NULL NULL 9/4571 27664 RowShareLock FALSE FALSE 2024-08-06 16:54:35.652722+00
because of advisory lock is granted:
You can try to use this query https://wiki.postgresql.org/wiki/Lock_Monitoring#:~:text=information%20in%20pg_stat_activity-,%D0%A1ombination%20of%20blocked%20and%20blocking%20activity,-The%20following%20query
to find the blocking pid.
I killed the prepared transaction and now see:
coordinator: the process 9420 "Citus Metadata Sync Daemon" is blocking pg_dist_node
workers 0 and 3 (they have metadatasynced = false), the process has started:
after some days nothing happens: pg_dist_node is locked (on coordinator), worker 0 and 3 wait event "SyncRep"
I fixed worker 0 and 3.
there is still a problem with the coordinator.
the process "Citus Metadata Sync Daemon" is blocking pg_dist_node
pg_dist_node has metadatasynced = false
when I try to kill a process "Citus Metadata Sync Daemon", it restarts and continues to block pg_dist_node.
It is possible to fix this problem?
Hello @emelsimsek.
Can you help me? It is very critical for us.
the worker22 is not active (existing) worker. postgres log: WARNING: connection to the remote node worker22:**** failed with the following error: server closed the connection unexpectedly
You probably want to fix that so citus can proceed...
Yes. Now we have one problem: the "Citus Metadata Sync Daemon" is blocking pg_dist_node.