
2 nodes + witness = 3 data centers (problem case detected)

Open sgrinko opened this issue 1 year ago • 2 comments

Hi, and thank you, developers, for your work!

Now about the problem :)

There are 3 DCs:

  • 2 of them host the data nodes (primary and secondary)
  • the 3rd DC hosts the witness

In the current configuration, replication between the two nodes is synchronous.

If the connection between the two nodes breaks while the witness can still reach each node in the other DCs, synchronous replication is not disabled automatically. As a result, transactions hang on COMMIT. We cannot wait until enough lag accumulates for the witness to respond.

Is it possible for pg_auto_failover to react to this kind of network failure?

sgrinko, Jun 20 '23 10:06

Good afternoon, developers.

This issue is critical for us: it currently prevents us from placing the monitor in the third data center.

Perhaps, as an option, the keeper could check on the primary data node for replicas connected to it and, if there are none, report this to the monitor and switch replication from synchronous to asynchronous mode.

xinferum, Jun 20 '23 10:06

It should be possible to see a missing row in pg_stat_replication on the primary node and assign wait_primary from there.
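
To make the idea concrete, here is a rough Python sketch of that check. It is only an illustration, not how pg_auto_failover implements anything: `ReplicationRow`, `has_streaming_standby` and `suggest_state` are hypothetical names, and the rule "no streaming row means wait_primary" is an assumption about what the check could look like. Only the `pg_stat_replication` view and its `application_name`, `state` and `sync_state` columns come from PostgreSQL itself.

```python
from dataclasses import dataclass
from typing import Iterable

# Query a keeper-like process could run on the primary.
# The view and these three columns exist in PostgreSQL.
REPLICATION_QUERY = (
    "SELECT application_name, state, sync_state FROM pg_stat_replication"
)


@dataclass(frozen=True)
class ReplicationRow:
    """One row of pg_stat_replication (hypothetical container type)."""
    application_name: str
    state: str       # e.g. "streaming", "catchup"
    sync_state: str  # e.g. "sync", "quorum", "async"


def has_streaming_standby(rows: Iterable[ReplicationRow]) -> bool:
    """Return True if at least one standby is currently streaming."""
    return any(row.state == "streaming" for row in rows)


def suggest_state(current_state: str, rows: Iterable[ReplicationRow]) -> str:
    """Suggest the state to report for the primary (assumed rule).

    If the node is "primary" but no standby row is streaming, suggest
    "wait_primary", so that commits no longer wait for the standby.
    Any other state is returned unchanged.
    """
    if current_state == "primary" and not has_streaming_standby(rows):
        return "wait_primary"
    return current_state
```

In a real deployment the rows would come from running `REPLICATION_QUERY` on the primary; the sketch takes them as plain values so the decision logic can be read and tested without a database.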

dimitri, Sep 25 '23 17:09