lag check during switchover

Open Tiago-Anastacio opened this issue 7 years ago • 3 comments

Hello,

I have the following issue about lag check during switchover:

Configuration: RHEL 7, repmgr 4.0.6, PostgreSQL 9.5, one master and one standby in asynchronous mode, plus a restore_command that uses ssh to fetch archived xlogs.

Use case: I had a major network outage. During it, wal_receiver_timeout was hit and the archiver crashed. Because of the archiver crash, PostgreSQL said it would restart, but it failed to do so, and there are no logs about the restart (this is mainly a PostgreSQL matter and I don't know why it happened). Once the network outage was resolved and the standby was started manually, the primary and standby could not synchronize, because the xlogs had been removed from pg_xlog and the archived xlogs had been deleted from their directory.

The issue is: repmgr standby switchover did not detect that situation when it checked the lag. It reports OK while replication is broken.

I understand that on the primary, repmgr checks the archive lag, and that was OK. But on the standby, lag is computed this way: select pg_catalog.clock_timestamp() - pg_catalog.pg_last_xact_replay_timestamp(); except when pg_catalog.pg_last_xlog_receive_location() = pg_catalog.pg_last_xlog_replay_location(), in which case no lag is reported. We are in that last case, because nothing has been replicated for a long time, so the received and replayed locations are identical.
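To make the failure mode concrete, here is a minimal Python sketch of the rule as described in this post. It is a model of the logic only, not repmgr's actual code; the function name `standby_lag` and the sample LSN values are made up for illustration.

```python
from datetime import datetime, timedelta


def standby_lag(receive_lsn, replay_lsn, last_replay_ts, now):
    """Model of the standby lag rule described above (hypothetical helper)."""
    # If everything that was received has been replayed, the lag is
    # reported as zero, no matter how long ago the last WAL arrived.
    if receive_lsn == replay_lsn:
        return timedelta(0)
    # Otherwise the lag is the time since the last replayed transaction.
    return now - last_replay_ts


now = datetime(2018, 12, 21, 11, 0, 0)

# A standby that stopped receiving WAL two days ago, but had replayed
# everything it received before the outage.
stalled = standby_lag("0/3000060", "0/3000060", now - timedelta(days=2), now)

# A standby that is connected and slightly behind.
behind = standby_lag("0/3000100", "0/3000060", now - timedelta(seconds=30), now)

print(stalled, behind)
```

Under this rule the stalled standby looks healthier than the one that is actually streaming, which is why a lag check alone cannot tell "caught up" from "disconnected".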

Maybe querying pg_stat_replication through ssh on the primary (or, from PostgreSQL 9.6, pg_stat_wal_receiver on the standby) would fix that?
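As a rough sketch of that idea, the check below looks for a streaming connection from the standby among rows shaped like the output of `SELECT application_name, state FROM pg_stat_replication` on the primary. The helper name, the sample rows, and the assumption that the standby's application_name is the node name `node2` are all hypothetical; how the rows are fetched (ssh, a direct connection) is left out.

```python
def standby_is_attached(replication_rows, standby_name):
    """Return True if the primary reports a streaming connection
    from the named standby (hypothetical helper)."""
    return any(
        row.get("application_name") == standby_name
        and row.get("state") == "streaming"
        for row in replication_rows
    )


# Hypothetical rows, shaped like pg_stat_replication output on the primary.
healthy_rows = [{"application_name": "node2", "state": "streaming"}]
broken_rows = []  # the standby is not connected, so the view has no row for it

attached_when_healthy = standby_is_attached(healthy_rows, "node2")
attached_when_broken = standby_is_attached(broken_rows, "node2")
print(attached_when_healthy, attached_when_broken)
```

A switchover precheck could then treat a lag result as valid only when the standby is also attached.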

Thank you

Tiago-Anastacio, Dec 21 '18 11:12

Thanks for the report. I've committed a change which verifies there is a replication connection, which should catch this issue in future. It will be included in the upcoming 4.3 release.

ibarwick, Feb 01 '19 06:02

Great news, thank you.

Tiago-Anastacio, Feb 01 '19 10:02

Actually, I have two remarks:

1 - Does this apply to repmgr standby switchover --dry-run as well?
2 - Could this check also be added to repmgr node check and repmgr node status? We have the same issue there.

Thank you.

Tiago-Anastacio, Feb 04 '19 09:02