
Server failure does not trigger automatic failover

Open rpjeff opened this issue 7 years ago • 11 comments

Platform: CentOS 6
PostgreSQL version: 9.6.9
repmgr version: 4.0.6

All packages were installed via yum/rpm.

There are three nodes: a master, a standby, and a witness. Doing a switchover works. Shutting down the postgresql process via "/etc/init.d/postgresql-9.6 stop" on the master results in the failure being detected and the standby being promoted. However, killing the VM that is running the master results in... nothing. The failure goes undetected.

Everything seems to be configured correctly and, apart from known bugs #451 and #453, seems to be working.

Any help would be appreciated. Thanks in advance.

rpjeff avatar Jun 28 '18 01:06 rpjeff

Can you provide the repmgr.conf and repmgrd log files from the standby?

ibarwick avatar Jun 28 '18 02:06 ibarwick

It worked the first time, so I tried it a second time and it failed. The test procedure was (node001 standby, node006 master, node003 witness):

  1. kill node006 (failover worked)
  2. restart node006
  3. switchover from node001 to node006
  4. kill node006 (not detected)

repmgr.conf:

node_id=35                      # A unique integer greater than zero
node_name='node001'             # An arbitrary (but unique) string; we recommend
conninfo='host=node001.example.net user=repmgr dbname=repmgr port=5432'
data_directory='/var/lib/pgsql/9.6/data/'
witness_sync_interval=5         # interval (in seconds) to synchronise node records
log_level=DEBUG                 # Log level: possible values are DEBUG, INFO, NOTICE,
log_facility=STDERR             # Logging facility: possible values are STDERR, or for
log_file='/var/log/repmgr/repmgr-9.6.log'
pg_bindir='/usr/pgsql-9.6/bin/' # Path to PostgreSQL binary directory (location
pg_ctl_options='-D /var/lib/pgsql/9.6/data'   # Options to append to "pg_ctl"
rsync_options='--progress --rsh="ssh -o "StrictHostKeyChecking no""'
ssh_options='-o "StrictHostKeyChecking no"'   # Options to append to "ssh"
promote_check_timeout=60        # The length of time (in seconds) to wait
promote_check_interval=1        # The interval (in seconds) to check whether
failover=automatic              # one of 'automatic', 'manual'.
reconnect_attempts=7            # Number attempts which will be made to reconnect to an unreachable
reconnect_interval=3            # Interval between attempts to reconnect to an unreachable
promote_command='/usr/pgsql-9.6/bin/repmgr standby promote -f /etc/repmgr/9.6/repmgr.conf'
follow_command='/usr/pgsql-9.6/bin/repmgr standby follow -f /etc/repmgr/9.6/repmgr.conf --upstream-node-id=%n'
service_start_command = 'sudo /usr/local/bin/repmgr96_start_pg.sh'
service_stop_command = 'sudo /usr/local/bin/repmgr96_stop_pg.sh'
service_restart_command = 'sudo /etc/init.d/postgresql-9.6 restart'
service_reload_command = 'sudo /etc/init.d/postgresql-9.6 reload'
service_promote_command = 'sudo /usr/local/bin/repmgr96_promote.sh'

I noticed this entry in the log:

[2018-06-28 02:54:24] [WARNING] unable to create event record: ERROR: cannot execute INSERT in a read-only transaction

repmgr_testing.log

rpjeff avatar Jun 28 '18 03:06 rpjeff

Check the /var/log/repmgr/repmgr-*.log file, as repmgrd is likely no longer monitoring. I have to restart all daemons after any switchover/failover.

The message I get is:

[2018-06-28 16:25:04] [WARNING] unable to connect to upstream node "postgresql01.internal" (node ID: 1)
[2018-06-28 16:25:14] [NOTICE] node has recovered, reconnecting
[2018-06-28 16:25:14] [NOTICE] reconnected to upstream node after 10 seconds
[2018-06-28 16:25:14] [WARNING] unable to create event record:
  ERROR:  cannot execute INSERT in a read-only transaction

Maybe I've got it misconfigured, but it seems to be trying to insert a record into the Standby, failing, and then it stops monitoring.

gclough avatar Jun 29 '18 13:06 gclough

Oh yes @rpjeff, I see we have the same problem. Possibly we have the same misconfiguration, or it's a problem with repmgr. As a workaround I just restart all repmgrd daemons on all servers after any switchover/failover. Unfortunately repmgr cluster crosscheck doesn't alert you to the problem, you have to check in repmgr-*.log.

@ibarwick, is this normal?

gclough avatar Jun 29 '18 13:06 gclough

@gclough Thanks for confirming it's not limited to myself. Maybe we can get someone involved in the repmgr project to comment on this?

rpjeff avatar Jul 03 '18 02:07 rpjeff

The team here are pretty responsive, so I suspect someone will investigate it soon. In the meantime, just ensure you restart the repmgrd daemon after every host swap, whether switchover or failover.

gclough avatar Jul 03 '18 15:07 gclough

It worked the first time, so I tried it a second time and it failed. The test procedure was (node001 standby, node006 master, node003 witness):

  1. kill node006 (failover worked)
  2. restart node006
  3. switchover from node001 to node006
  4. kill node006 (not detected)

During step 3 (the switchover), was repmgrd running? If so, that would explain the issue, as repmgrd on node001 will probably not have noticed the switchover. This is a known issue, and is why the documentation states that repmgrd should not be running on any node while the switchover is being carried out (maybe we should make that more prominent). You can confirm whether that's happened by checking the repmgrd log on node001; assuming the log level is at least INFO, it will probably have lines like

[INFO] monitoring primary node "node001" (node ID: ....) in normal state

even after the switchover, and will not notice that node006 has gone away.

If that's not the case, please provide the repmgrd log from node001 from just before node006 was killed.

In other news, we here at the repmgr team (well, that's basically myself) are working on improved communication with the repmgrd processes to prevent this kind of thing being an issue in the first place, though due to the inevitable discrepancy between available temporal resources and required development time it might take a few months before it's ready. In the meantime I see one smaller change I could make which would mitigate the situation which should make it into the next release.

ibarwick avatar Jul 05 '18 08:07 ibarwick

Thanks @ibarwick, we appreciate your time on this. I must admit that I missed the section where repmgrd must be stopped before doing a switchover.

https://repmgr.org/docs/4.0/repmgr-standby-switchover.html

Execution

Execute with the --dry-run option to test the switchover as far as possible without actually changing the status of either node.

Important: repmgrd should not be active on any nodes while a switchover is being executed. This restriction may be lifted in a later version.

I've been doing my switchovers with repmgrd running everywhere, and then just restarting it on all hosts afterwards. The doc doesn't explain why it needs to be shut down, or any negative consequences of leaving it running... so I hope I'm not doing something incredibly dangerous.
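For what it's worth, the documented ordering can be wrapped in a small script. The sketch below is a hypothetical illustration only: the node names, the repmgrd init script name (`/etc/init.d/repmgrd-9.6`), the ssh loop, and the `DRY_RUN` flag are all assumptions, not taken from this thread. It defaults to printing the commands rather than executing them.

```shell
#!/bin/sh
# Sketch: stop repmgrd everywhere, switch over, then restart repmgrd.
# Node names, conf path, and init script are assumptions; adapt to your cluster.
NODES="node001 node003 node006"
NEW_PRIMARY="node001"
REPMGR_CONF="/etc/repmgr/9.6/repmgr.conf"
DRY_RUN="${DRY_RUN:-1}"   # default to printing; set DRY_RUN=0 to really execute

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

do_switchover() {
    # 1. Stop repmgrd on every node first, per the documentation quoted above.
    for n in $NODES; do
        run ssh "$n" sudo /etc/init.d/repmgrd-9.6 stop
    done
    # 2. Run the switchover from the standby that should become the primary.
    run ssh "$NEW_PRIMARY" repmgr standby switchover -f "$REPMGR_CONF"
    # 3. Restart repmgrd everywhere only after the switchover has completed.
    for n in $NODES; do
        run ssh "$n" sudo /etc/init.d/repmgrd-9.6 start
    done
}

do_switchover
```

Restarting repmgrd last means no daemon can react (or fail to react) mid-switchover, which is exactly the failure mode described in this thread.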

gclough avatar Jul 05 '18 11:07 gclough

Thanks @ibarwick, your help is much appreciated. I'll have to get back to you next week, as that will be my first opportunity to check this. It's still a high priority for me, as it is blocking a database upgrade for us. I'll also discuss it with others here and, assuming this workaround works, see if they want to wait or go ahead. It should come down to whether this behaviour is predictable and easily automated to make it reliable, as none of us are DBAs.

rpjeff avatar Jul 05 '18 23:07 rpjeff

I ran a couple more tests. In the first one I stopped repmgrd as @ibarwick instructed and gracefully stopped the postgresql process. In test 2 I got a little lazy and restarted repmgrd, then stopped the VM that was the postgresql master. This appears to work, but it is painful. I can't think of a safe way to make sure the repmgrd daemon is restarted after each event, as it would be repmgrd trying to restart itself. On a system using systemd I think you can have the process request itself to be restarted, but I'm on CentOS 6 with init.d. Open to suggestions.
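One possible shape for a cron-driven check, rather than repmgrd restarting itself: inspect the repmgrd log for the node it last reported monitoring and compare that against the expected primary. Everything here is a hypothetical sketch, not from this thread: the log path matches the repmgr.conf above, but the node names, the `check_repmgrd` helper, and the commented-out init script are assumptions.

```shell
#!/bin/sh
# Sketch: detect a stale repmgrd from its log. If the most recent
# "monitoring primary node" line names a node that is no longer the
# primary, repmgrd missed the role change and needs a restart.
# Log path, node name, and init script are assumptions; adapt as needed.
LOG="${1:-/var/log/repmgr/repmgr-9.6.log}"
EXPECTED_PRIMARY="${2:-node006}"

check_repmgrd() {
    last=$(grep -o 'monitoring primary node "[^"]*"' "$LOG" 2>/dev/null |
           tail -n 1 | sed 's/.*"\(.*\)"/\1/')
    if [ -n "$last" ] && [ "$last" != "$EXPECTED_PRIMARY" ]; then
        echo "stale: repmgrd still monitoring $last, expected $EXPECTED_PRIMARY"
        # sudo /etc/init.d/repmgrd-9.6 restart   # real restart would go here
    else
        echo "ok: monitoring $last"
    fi
}

check_repmgrd
```

Run from cron on each node, this sidesteps the self-restart problem: cron does the restarting, and repmgrd only has to keep logging.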

Test 1

  1. restart repmgrd on all nodes
  2. check with repmgr show; master is node006
  3. stop postgresql on the master with pg_ctl; fails over to node001
  4. restart postgresql on node001; let node006 sync
  5. shut down repmgrd
  6. run switchover from node006 (the standby node); node006 becomes master
  7. repmgr show is correct on node001 (standby) and node006 (master)
  8. repmgr show is incorrect on node003 (witness)
  9. restart repmgrd on all nodes; repmgr show correct on all nodes (witness still following node001)
  10. force register node003 (witness) to follow node006

Test 2

node001 (standby), node003 (witness), node006 (master); restart repmgrd on all nodes

stop VM node006 (master); the witness spots the failure, then the standby does; fails over to node001 and the witness follows; node001 (primary), node003 (witness), node006 (off)

restart VM node006; node006 syncs and becomes standby

restart repmgrd on all nodes; switchover; node001 (standby), node003 (witness), node006 (primary); witness did not follow (re-register)

restart repmgrd on all nodes; stop VM node006; fails over to node001; node001 (primary), node003 (witness), node006 (off)

restart node006; node006 syncs and becomes standby

restart repmgrd on all nodes; node001 (standby), node003 (witness), node006 (primary); witness did not follow (re-register); restart repmgrd on all nodes

node001 (standby), node003 (witness), node006 (primary); witness follows node006

rpjeff avatar Jul 12 '18 06:07 rpjeff

It occurred to me on the drive home yesterday that I really should have performed a third test that didn't restart repmgrd and didn't do switchovers, as this is the crucial aspect. This works, so the problem was switchovers, as @ibarwick says. It's the failover that was of real concern, as we wouldn't reconfigure the nodes back to the original state after a failover, nor would we think to restart repmgrd. The switchover happens outside of repmgrd, so I don't have to worry about it needing to restart itself. I already have a convenience script for switchover and could add the restart of repmgrd to it.

Test 3

node001 (standby), node003 (witness), node006 (primary); restart repmgrd on all nodes

stop VM node006; failure detected and node001 promoted; node001 (primary), node003 (witness), node006 (off); witness follows node001

restart VM node006; node006 resyncs and follows node001; node001 (primary), node003 (witness), node006 (standby)

stop VM node001; failure detected and node006 promoted; node001 (off), node003 (witness), node006 (primary); witness follows node006

restart VM node001; node001 resyncs and follows node006; node001 (standby), node003 (witness), node006 (primary)

stop VM node006; failure detected and node001 promoted; node001 (primary), node003 (witness), node006 (off); witness follows node001

restart VM node006; node006 resyncs and follows node001; node001 (primary), node003 (witness), node006 (standby)

stop VM node001; failure detected and node006 promoted; node001 (off), node003 (witness), node006 (primary); witness follows node006

restart VM node001; node001 resyncs and follows node006; node001 (standby), node003 (witness), node006 (primary)

rpjeff avatar Jul 13 '18 01:07 rpjeff