repmgr: Primary node outage shorter than monitor_interval_secs causes standby repmgrd issues

With a simple two-node primary->standby replication cluster, I have run into an issue where if the primary node very briefly goes down, the standby repmgrd instance's connection to the primary gets into a bad state and can't recover, requiring repmgrd to be restarted.

Specifically, if the primary node goes down (in a way that causes the existing connections to be broken) and is back up before the next iteration of the main while loop in monitor_streaming_standby happens, the logic to check the upstream connection will not properly detect that both the upstream_conn and primary_conn variables need to be updated.

For the setup I'm talking about, both upstream_conn and primary_conn will enter the while loop with the same value (as the upstream is the primary). I tried to trace through the logic:

In repmgrd-physical.c, the check_upstream_connection call attempts to verify that upstream_conn is still valid.
In repmgrd.c, the behavior varies based on connection_check_type:
- In ping and connection modes, the conninfo passed in is used to check server availability, which succeeds because the server is up.
- In query mode, the existing connection is tested, which fails. However, the check then creates a new connection and retries, which succeeds. This updates upstream_conn but not primary_conn.
In all cases, the method returns true, which skips the else logic that would attempt a reconnect.
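
To make the query-mode case concrete, here is a minimal, self-contained C model of it. It does not use libpq or any of repmgr's real code: mock_conn, mock_connect and check_upstream_query_mode are hypothetical stand-ins, and the alive flag is an assumed substitute for a real connection status check. It only illustrates how two pointers that start out aliased can end up out of sync.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a database connection handle. */
typedef struct mock_conn {
    bool alive;   /* false once the server side has dropped the session */
} mock_conn;

/* Assumed helper: "connects" to a server that is currently up. */
static mock_conn *mock_connect(void)
{
    mock_conn *conn = malloc(sizeof *conn);

    if (conn != NULL)
        conn->alive = true;
    return conn;
}

/*
 * Models the query-mode behavior described above: test the existing
 * connection and, if it is dead, open a replacement and retry.
 * Only the pointer passed in is updated; any other variable holding
 * the old handle is left untouched.
 */
static bool check_upstream_query_mode(mock_conn **conn)
{
    assert(conn != NULL && *conn != NULL);

    if ((*conn)->alive)
        return true;

    mock_conn *fresh = mock_connect();

    if (fresh == NULL)
        return false;

    /* The old handle is not freed here because another variable
     * may still point at it, which is exactly the problem. */
    *conn = fresh;
    return true;
}
```

If upstream_conn and primary_conn both point at the same mock_conn, marking it dead and then calling check_upstream_query_mode(&upstream_conn) returns true and gives upstream_conn a live handle, while primary_conn still points at the dead one.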

After that point, assuming both primary/standby are healthy, the remaining logic in the loop may behave unexpectedly.

This first came up for me with monitoring_history enabled: the block of code that writes that history repeatedly fails because primary_conn is broken and is never reconnected. That manifests as the "monitoring_history requested but primary connection not available" warning showing up repeatedly in the logs:

repmgrd: [2020-04-17 11:51:03] [WARNING] monitoring_history requested but primary connection not available
repmgrd: [2020-04-17 11:51:05] [WARNING] monitoring_history requested but primary connection not available
repmgrd: [2020-04-17 11:51:07] [WARNING] monitoring_history requested but primary connection not available
repmgrd: [2020-04-17 11:51:09] [WARNING] monitoring_history requested but primary connection not available
repmgrd: [2020-04-17 11:51:11] [WARNING] monitoring_history requested but primary connection not available
repmgrd: [2020-04-17 11:51:13] [WARNING] monitoring_history requested but primary connection not available

Restarting repmgrd on the standby will cause it to start the whole process over, and since the primary is actually reachable everything works as expected from there.

I'm happy to contribute a patch to resolve this if needed; I just wanted to write up an issue first to make sure that this is really a bug and I'm not just missing something.

Apr 18 '20 00:04 michaelvirag

Thanks for the report; we have confirmed/reproduced the issue and provided a fix. If you'd like to test, we can provide snapshot packages.

May 14 '20 01:05 ibarwick

Appreciate the response. I believe the patch resolves one aspect of the issue, but I don't think it addresses upstream_conn and primary_conn getting out of sync as a result of the re-created connection. I put a comment on the commit itself and wanted to also post here in case that doesn't trigger notifications (apologies for the double notification if it does).

May 18 '20 23:05 michaelvirag

Just noticed the latest round of changes, it looks good to me.

If you want to make snapshot packages available for testing, I'd appreciate it! Otherwise I can just build them on this end for internal testing.

May 27 '20 21:05 michaelvirag

We've created snapshot packages, which can be installed as described here: https://repmgr.org/docs/current/packages-snapshot.html

Thanks in advance for any feedback!

Jun 01 '20 05:06 ibarwick

Thanks for making the snapshot packages. Was able to test this morning and it looks like the issue has been resolved!

Replicated the previously failing scenario and the connection was re-created successfully. When running at the DEBUG log level, saw the expected log messages ("upstream is available but upstream connection has gone away, resetting" and "resetting paired connection").
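
For anyone reading along, the "resetting paired connection" behavior can be pictured with the small C model below. This is only a guess at the general shape of such a fix, not the actual commit: mock_conn, mock_connect and reset_connection are hypothetical names, and the alive flag is an assumed substitute for a real connection status check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a database connection handle. */
typedef struct mock_conn {
    bool alive;   /* false once the server side has dropped the session */
} mock_conn;

/* Assumed helper: "connects" to a server that is currently up. */
static mock_conn *mock_connect(void)
{
    mock_conn *conn = malloc(sizeof *conn);

    if (conn != NULL)
        conn->alive = true;
    return conn;
}

/*
 * Replaces *conn if it has gone away. If *paired_conn referred to the
 * same stale handle, it is pointed at the replacement too, so the two
 * variables stay in sync.
 */
static bool reset_connection(mock_conn **conn, mock_conn **paired_conn)
{
    assert(conn != NULL && *conn != NULL);

    if ((*conn)->alive)
        return true;

    mock_conn *fresh = mock_connect();

    if (fresh == NULL)
        return false;

    mock_conn *stale = *conn;

    /* Corresponds to the idea behind "resetting paired connection". */
    if (paired_conn != NULL && *paired_conn == stale)
        *paired_conn = fresh;

    *conn = fresh;
    free(stale);   /* safe now: no variable still points at it */
    return true;
}
```

With upstream_conn and primary_conn aliased and the shared handle marked dead, reset_connection(&upstream_conn, &primary_conn) leaves both variables pointing at the same live handle.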

Looks good to me.

Jun 02 '20 17:06 michaelvirag
