Primary node outage shorter than monitor_interval_secs causes standby repmgrd issues
With a simple two-node primary->standby replication cluster, I have run into an issue where if the primary node very briefly goes down, the standby repmgrd instance's connection to the primary gets into a bad state and can't recover, requiring repmgrd to be restarted.
Specifically, if the primary node goes down (in a way that causes the existing connections to be broken) and is back up before the next iteration of the main while loop in monitor_streaming_standby happens, the logic to check the upstream connection will not properly detect that both the upstream_conn and primary_conn variables need to be updated.
For the setup I'm talking about, both `upstream_conn` and `primary_conn` will enter the `while` loop with the same value (as the upstream is the primary). I tried to trace through the logic:
- In repmgrd-physical.c, the `check_upstream_connection` call will attempt to verify `upstream_conn` is still valid
- In repmgrd.c, the behavior varies based on `connection_check_type`:
  - In `ping` and `connection` modes, the `conninfo` passed in is used to check server availability, which will succeed as the server is up.
  - In `query` mode, the existing connection is tested, which will fail. However, the query creates a new connection and retries, which succeeds. This will update `upstream_conn`, but this does not update `primary_conn` as well
- In all cases, the method returns `true`, which skips the `else` logic to attempt a reconnect.
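To make the trace above easier to follow, here is a minimal, self-contained C model of it. It is only a sketch: `FakeConn`, `MonitorState`, `check_upstream` and `primary_conn_usable` are hypothetical stand-ins invented for illustration, not repmgr's actual types or functions, and no libpq calls are involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a database connection handle. */
typedef struct {
    bool alive;   /* false once the server has dropped the session */
} FakeConn;

typedef enum { CHECK_PING, CHECK_CONNECTION, CHECK_QUERY } CheckType;

/* The two handles held by the monitoring loop. In a two-node cluster
 * both start out pointing at the same session. */
typedef struct {
    FakeConn *upstream_conn;
    FakeConn *primary_conn;
} MonitorState;

/*
 * Simplified model of the availability check (assumed behavior, based
 * only on the description in this issue).
 *   server_up: whether the server accepts new connections right now
 *   fresh:     a newly opened session, used if query mode reconnects
 */
static bool
check_upstream(MonitorState *st, CheckType type, bool server_up, FakeConn *fresh)
{
    assert(st != NULL && st->upstream_conn != NULL);

    /* ping/connection modes: the existing handle is never examined */
    if (type == CHECK_PING || type == CHECK_CONNECTION)
        return server_up;

    /* query mode: test the existing handle first */
    if (st->upstream_conn->alive)
        return true;
    if (!server_up)
        return false;

    /* retry on a new session; only upstream_conn is replaced */
    st->upstream_conn = fresh;
    return true;
}

/* What the monitoring_history block effectively needs to be true. */
static bool
primary_conn_usable(const MonitorState *st)
{
    return st->primary_conn != NULL && st->primary_conn->alive;
}
```

In this model, starting with both pointers referring to one dead session and `server_up` set to true, every check type returns `true`, yet `primary_conn_usable` stays false afterwards, which matches the behavior described above.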
After that point, assuming both primary/standby are healthy, the remaining logic in the loop may behave unexpectedly.
This first came up for me with monitoring_history enabled: the block of code that writes that history repeatedly fails because primary_conn is broken and is never reconnected. That manifests as the "monitoring_history requested but primary connection not available" warning showing up repeatedly in the logs:
repmgrd: [2020-04-17 11:51:03] [WARNING] monitoring_history requested but primary connection not available
repmgrd: [2020-04-17 11:51:05] [WARNING] monitoring_history requested but primary connection not available
repmgrd: [2020-04-17 11:51:07] [WARNING] monitoring_history requested but primary connection not available
repmgrd: [2020-04-17 11:51:09] [WARNING] monitoring_history requested but primary connection not available
repmgrd: [2020-04-17 11:51:11] [WARNING] monitoring_history requested but primary connection not available
repmgrd: [2020-04-17 11:51:13] [WARNING] monitoring_history requested but primary connection not available
Restarting repmgrd on the standby will cause it to start the whole process over, and since the primary is actually reachable everything works as expected from there.
I'm happy to contribute a patch to resolve this if needed; I just wanted to write up an issue first to make sure that this is really a bug and I'm not just missing something.
Thanks for the report; we have confirmed/reproduced the issue and provided a fix. If you'd like to test, we can provide snapshot packages.
Appreciate the response. I believe the patch resolves one aspect of the issue, but I don't think it solves the issue where as a result of the re-created connection, upstream_conn and primary_conn get out of sync. I put a comment on the commit itself, wanted to also post here in case that doesn't trigger notifications (apologies for 2x notification if it does).
Just noticed the latest round of changes, it looks good to me.
If you wanted to make snapshot packages available for testing I'd appreciate it! Otherwise can just create it on this end for internal testing.
We've created snapshot packages, which can be installed as described here: https://repmgr.org/docs/current/packages-snapshot.html
Thanks in advance for any feedback!
Thanks for making the snapshot packages. Was able to test this morning and it looks like the issue has been resolved!
Replicated the previously failing scenario and the connection was re-created successfully. When running at the DEBUG log level, saw the expected log messages ("upstream is available but upstream connection has gone away, resetting" and "resetting paired connection").
Looks good to me.
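For readers following along, the behavior implied by the two DEBUG messages quoted above can be modeled roughly as follows. This is a sketch under assumptions, not the actual patch: `FakeConn` and `reset_if_stale` are hypothetical names, and only the two log strings are taken from this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a database connection handle. */
typedef struct {
    bool alive;
} FakeConn;

/*
 * Called once the upstream server is known to be reachable.
 * If *conn is dead, swap in the fresh session; if *paired_conn was
 * sharing that same dead session, swap it too so the two handles
 * cannot drift out of sync.
 */
static void
reset_if_stale(FakeConn **conn, FakeConn **paired_conn, FakeConn *fresh)
{
    assert(conn != NULL && *conn != NULL && fresh != NULL);

    if ((*conn)->alive)
        return;   /* nothing to do, existing session still works */

    /* remember whether the pair shared the stale session before replacing it */
    bool shared = (paired_conn != NULL && *paired_conn == *conn);

    fprintf(stderr, "upstream is available but upstream connection has gone away, resetting\n");
    *conn = fresh;

    if (shared) {
        fprintf(stderr, "resetting paired connection\n");
        *paired_conn = fresh;
    }
}
```

The point of checking `shared` before the swap is that, in the two-node case, `upstream_conn` and `primary_conn` refer to the same session, so replacing one without the other reintroduces the original problem.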