repmgr
repmgrd autofailover not working if the primary is down with a file system hang
Hi repmgr team,
We found a bug in the repmgrd process. Whenever the primary database host hangs (unable to perform any DML/DDL operations), the repmgrd process on the HA node keeps running but stops updating its log files. It is stuck.
repmgrd process in sleep state on the standby:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
249678 postgres 20 0 87312 10604 7324 S 0.0 0.0 1334:46 /usr/pgsql-15/bin/repmgrd -f /postgres/admin/pgrepmgr/5304/pgrepmgr_5304.conf --log-level DEBUG --daemonize
When we look at the connections on the primary, the repmgr process (of the standby) is stuck trying to INSERT data into the repmgr.monitoring_history table.
In this situation there is no autofailover. Is this a known issue? How can we make sure that repmgrd performs an automatic failover in such situations?
Thanks, Nikhil
Hi @ibarwick @martinmarques, do you know why repmgrd is not performing the autofailover?
Hi, I had a similar issue some time ago. It resulted in "monitoring_history requested but primary connection not available" entries on the standby while no failover was happening (and repmgrd therefore continued to sleep). Since I started restarting repmgrd regularly, it hasn't happened again. I have not yet tested whether this is still an issue in newer versions. Stephan
Hi @stephan-hahn, newer versions have the same issue, but I don’t see any update from the repmgr team. Is repmgr being actively developed, and are the issues being looked into?
I hope so (and I think so). We use a cluster solution based on repmgr.
Could you work around the problem with a regular restart (e.g. daily)?
Yes, we can work around it with a restart, but as an autofailover solution, repmgr should be able to detect a system hang on the primary and perform a failover.
It is not performing the failover because repmgrd's own connections are hung on the primary.
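To illustrate the point about hung connections, here is a minimal Python sketch of a health check that cannot itself get stuck. It is not how repmgrd is implemented; the names `probe_with_deadline` and `should_fail_over` are hypothetical, and `probe` stands in for any write against the primary (for example the INSERT into repmgr.monitoring_history). The idea is only that each check runs under a deadline, so a probe that never returns is counted as a failure instead of blocking the monitor.

```python
import threading
import time


def probe_with_deadline(probe, timeout_secs):
    """Run probe() in a daemon thread.

    Returns True only if probe() finishes within timeout_secs without raising.
    A probe that is still blocked after the deadline counts as a failure.
    """
    result = {"ok": False}

    def run():
        try:
            probe()
            result["ok"] = True
        except Exception:
            result["ok"] = False

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    worker.join(timeout_secs)  # wait at most timeout_secs, then give up
    return (not worker.is_alive()) and result["ok"]


def should_fail_over(probe, attempts, timeout_secs, interval_secs=0.0):
    """Return True if every one of `attempts` consecutive probes fails or hangs."""
    for _ in range(attempts):
        if probe_with_deadline(probe, timeout_secs):
            return False  # primary answered in time, no failover needed
        time.sleep(interval_secs)
    return True


if __name__ == "__main__":
    hung_write = threading.Event().wait   # hypothetical stand-in: blocks forever
    healthy_write = lambda: None          # hypothetical stand-in: returns at once

    print(should_fail_over(hung_write, attempts=3, timeout_secs=0.1))
    print(should_fail_over(healthy_write, attempts=3, timeout_secs=0.1))
```

With a real database client, `probe` would be a small write on the primary. A hung worker thread is simply abandoned in this sketch, which is acceptable for a demonstration but would need proper connection cleanup in practice.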
For us, it has worked perfectly so far, and it's a quite lightweight solution and completely free. It would be interesting to know what a system hang means for you, and how you cause it. You could try changing connection_check_type from ping to another option (connection or query).