repmgrd autofailover not working if primary is down with file system hang

Open nikhil-postgres opened this issue 10 months ago • 6 comments

Hi repmgr team,

We found a bug in the repmgrd process. Whenever the primary database host hangs (it cannot perform any DML/DDL operations), the repmgrd process on the HA (standby) node keeps running but stops updating its log files. It is stuck.

The repmgrd process is in the sleep state (S) on the standby:

PID    USER     PR NI VIRT  RES   SHR  S %CPU %MEM TIME+   COMMAND
249678 postgres 20 0  87312 10604 7324 S 0.0  0.0  1334:46 /usr/pgsql-15/bin/repmgrd -f /postgres/admin/pgrepmgr/5304/pgrepmgr_5304.conf --log-level DEBUG --daemonize

When we look at the connections on the primary, the repmgrd session from the standby is stuck trying to INSERT data into the repmgr.monitoring_history table.

In this situation no autofailover takes place. Is this a known issue? How can we make sure that repmgrd performs an automatic failover in such situations?

Thanks, Nikhil

nikhil-postgres avatar Apr 16 '24 18:04 nikhil-postgres

Hi @ibarwick @martinmarques, do you know why repmgrd is not performing the autofailover?

nikhil-postgres avatar Apr 17 '24 07:04 nikhil-postgres

Hi, I had a similar issue some time ago. It resulted in "monitoring_history requested but primary connection not available" entries on the standby while no failover was happening (and repmgrd therefore continued to sleep). Since I started restarting repmgrd regularly, it has not happened again. I have not yet tested whether this is still an issue in newer versions. Stephan

stephan-hahn avatar May 07 '24 14:05 stephan-hahn

Hi @stephan-hahn, newer versions also have the same issue, but I don’t see any update from the repmgr team. Is repmgr being actively developed, and are the issues being looked into?

nikhil-postgres avatar May 30 '24 01:05 nikhil-postgres

I hope so (and I think so). We use a cluster solution based on repmgr.

Could you work around the problem with a regular restart (e.g. daily)?

stephan-hahn avatar Jun 10 '24 10:06 stephan-hahn

Yes, we can work around it with a restart, but as an autofailover solution, repmgr should be able to detect a system hang on the primary and perform a failover.

It is not performing the failover because repmgrd's own connections are hung on the primary.

nikhil-postgres avatar Jul 25 '24 11:07 nikhil-postgres

For us, it has worked perfectly so far, and it is a quite lightweight solution and completely free. It would be interesting to know what a system hang means for you and how you cause it. You could try changing connection_check_type from ping to another option.
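
As a rough illustration of that suggestion, the check type is set in repmgr.conf on the node running repmgrd. The fragment below is only a sketch: `ping`, `connection`, and `query` are the values repmgr documents for `connection_check_type`, but whether any of them detects the hang described in this thread has not been tested here. The commented-out `monitoring_history` line is an assumption about something that might be worth trying, given that the stuck statement is the INSERT into repmgr.monitoring_history.

```
# repmgr.conf on the standby (illustrative fragment, not a verified fix)

# How repmgrd checks that the upstream node is reachable:
#   ping        server-level ping only (the default)
#   connection  attempt to open a new connection to the node
#   query       run an SQL statement over the existing connection
connection_check_type=query

# Assumption: disabling monitoring history avoids the INSERT that was
# observed to hang on the primary; untested for this scenario.
#monitoring_history=no
```

repmgrd has to pick up the changed file before the setting takes effect, for example through a restart.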

stephan-hahn avatar Aug 06 '24 14:08 stephan-hahn