pg_auto_failover Restart logic counts intended restarts as bad exit

Starting a monitor and running the following command 5 times in a row, causes the monitor to exit:

PG_AUTOCTL_DEBUG=true pg_autoctl do service restart listener --pgdata monitor

It exits with this log:

12:21:53 24163 FATAL pg_autoctl service listener has already been restarted 5 times in the last 10 seconds, stopping now
12:21:53 24167 INFO  Postgres controller service received signal SIGTERM, terminating
12:21:53 24167 INFO  Stopping pg_autoctl postgres service
12:21:53 24167 INFO  /home/jelte/.pgenv/pgsql/bin/pg_ctl --pgdata /home/jelte/work/pg_auto_failover/monitor --wait stop --mode fast
12:21:54 24163 FATAL Something went wrong in sub-process supervision, stopping now. See above for details.
12:21:54 24163 INFO  Stop pg_autoctl

I think we count intended restarts as bad restarts for some reason.

Jul 13 '20 10:07 JelteF

I think we count intended restarts as bad restarts for some reason.

The supervisor is designed to be very simple, it does not distinguish between exit status codes when a service is marked RP_PERMANENT. We just know that the service should be running, and restart it when it's not running anymore. So we don't have a notion of “intended” and “bad” restarts. I think it's working as designed.

Jul 13 '20 11:07 DimCitus

To be clear, this also happens when restarting node-active. We want to use restarting of node-active as the way to do upgrades of pg_autoctl without killing postgres in https://github.com/citusdata/pg_auto_failover/pull/296. For that case it seems quite weird that it would still kill postgres, if you run the command 5 times in a row.

Jul 13 '20 11:07 JelteF

It seems weird but also restarting the process 5 times in a row in less than 5 mins (default service max time) seems weird too. I'm still not clear that this is a bug, or how to improve the current defaults...

Jul 13 '20 12:07 DimCitus

Pull Request #734 introduces SUPERVISOR_EXIT_ERROR and SUPERVISOR_EXIT_FATAL situations with exit codes EXIT_CODE_DROPPED and EXIT_CODE_FATAL. We should be able to re-use that addition to distinguish more cases and improve error handling in cases. Although in this instance we don't want to interrupt for no reason.

See also #749 where we could check pg_autoctl version --json | jq .pgautofailover version string before deciding to exit for an upgrade. If the new binary is not in place yet, then certainly we don't want to exit yet.

That said, the general problem of restarting manually too frequently would need a special case to distinguish it from failure to restart the process too many times in a row. It might be that the current settings need adjustment.

#define SUPERVISOR_SERVICE_MAX_RETRY 5
#define SUPERVISOR_SERVICE_MAX_TIME 300 /* in seconds */

Jul 07 '21 11:07 DimCitus

pg_auto_failover pg_auto_failover copied to clipboard

Restart logic counts intended restarts as bad exit

pg_auto_failover
pg_auto_failover copied to clipboard