pg_auto_failover
pg_auto_failover copied to clipboard
Restart logic counts intended restarts as bad exit
Starting a monitor and running the following command 5 times in a row, causes the monitor to exit:
PG_AUTOCTL_DEBUG=true pg_autoctl do service restart listener --pgdata monitor
It exits with this log:
12:21:53 24163 FATAL pg_autoctl service listener has already been restarted 5 times in the last 10 seconds, stopping now
12:21:53 24167 INFO Postgres controller service received signal SIGTERM, terminating
12:21:53 24167 INFO Stopping pg_autoctl postgres service
12:21:53 24167 INFO /home/jelte/.pgenv/pgsql/bin/pg_ctl --pgdata /home/jelte/work/pg_auto_failover/monitor --wait stop --mode fast
12:21:54 24163 FATAL Something went wrong in sub-process supervision, stopping now. See above for details.
12:21:54 24163 INFO Stop pg_autoctl
I think we count intended restarts as bad restarts for some reason.
I think we count intended restarts as bad restarts for some reason.
The supervisor is designed to be very simple, it does not distinguish between exit status codes when a service is marked RP_PERMANENT
. We just know that the service should be running, and restart it when it's not running anymore. So we don't have a notion of “intended” and “bad” restarts. I think it's working as designed.
To be clear, this also happens when restarting node-active
. We want to use restarting of node-active
as the way to do upgrades of pg_autoctl without killing postgres in https://github.com/citusdata/pg_auto_failover/pull/296. For that case it seems quite weird that it would still kill postgres, if you run the command 5 times in a row.
It seems weird but also restarting the process 5 times in a row in less than 5 mins (default service max time) seems weird too. I'm still not clear that this is a bug, or how to improve the current defaults...
Pull Request #734 introduces SUPERVISOR_EXIT_ERROR
and SUPERVISOR_EXIT_FATAL
situations with exit codes EXIT_CODE_DROPPED
and EXIT_CODE_FATAL
. We should be able to re-use that addition to distinguish more cases and improve error handling in cases. Although in this instance we don't want to interrupt for no reason.
See also #749 where we could check pg_autoctl version --json | jq .pgautofailover
version string before deciding to exit for an upgrade. If the new binary is not in place yet, then certainly we don't want to exit yet.
That said, the general problem of restarting manually too frequently would need a special case to distinguish it from failure to restart the process too many times in a row. It might be that the current settings need adjustment.
#define SUPERVISOR_SERVICE_MAX_RETRY 5
#define SUPERVISOR_SERVICE_MAX_TIME 300 /* in seconds */