Monitor crashes immediately post-upgrade (1.4.2->1.6.2)
(This shares similarities with #810, but I can't tell whether they have the same root cause.)
We're seeing intermittent crashes when upgrading the monitor from 1.4.2 to 1.6.2:
```
TRAP: FailedAssertion("attnum <= tupleDesc->natts", File: "heaptuple.c", Line: 89)
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (ExceptionalCondition+0x80)[0x5624881373f5]
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (getmissingattr+0x85)[0x562487c77c23]
/usr/local/pgsql/lib/pgautofailover.so(TupleToAutoFailoverNode+0x1c50)[0x7f5d17a2a346]
/usr/local/pgsql/lib/pgautofailover.so(SetNodeHealthState+0x1f5)[0x7f5d17a34308]
/usr/local/pgsql/lib/pgautofailover.so(+0xfeb0)[0x7f5d17a2feb0]
/usr/local/pgsql/lib/pgautofailover.so(+0x1034e)[0x7f5d17a3034e]
/usr/local/pgsql/lib/pgautofailover.so(HealthCheckWorkerMain+0x17d)[0x7f5d17a307ff]
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (StartBackgroundWorker+0x245)[0x562487f70904]
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (+0x3d1748)[0x562487f7f748]
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (+0x3d18c4)[0x562487f7f8c4]
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (+0x3d20eb)[0x562487f800eb]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7f5d1923d3c0]
/lib/x86_64-linux-gnu/libc.so.6(sigprocmask+0x4b)[0x7f5d18b0e48b]
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (+0x3d2a94)[0x562487f80a94]
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (PostmasterMain+0x10f7)[0x562487f81dc7]
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (main+0x210)[0x562487ecad9f]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f5d18aef0b3]
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (_start+0x2e)[0x562487c6c14e]
```
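For context, the failing assertion lives in PostgreSQL's getmissingattr() (src/backend/access/common/heaptuple.c), which heap_getattr() falls through to whenever a caller requests an attribute number beyond what the tuple physically stores. Paraphrased from the PostgreSQL source, so the exact shape varies by major version:

```c
/* Paraphrased from PostgreSQL's heaptuple.c; not pg_auto_failover code. */
static Datum
getmissingattr(TupleDesc tupleDesc, int attnum, bool *isnull)
{
    Form_pg_attribute att;

    /* This is the assertion that trips in the trace above. */
    Assert(attnum <= tupleDesc->natts);

    att = TupleDescAttr(tupleDesc, attnum - 1);

    /* ... falls through to the attribute's "missing" default, if any ... */
}
```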
Repro
It looks like there's a race condition in one of the background workers, so reproduction is tricky, but I've been able to reliably reproduce this stack trace with the following GUC settings on the monitor:
```
pgautofailover.health_check_period = 1000
pgautofailover.health_check_timeout = 100
pgautofailover.health_check_max_retries = 1
pgautofailover.health_check_retry_delay = 1
```
With the above settings for the monitor, do the following:
- Create a monitor, a primary, and two secondaries with pg_auto_failover 1.4.
- Stop the monitor and one of the secondaries (to force a health check transition when the monitor comes back online later).
- Update to pg_auto_failover 1.6.
- Start the monitor, but do not start the downed secondary.
As soon as the monitor notices the bad secondary, I get the assertion failure in the server logs.
RCA(?)
Best I can tell, SetNodeHealthState() is to blame:
```c
void
SetNodeHealthState(int64 nodeId,
                   char *nodeName,
                   char *nodeHost,
                   uint16 nodePort,
                   int previousHealthState,
                   int healthState)
{
    ...

    if (HaMonitorHasBeenLoaded())
    {
        initStringInfo(&query);
        appendStringInfo(&query,
                         "UPDATE " AUTO_FAILOVER_NODE_TABLE
                         " SET health = %d, healthchecktime = now() "
                         " WHERE nodeid = %lld "
                         " AND nodehost = %s AND nodeport = %d "
                         " RETURNING node.*",
                         healthState,
                         (long long) nodeId,
                         quote_literal_cstr(nodeHost),
                         nodePort);

        pgstat_report_activity(STATE_RUNNING, query.data);

        spiStatus = SPI_execute(query.data, false, 0);
        Assert(spiStatus == SPI_OK_UPDATE_RETURNING);

        /*
         * We should have 0 or 1 row impacted, because of pkey on nodeid. We
         * might have updated zero rows when a node is concurrently being
         * DELETEd, because of the default REPEATABLE READ isolation level.
         */
        if (SPI_processed == 1)
        {
            if (healthState != previousHealthState)
            {
                HeapTuple heapTuple = SPI_tuptable->vals[0];
                AutoFailoverNode *pgAutoFailoverNode =
                    TupleToAutoFailoverNode(SPI_tuptable->tupdesc, heapTuple);
```
In this case, the SPI query performs a RETURNING node.* that returns a row from the 1.4 version of pg_auto_failover, which has only 19 columns. Since healthState != previousHealthState (because we stopped one of the secondaries during the upgrade), we call TupleToAutoFailoverNode(), which expects a 1.6 tuple with 21 columns. That code walks off the end of the row and crashes.
If the monitor manages to update to 1.6 before any node's health state changes, we win the race and do not see the crash, which is why lowering the GUC values above helps with reproduction.
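A defensive check would at least turn this crash into a clean error, though it wouldn't fix the underlying race. A minimal sketch, assuming the extension defines a Natts_pgautofailover_node-style constant for the number of columns the compiled code expects (the constant name here is my assumption):

```c
/*
 * Hypothetical guard, for illustration only: refuse to decode a tuple whose
 * descriptor has fewer attributes than this build of the extension expects,
 * which is exactly the situation when pgautofailover.node is still at 1.4.
 * Natts_pgautofailover_node is assumed to be the expected column count.
 */
if (SPI_tuptable->tupdesc->natts < Natts_pgautofailover_node)
{
    ereport(ERROR,
            (errmsg("pgautofailover.node has %d columns, expected %d",
                    SPI_tuptable->tupdesc->natts,
                    Natts_pgautofailover_node),
             errhint("Run ALTER EXTENSION pgautofailover UPDATE first.")));
}
```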
Hi @jchampio; thanks for your analysis and repro steps! Could you also work on a PR to fix it? Note that I will be away on vacation next week, so I will only be able to review when I'm back, sometime in November.
Hi @DimCitus, this week has been crazy and I haven't gotten a chance to look at it yet. I think the main question I have is what the strategy should be when the monitor notices that a previous version is installed -- do we issue backwards-compatible queries, or do we bail out and wait for the extension upgrade?
I'm not sure how we would go about backward compatibility; it seems to me that we would end up with an AutoFailoverNode *pgAutoFailoverNode that is only partially filled, and at the moment the code doesn't know how to deal with that. It looks quite complex to make that happen, for dubious benefit.
I think bail-out and wait for the extension upgrade is the better path forward.
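As a rough sketch of the bail-out idea (illustrative only; aside from the pg_catalog.pg_extension query, the function name and the EXPECTED_EXTENSION_VERSION constant are my inventions), the health check worker could verify the installed extension version before touching any tables:

```c
#include "postgres.h"
#include "executor/spi.h"

/* Hypothetical compile-time constant naming the version this .so expects. */
#define EXPECTED_EXTENSION_VERSION "1.6"

/*
 * Sketch only, not the project's actual code: returns true when the
 * installed pgautofailover extension matches the version this binary was
 * built for. Assumes the caller already has an open SPI connection inside
 * a transaction (i.e. SPI_connect() has been called).
 */
static bool
ExtensionVersionMatches(void)
{
    int spiStatus = SPI_execute(
        "SELECT extversion FROM pg_catalog.pg_extension"
        " WHERE extname = 'pgautofailover'",
        true, 1);

    if (spiStatus == SPI_OK_SELECT && SPI_processed == 1)
    {
        char *version = SPI_getvalue(SPI_tuptable->vals[0],
                                     SPI_tuptable->tupdesc, 1);

        return version != NULL &&
               strcmp(version, EXPECTED_EXTENSION_VERSION) == 0;
    }

    return false;
}
```

On a mismatch, the worker could log a message and sleep (or exit and let the postmaster restart it) until ALTER EXTENSION pgautofailover UPDATE; has been run, after which the normal health check loop resumes.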
If all 3 of our nodes went into a bad state, how can we identify which one will become the primary in this case, given that the Postgres service is not running on any of the 3 nodes?