
Monitor crashes immediately post-upgrade (1.4.2->1.6.2)

Open · jchampio opened this issue on Oct 19, 2021 · 4 comments

(This shares similarities with #810, but I can't tell whether they have the same root cause.)

We're seeing intermittent crashes when upgrading the monitor from 1.4.2 to 1.6.2:

TRAP: FailedAssertion("attnum <= tupleDesc->natts", File: "heaptuple.c", Line: 89)
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (ExceptionalCondition+0x80)[0x5624881373f5]
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (getmissingattr+0x85)[0x562487c77c23]
/usr/local/pgsql/lib/pgautofailover.so(TupleToAutoFailoverNode+0x1c50)[0x7f5d17a2a346]
/usr/local/pgsql/lib/pgautofailover.so(SetNodeHealthState+0x1f5)[0x7f5d17a34308]
/usr/local/pgsql/lib/pgautofailover.so(+0xfeb0)[0x7f5d17a2feb0]
/usr/local/pgsql/lib/pgautofailover.so(+0x1034e)[0x7f5d17a3034e]
/usr/local/pgsql/lib/pgautofailover.so(HealthCheckWorkerMain+0x17d)[0x7f5d17a307ff]
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (StartBackgroundWorker+0x245)[0x562487f70904]
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (+0x3d1748)[0x562487f7f748]
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (+0x3d18c4)[0x562487f7f8c4]
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (+0x3d20eb)[0x562487f800eb]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7f5d1923d3c0]
/lib/x86_64-linux-gnu/libc.so.6(sigprocmask+0x4b)[0x7f5d18b0e48b]
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (+0x3d2a94)[0x562487f80a94]
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (PostmasterMain+0x10f7)[0x562487f81dc7]
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (main+0x210)[0x562487ecad9f]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f5d18aef0b3]
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (_start+0x2e)[0x562487c6c14e]

Repro

It looks like there's a race condition in one of the background workers, so reproduction is tricky, but I've been able to reliably reproduce this stack trace with the following GUC settings on the monitor:

pgautofailover.health_check_period = 1000
pgautofailover.health_check_timeout = 100
pgautofailover.health_check_max_retries = 1
pgautofailover.health_check_retry_delay = 1
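
These can go straight into the monitor's postgresql.conf, or be applied through SQL; a rough equivalent, assuming superuser access on the monitor's Postgres instance (reload should be enough if these GUCs are reloadable, otherwise restart the monitor):

ALTER SYSTEM SET pgautofailover.health_check_period = 1000;
ALTER SYSTEM SET pgautofailover.health_check_timeout = 100;
ALTER SYSTEM SET pgautofailover.health_check_max_retries = 1;
ALTER SYSTEM SET pgautofailover.health_check_retry_delay = 1;
SELECT pg_reload_conf();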

With the above settings for the monitor, do the following:

  • Create a monitor, a primary, and two secondaries with pg_auto_failover 1.4.
  • Stop the monitor and one of the secondaries (to force a health check transition when the monitor comes back online later).
  • Update to pg_auto_failover 1.6.
  • Start the monitor, but do not start the downed secondary.

As soon as the monitor notices the bad secondary, I get the assertion failure in the server logs.

RCA(?)

Best I can tell, SetNodeHealthState() is to blame:

void
SetNodeHealthState(int64 nodeId,
                   char *nodeName,
                   char *nodeHost,
                   uint16 nodePort,
                   int previousHealthState,
                   int healthState)
{
    ...

    if (HaMonitorHasBeenLoaded())
    {
        initStringInfo(&query);
        appendStringInfo(&query,
                         "UPDATE " AUTO_FAILOVER_NODE_TABLE
                         "   SET health = %d, healthchecktime = now() "
                         " WHERE nodeid = %lld "
                         "   AND nodehost = %s AND nodeport = %d "
                         " RETURNING node.*",
                         healthState,
                         (long long) nodeId,
                         quote_literal_cstr(nodeHost),
                         nodePort);

        pgstat_report_activity(STATE_RUNNING, query.data);

        spiStatus = SPI_execute(query.data, false, 0);
        Assert(spiStatus == SPI_OK_UPDATE_RETURNING);

        /*
         * We should have 0 or 1 row impacted, because of pkey on nodeid. We
         * might have updated zero rows when a node is concurrently being
         * DELETEd, because of the default REPEATABLE READ isolation level.
         */
        if (SPI_processed == 1)
        {
            if (healthState != previousHealthState)
            {
                HeapTuple heapTuple = SPI_tuptable->vals[0];
                AutoFailoverNode *pgAutoFailoverNode =
                    TupleToAutoFailoverNode(SPI_tuptable->tupdesc, heapTuple);

In this case, the SPI query's RETURNING node.* returns a row from the 1.4 version of pg_auto_failover, which has only 19 columns. Since healthState != previousHealthState (because we stopped one of the secondaries during the upgrade), we call TupleToAutoFailoverNode(), which expects a 1.6 tuple with 21 columns. That code then walks off the end of the row and crashes.
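
To see the mismatch from SQL, these two queries on the monitor database show which extension version is currently installed and how many columns the node table carries (a quick check, assuming the table behind AUTO_FAILOVER_NODE_TABLE is pgautofailover.node):

SELECT extversion FROM pg_extension WHERE extname = 'pgautofailover';

SELECT count(*) AS node_columns
  FROM information_schema.columns
 WHERE table_schema = 'pgautofailover'
   AND table_name = 'node';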

If the monitor manages to update the extension to 1.6 before a node's health state changes, we win the race and do not see the crash, which is why turning those GUC values down helps with reproduction.
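
("Update to 1.6" here means the extension upgrade on the monitor database; once it has run, RETURNING node.* carries all 21 columns and the window closes. As far as I can tell, pg_autoctl issues this itself when the upgraded monitor starts.)

ALTER EXTENSION pgautofailover UPDATE;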

jchampio · Oct 19, 2021

Hi @jchampio; thanks for your analysis and repro steps! Could you also work on a PR to fix it? Note that I will be away on vacation next week, so I will only be able to review once I'm back, sometime in November.

DimCitus · Oct 22, 2021

Hi @DimCitus, this week has been crazy and I haven't gotten a chance to look at it yet. I think the main question I have is what the strategy should be when the monitor notices that a previous version is installed -- do we issue backwards-compatible queries, or do we bail out and wait for the extension upgrade?

jchampio · Oct 28, 2021

I'm not sure how we would handle backward compatibility; it seems to me that we would end up with a partially filled AutoFailoverNode *pgAutoFailoverNode, and at the moment the code doesn't know how to deal with that. It looks quite complex to make this happen, for a dubious benefit.

I think bailing out and waiting for the extension upgrade is the better path forward.

DimCitus · Nov 2, 2021
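
For illustration only, the kind of "is the upgrade done yet?" check that the bail-out approach needs can be expressed in SQL (the real guard would presumably live in the C code, before TupleToAutoFailoverNode() is called):

SELECT installed_version IS DISTINCT FROM default_version AS needs_upgrade
  FROM pg_available_extensions
 WHERE name = 'pgautofailover';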

If all 3 of our nodes go into a bad state, how can we identify which one should become the primary, given that the Postgres service is not running on any of the 3 nodes?

Rajeevshar · Mar 9, 2024