
Dropping nodes during `join_primary` results in illegal state transition

Open jchampio opened this issue 5 years ago • 1 comment

After #480, dropping a node while the primary is in the `join_primary` state results in a hung transition.

Here's a sample state log:

2020-10-29 17:15:52.809786+00:00   node_5      wait_standby/wait_standby      unknown        0/0 node 5 "node_5" (172.27.1.7:5432) reported new state "wait_standby"
2020-10-29 17:15:52.827077+00:00   node_1           primary/join_primary         sync  0/6012978 Setting goal state of node 1 "node_1" (172.27.1.3:5432) to join_primary after node 5 "node_5" (172.27.1.7:5432) joined.
2020-10-29 17:15:52.879759+00:00   node_1      join_primary/join_primary         sync  0/6012978 node 1 "node_1" (172.27.1.3:5432) reported new state "join_primary"
2020-10-29 17:15:52.884601+00:00   node_5      wait_standby/catchingup        unknown        0/0 Setting goal state of node 5 "node_5" (172.27.1.7:5432) to catchingup after node 1 "node_1" (172.27.1.3:5432) converged to wait_primary.
2020-10-29 17:15:55.188883+00:00   node_5        catchingup/catchingup        unknown  0/8000000 node 5 "node_5" (172.27.1.7:5432) reported new state "catchingup"
2020-10-29 17:15:55.738690+00:00   node_1      join_primary/apply_settings       sync  0/8000000 Setting goal state of node 1 "node_1" (172.27.1.3:5432) to apply_settings after removing standby node 3 "node_3" (172.27.1.5:5432) from formation default.

And the failure message from the primary:

17:17:22 199 FATAL fsm.c:514 pg_autoctl does not know how to reach state "apply_settings" from "join_primary"
17:17:22 199 ERROR service_keeper.c:461 Failed to transition to state "apply_settings", retrying...
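
To illustrate the kind of failure reported above, here is a minimal sketch of a state machine that only knows a fixed set of edges. The `TRANSITIONS` set, `UnreachableState`, and `reach` are hypothetical names for illustration; this is not the actual table or code from `fsm.c`, and the edges listed are assumptions chosen to mirror the log.

```python
# Hypothetical, heavily simplified transition table. It is NOT the real
# table from fsm.c; it lists only the edges needed to illustrate the report.
TRANSITIONS = {
    ("primary", "join_primary"),
    ("join_primary", "primary"),
    ("primary", "apply_settings"),
    ("apply_settings", "primary"),
}


class UnreachableState(Exception):
    """Raised when no edge exists from the current state to the assigned one."""


def reach(current, assigned):
    """Return the new state, or raise if the table has no matching edge."""
    if current == assigned:
        return current
    if (current, assigned) not in TRANSITIONS:
        raise UnreachableState(
            f'does not know how to reach state "{assigned}" from "{current}"'
        )
    return assigned


if __name__ == "__main__":
    # The monitor assigned apply_settings while the node was in join_primary.
    try:
        reach("join_primary", "apply_settings")
    except UnreachableState as err:
        print(err)
```

In this sketch the keeper has no edge for `join_primary` to `apply_settings`, so a goal state assigned at that moment can never be reached and every retry fails the same way.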

Draft PR with a failing unit test is ~forthcoming~ here.

jchampio avatar Oct 29 '20 17:10 jchampio

Hi,

Is there some way to force a primary in the 'join_primary' state to transition back to 'primary'? I changed the node priority while the primary was still in the 'join_primary' state, and this is what I ended up with:

postgres@database-pgautoctl-monitor:~$ pg_autoctl show state --pgdata monitor --formation cluster-REDACTED
         Name |  Node |                    Host:Port |          TLI: LSN |   Connection |       Current State |      Assigned State
--------------+-------+------------------------------+-------------------+--------------+---------------------+--------------------
REDACTED-1-50 |    11 | database-REDACTED-1-50:10050 |   3: 507/61A84948 |   read-write |        join_primary |      apply_settings
REDACTED-1-15 |    12 | database-REDACTED-1-15:10050 |   3: 507/61A806F8 |    read-only |           secondary |           secondary
REDACTED-2-50 |    13 | database-REDACTED-2-50:10050 |   3: 507/61A84660 |    read-only |          catchingup |          catchingup

Oct 10 11:13:58 database-REDACTED-1-50 pg_autoctl[1787805]: 11:13:58 1787805 FATAL pg_autoctl does not know how to reach state "apply_settings" from "join_primary"
Oct 10 11:13:58 database-REDACTED-1-50 pg_autoctl[1787805]: 11:13:58 1787805 ERROR Failed to transition to state "apply_settings", retrying...

What would be the safest way to recover from this?
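
As a diagnostic aid only (it does not recover anything), here is a small sketch that parses the text table printed by `pg_autoctl show state` above and lists nodes whose current and assigned states differ. The function names `parse_show_state` and `pending_transitions` are hypothetical, and the parser assumes the pipe-separated layout shown in this comment.

```python
SAMPLE = """\
         Name |  Node |                    Host:Port |          TLI: LSN |   Connection |       Current State |      Assigned State
--------------+-------+------------------------------+-------------------+--------------+---------------------+--------------------
REDACTED-1-50 |    11 | database-REDACTED-1-50:10050 |   3: 507/61A84948 |   read-write |        join_primary |      apply_settings
REDACTED-1-15 |    12 | database-REDACTED-1-15:10050 |   3: 507/61A806F8 |    read-only |           secondary |           secondary
REDACTED-2-50 |    13 | database-REDACTED-2-50:10050 |   3: 507/61A84660 |    read-only |          catchingup |          catchingup
"""


def parse_show_state(text):
    """Parse the pipe-separated table into a list of dicts keyed by header."""
    rows = []
    header = None
    for line in text.splitlines():
        if "|" not in line:
            continue  # skips the dashed separator and any non-table lines
        cells = [cell.strip() for cell in line.split("|")]
        if header is None:
            header = cells  # first pipe-separated line is the header
            continue
        if len(cells) == len(header):
            rows.append(dict(zip(header, cells)))
    return rows


def pending_transitions(rows):
    """Return (name, current, assigned) for nodes that have not converged."""
    return [
        (row["Name"], row["Current State"], row["Assigned State"])
        for row in rows
        if row["Current State"] != row["Assigned State"]
    ]


if __name__ == "__main__":
    for name, current, assigned in pending_transitions(parse_show_state(SAMPLE)):
        print(f"{name}: {current} -> {assigned}")
```

A node that briefly differs between the two columns is normal during any transition; the case described here is one where the difference persists and the keeper logs the FATAL message on every retry.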

Akkowicz avatar Oct 10 '21 09:10 Akkowicz