pg_auto_failover
Dropping nodes during `join_primary` results in an illegal state transition
After #480, dropping a node while the primary is in the `join_primary` state leaves the primary stuck: the monitor assigns it `apply_settings`, which it cannot reach from `join_primary`.
Here's a sample state log:
2020-10-29 17:15:52.809786+00:00 node_5 wait_standby/wait_standby unknown 0/0 node 5 "node_5" (172.27.1.7:5432) reported new state "wait_standby"
2020-10-29 17:15:52.827077+00:00 node_1 primary/join_primary sync 0/6012978 Setting goal state of node 1 "node_1" (172.27.1.3:5432) to join_primary after node 5 "node_5" (172.27.1.7:5432) joined.
2020-10-29 17:15:52.879759+00:00 node_1 join_primary/join_primary sync 0/6012978 node 1 "node_1" (172.27.1.3:5432) reported new state "join_primary"
2020-10-29 17:15:52.884601+00:00 node_5 wait_standby/catchingup unknown 0/0 Setting goal state of node 5 "node_5" (172.27.1.7:5432) to catchingup after node 1 "node_1" (172.27.1.3:5432) converged to wait_primary.
2020-10-29 17:15:55.188883+00:00 node_5 catchingup/catchingup unknown 0/8000000 node 5 "node_5" (172.27.1.7:5432) reported new state "catchingup"
2020-10-29 17:15:55.738690+00:00 node_1 join_primary/apply_settings sync 0/8000000 Setting goal state of node 1 "node_1" (172.27.1.3:5432) to apply_settings after removing standby node 3 "node_3" (172.27.1.5:5432) from formation default.
And the failure message from the primary:
17:17:22 199 FATAL fsm.c:514 pg_autoctl does not know how to reach state "apply_settings" from "join_primary"
17:17:22 199 ERROR service_keeper.c:461 Failed to transition to state "apply_settings", retrying...
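To illustrate why the keeper keeps retrying without making progress, here is a minimal sketch of a table-driven state machine. It is a simplified model, not the code in `fsm.c`: the `TRANSITIONS` set is a hypothetical subset picked to mirror the log above, and `transition`, `converge`, and `UnknownTransition` are illustrative names.

```python
# Simplified model of a keeper-style state machine. This is NOT pg_autoctl's
# real transition table; the pairs below are a hypothetical subset chosen to
# mirror the states seen in the log.
TRANSITIONS = {
    ("primary", "join_primary"),
    ("join_primary", "primary"),
    ("primary", "apply_settings"),
    ("apply_settings", "primary"),
}


class UnknownTransition(Exception):
    """Raised when no (current, goal) entry exists in the table."""


def transition(current, goal):
    """Return the new state, or raise if the table has no such edge."""
    if current == goal:
        return current
    if (current, goal) not in TRANSITIONS:
        raise UnknownTransition(
            f'does not know how to reach state "{goal}" from "{current}"'
        )
    return goal


def converge(current, goal, max_attempts=3):
    """Retry the same transition, as a retry loop would.

    Returns (state, attempts_used). Retrying cannot help when the edge is
    missing, because the goal state never changes between attempts.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return transition(current, goal), attempt
        except UnknownTransition:
            continue
    return current, max_attempts


if __name__ == "__main__":
    print(converge("primary", "apply_settings"))
    print(converge("join_primary", "apply_settings"))
```

In this model the node only gets unstuck if the table gains a `join_primary` to `apply_settings` edge or the assigned goal state changes; which of those is the right fix for pg_auto_failover is not something this sketch settles.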
Draft PR with a failing unit test is ~forthcoming~ here.
Hi,
Is there a way to force a primary that is stuck in the 'join_primary' state back to 'primary'? I changed a node's priority while the primary was still in 'join_primary', and this is what I ended up with:
postgres@database-pgautoctl-monitor:~$ pg_autoctl show state --pgdata monitor --formation cluster-REDACTED
Name | Node | Host:Port | TLI: LSN | Connection | Current State | Assigned State
--------------+-------+------------------------------+-------------------+--------------+---------------------+--------------------
REDACTED-1-50 | 11 | database-REDACTED-1-50:10050 | 3: 507/61A84948 | read-write | join_primary | apply_settings
REDACTED-1-15 | 12 | database-REDACTED-1-15:10050 | 3: 507/61A806F8 | read-only | secondary | secondary
REDACTED-2-50 | 13 | database-REDACTED-2-50:10050 | 3: 507/61A84660 | read-only | catchingup | catchingup
Oct 10 11:13:58 database-REDACTED-1-50 pg_autoctl[1787805]: 11:13:58 1787805 FATAL pg_autoctl does not know how to reach state "apply_settings" from "join_primary"
Oct 10 11:13:58 database-REDACTED-1-50 pg_autoctl[1787805]: 11:13:58 1787805 ERROR Failed to transition to state "apply_settings", retrying...
What would be the safest way to recover from this?
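This does not recover anything, but for confirming which node is affected, a small sketch that scans the text output of `pg_autoctl show state` and lists nodes whose Current State differs from their Assigned State. It assumes the seven-column, pipe-separated layout shown above; `mismatched_nodes` is an illustrative helper, not part of pg_autoctl. A mismatch is normal for a moment during any transition, so only one that persists across runs points at a stuck node.

```python
# Sample copied from the output above (assumed layout: 7 pipe-separated columns).
SAMPLE = """\
Name | Node | Host:Port | TLI: LSN | Connection | Current State | Assigned State
--------------+-------+------------------------------+-------------------+--------------+---------------------+--------------------
REDACTED-1-50 | 11 | database-REDACTED-1-50:10050 | 3: 507/61A84948 | read-write | join_primary | apply_settings
REDACTED-1-15 | 12 | database-REDACTED-1-15:10050 | 3: 507/61A806F8 | read-only | secondary | secondary
REDACTED-2-50 | 13 | database-REDACTED-2-50:10050 | 3: 507/61A84660 | read-only | catchingup | catchingup
"""


def mismatched_nodes(show_state_output):
    """Return (name, current, assigned) for rows where the two states differ."""
    rows = []
    for line in show_state_output.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # Skip the header, the separator line, and anything that is not a row.
        if len(cells) != 7 or cells[0] == "Name":
            continue
        name, current, assigned = cells[0], cells[5], cells[6]
        if current != assigned:
            rows.append((name, current, assigned))
    return rows


if __name__ == "__main__":
    for name, current, assigned in mismatched_nodes(SAMPLE):
        print(f"{name}: current={current} assigned={assigned}")
```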