pg_auto_failover
pg_auto_failover copied to clipboard
test_multi_standbys: add failing unit test
Rewrite test_014 to exercise a "simultaneous" add/drop situation, which leads to a hung FSM. This does not fix the problem, just expose it. See #482 for full details.
One additional issue: the wait_until_state(join_primary) call is itself racing against the primary, which might have already transitioned out of that state by the time we reach the call. I'm not sure how to fix that without a lot more plumbing.
Based on code originally authored with @rheaton.
We were also dabbling with testing other valid primary states, such as wait_primary. But all the inherent races involved with dropping a secondary at the same time that a primary is in the middle of a multi-step transition out of single led us to tackle this one first.
I kind of liked the idea of driving the process manually with fsm assign and fsm step, I think that requires a setup without a monitor though nowadays. Do you want to investigate in that? You'd be writing the first coverage of the --disable-monitor feature, but it might be good if that allowed to construct difficult to reach racy situations and make sure we're good.
Do you want to investigate in that?
Sure! No promises on a quick turnaround, though; a couple of other plates are being juggled right now. 😄
@DimCitus I picked this back up and attempted to rebase it on top of the --disable-monitor features in #494.
Unfortunately, the code that has the bug under test is inside the monitor (RemoveNode()), so disabling it doesn't help us catch the error case. What I think I'd like to see is the ability to run a monitor, so that its code and interactions can be tested, but prevent it from making state transitions until I say so, in the vein of do fsm.
Hi @jchampio ; then maybe you should add an SQL regression test file to the monitor extension test suite over at https://github.com/citusdata/pg_auto_failover/blob/master/src/monitor/sql/monitor.sql. See also https://github.com/citusdata/pg_auto_failover/issues/464 for allowing make -C src/monitor installcheck to run on the Docker environment for the test suite. Even without that, we can make progress for this PR.
@DimCitus Looks like those regression tests aren't currently passing; git bisect shows 3d6f4af introduced the failures.