pg_auto_failover icon indicating copy to clipboard operation
pg_auto_failover copied to clipboard

test_multi_standbys: add failing unit test

Open jchampio opened this issue 5 years ago • 7 comments

Rewrite test_014 to exercise a "simultaneous" add/drop situation, which leads to a hung FSM. This does not fix the problem, just expose it. See #482 for full details.

One additional issue: the wait_until_state(join_primary) call is itself racing against the primary, which might have already transitioned out of that state by the time we reach the call. I'm not sure how to fix that without a lot more plumbing.

Based on code originally authored with @rheaton.

jchampio avatar Oct 29 '20 17:10 jchampio

CLA assistant check
All CLA requirements met.

ghost avatar Oct 29 '20 17:10 ghost

We were also dabbling with testing other valid primary states, such as wait_primary. But all the inherent races involved with dropping a secondary at the same time that a primary is in the middle of a multi-step transition out of single led us to tackle this one first.

jchampio avatar Oct 29 '20 17:10 jchampio

I kind of liked the idea of driving the process manually with fsm assign and fsm step, I think that requires a setup without a monitor though nowadays. Do you want to investigate in that? You'd be writing the first coverage of the --disable-monitor feature, but it might be good if that allowed to construct difficult to reach racy situations and make sure we're good.

DimCitus avatar Oct 29 '20 17:10 DimCitus

Do you want to investigate in that?

Sure! No promises on a quick turnaround, though; a couple of other plates are being juggled right now. 😄

jchampio avatar Oct 29 '20 17:10 jchampio

@DimCitus I picked this back up and attempted to rebase it on top of the --disable-monitor features in #494.

Unfortunately, the code that has the bug under test is inside the monitor (RemoveNode()), so disabling it doesn't help us catch the error case. What I think I'd like to see is the ability to run a monitor, so that its code and interactions can be tested, but prevent it from making state transitions until I say so, in the vein of do fsm.

jchampio avatar Nov 30 '20 19:11 jchampio

Hi @jchampio ; then maybe you should add an SQL regression test file to the monitor extension test suite over at https://github.com/citusdata/pg_auto_failover/blob/master/src/monitor/sql/monitor.sql. See also https://github.com/citusdata/pg_auto_failover/issues/464 for allowing make -C src/monitor installcheck to run on the Docker environment for the test suite. Even without that, we can make progress for this PR.

DimCitus avatar Nov 30 '20 19:11 DimCitus

@DimCitus Looks like those regression tests aren't currently passing; git bisect shows 3d6f4af introduced the failures.

jchampio avatar Dec 02 '20 17:12 jchampio