
Clean up flaky tests in `test_inference.py`

[Open] natestemen opened this issue 2 years ago · 10 comments

We seem to have a few flaky tests that have popped up recently, giving the dreaded red X. Since our tests often take 20-30 minutes to run, restarting them due to a flaky test is undesirable.

natestemen avatar Sep 26 '22 20:09 natestemen

master just failed on mitiq/zne/tests/test_inference.py::test_plot_data[LinearFactory]

natestemen avatar Oct 03 '22 14:10 natestemen

another fail on mitiq/zne/tests/test_inference.py::test_plot_data[LinearFactory]

natestemen avatar Oct 18 '22 20:10 natestemen

another fail on mitiq/zne/tests/test_inference.py::test_plot_data[LinearFactory]

natestemen avatar Oct 25 '22 15:10 natestemen

another fail with both

  • mitiq/zne/tests/test_inference.py::test_plot_data[LinearFactory]
  • mitiq/zne/tests/test_inference.py::test_plot_data[RichardsonFactory]

natestemen avatar Oct 31 '22 15:10 natestemen

another fail on mitiq/zne/tests/test_inference.py::test_plot_fit[LinearFactory]

natestemen avatar Oct 31 '22 19:10 natestemen

another fail on mitiq/zne/tests/test_inference.py::test_plot_data[LinearFactory]

natestemen avatar Nov 03 '22 17:11 natestemen

All of the errors linked above have the same pattern, and they're due to pytest-xdist workers going down while running the test.

mitiq/zne/tests/test_inference.py::test_plot_data[LinearFactory] 
Current thread 0x0000700007798000 (most recent call first):
[gw2] node down: Not properly terminated
...
[gw2] [ 89%] FAILED mitiq/zne/tests/test_inference.py::test_plot_data[LinearFactory] 

replacing crashed worker gw2

This seems to be a known issue in pytest-xdist, which we use to speed up the test run by distributing tests across multiple CPUs. The two related issues I've found are

  1. https://github.com/pytest-dev/pytest-xdist/issues/466, and
  2. https://github.com/pytest-dev/pytest-xdist/issues/714.

The problem would likely go away if we stopped distributing the tests, but IMO that is not a reasonable trade-off given the speed-up xdist provides over running the tests sequentially. Ideally pytest-xdist would recover from this situation and re-run the test that was in flight when the worker died. For that reason I recommend we keep this issue open but take no action for now: manually rerun failures when we see them, and keep monitoring in case a fix lands upstream or the failures become frequent enough that we decide to act.
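For reference, here is a minimal sketch of what an automated stopgap could look like if we ever decide to act. This is hypothetical: the invocation below is an assumption and may not match our actual CI workflow, and I have not verified that pytest-rerunfailures catches the "node down" failures produced by a crashed worker.

# Current style (assumed): distribute tests across all CPUs with pytest-xdist.
pytest -n auto mitiq/zne/tests/test_inference.py

# Possible mitigation: install pytest-rerunfailures and retry failing tests
# a couple of times before marking the run red.
pip install pytest-rerunfailures
pytest -n auto --reruns 2 mitiq/zne/tests/test_inference.py

Note that pytest-xdist's own --max-worker-restart option only limits how many crashed workers get replaced (the "replacing crashed worker gw2" line above); as far as I can tell it does not re-run the test that was running at the time of the crash, which is exactly the gap here.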

WDYT @Misty-W @andreamari

natestemen avatar Nov 22 '22 19:11 natestemen

Good finds @natestemen! I agree to keep this open for now and monitor once or twice per milestone cycle. I'll go ahead and remove it from Milestone 0.21 since it's unlikely to be solved by the time that milestone closes. If/when we see a solution that needs action on our side, we can add it to whichever milestone is active then.

Misty-W avatar Nov 22 '22 20:11 Misty-W

Thanks @natestemen! I agree with you and @Misty-W: no action is probably the best strategy right now.

andreamari avatar Nov 23 '22 10:11 andreamari

New instance: https://github.com/unitaryfund/mitiq/actions/runs/3619690066/jobs/6101030193#step:5:3704

andreamari avatar Dec 06 '22 16:12 andreamari