Clean up flaky tests in `test_inference.py`
We seem to have a few flaky tests that have popped up recently, giving the dreaded red X. Since our tests often take 20-30 minutes to run, restarting them due to a flaky test is undesirable.
- In https://github.com/unitaryfund/mitiq/pull/1492 we saw `mitiq/zne/tests/test_inference.py::test_plot_data_exp_factory[ExpFactory]` fail.
- In https://github.com/unitaryfund/mitiq/actions/runs/3130210903/jobs/5080201273 we saw `mitiq/zne/tests/test_inference.py::test_plot_fit[LinearFactory]` fail.

Further failures since opening this issue:

- master just failed on `mitiq/zne/tests/test_inference.py::test_plot_data[LinearFactory]`
- another failure on `mitiq/zne/tests/test_inference.py::test_plot_data[LinearFactory]`
- another failure on `mitiq/zne/tests/test_inference.py::test_plot_data[LinearFactory]`
- another failure with both `mitiq/zne/tests/test_inference.py::test_plot_data[LinearFactory]` and `mitiq/zne/tests/test_inference.py::test_plot_data[RichardsonFactory]`
- another failure on `mitiq/zne/tests/test_inference.py::test_plot_fit[LinearFactory]`
- another failure on `mitiq/zne/tests/test_inference.py::test_plot_data[LinearFactory]`
All of the errors linked above have the same pattern, and it's due to pytest-xdist workers going down while running the test:

```
mitiq/zne/tests/test_inference.py::test_plot_data[LinearFactory]
Current thread 0x0000700007798000 (most recent call first):
[gw2] node down: Not properly terminated
...
[gw2] [ 89%] FAILED mitiq/zne/tests/test_inference.py::test_plot_data[LinearFactory]
replacing crashed worker gw2
```
This seems to be a known issue in pytest-xdist, which we use to speed up test runtime by distributing tests across multiple CPUs. The two related upstream issues I've found are:

- https://github.com/pytest-dev/pytest-xdist/issues/466, and
- https://github.com/pytest-dev/pytest-xdist/issues/714.
The problem would go away if we stopped distributing the tests, but IMO that is not a reasonable trade-off given the speed-up distribution provides over running the tests sequentially. Ideally, pytest-xdist would be able to recover from this situation and re-run the test that was in progress when the worker failed. For that reason I recommend we keep this issue open but take no action for now. That means manually rerunning CI when we hit one of these failures, while monitoring in case a fix is implemented upstream, or the failures become frequent (or rare) enough that we decide to act.
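For reference, here is a minimal sketch of one possible mitigation we have not adopted: the pytest-rerunfailures plugin provides a `flaky` marker that automatically retries a failing test. This is only an illustration, the decorator placement and parametrization below are not the actual code in `test_inference.py`, and it's not obvious the plugin would even catch a failure caused by a crashed xdist worker rather than an ordinary assertion error.

```python
# Hypothetical sketch only -- not the actual contents of test_inference.py.
# Requires the pytest-rerunfailures plugin, which registers the `flaky` marker.
import pytest


@pytest.mark.flaky(reruns=2)  # retry a failing test up to 2 times
@pytest.mark.parametrize("factory_name", ["LinearFactory", "RichardsonFactory"])
def test_plot_data(factory_name):
    # Placeholder body: the real test builds the factory and exercises its
    # plotting code, which is omitted here.
    ...
```

If retries don't help because the whole worker process dies, the upstream pytest-xdist issues linked above remain the place where a real fix would have to land.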
WDYT @Misty-W @andreamari
Good finds @natestemen! I agree we should keep this open for now and monitor it once or twice per milestone cycle. I'll go ahead and remove it from Milestone 0.21, since it's unlikely to be resolved before that milestone closes. If/when we see a solution that needs action from our side, we can add it to whichever milestone is current at the time.
Thanks @natestemen! I agree with you and @Misty-W: no action is probably the best strategy right now.
New instance: https://github.com/unitaryfund/mitiq/actions/runs/3619690066/jobs/6101030193#step:5:3704