
Clean up flaky tests in `test_inference.py`

[Open] natestemen opened this issue 2 years ago · 10 comments

We seem to have a few flaky tests that have popped up recently, giving the dreaded red X. Since our tests often take 20-30 minutes to run, restarting them due to a flaky test is undesirable.

natestemen avatar Sep 26 '22 20:09 natestemen

master just failed on mitiq/zne/tests/test_inference.py::test_plot_data[LinearFactory]

natestemen avatar Oct 03 '22 14:10 natestemen

another fail on mitiq/zne/tests/test_inference.py::test_plot_data[LinearFactory]

natestemen avatar Oct 18 '22 20:10 natestemen

another fail on mitiq/zne/tests/test_inference.py::test_plot_data[LinearFactory]

natestemen avatar Oct 25 '22 15:10 natestemen

another fail with both

  • mitiq/zne/tests/test_inference.py::test_plot_data[LinearFactory]
  • mitiq/zne/tests/test_inference.py::test_plot_data[RichardsonFactory]

natestemen avatar Oct 31 '22 15:10 natestemen

another fail on mitiq/zne/tests/test_inference.py::test_plot_fit[LinearFactory]

natestemen avatar Oct 31 '22 19:10 natestemen

another fail on mitiq/zne/tests/test_inference.py::test_plot_data[LinearFactory]

natestemen avatar Nov 03 '22 17:11 natestemen

All of the errors linked above have the same pattern, and they're due to pytest-xdist workers going down while running the test.

mitiq/zne/tests/test_inference.py::test_plot_data[LinearFactory] 
Current thread 0x0000700007798000 (most recent call first):
[gw2] node down: Not properly terminated
...
[gw2] [ 89%] FAILED mitiq/zne/tests/test_inference.py::test_plot_data[LinearFactory] 

replacing crashed worker gw2

This seems to be a known issue in pytest-xdist, which we use to speed up the test run by distributing tests across multiple CPUs. The two related issues I've found are

  1. https://github.com/pytest-dev/pytest-xdist/issues/466, and
  2. https://github.com/pytest-dev/pytest-xdist/issues/714.

The problem would likely go away if we stopped distributing the tests, but IMO that is not a reasonable trade-off given the speed-up xdist provides over running the tests sequentially. Ideally pytest-xdist would recover from this situation and re-run the test that was in flight when the worker died. For that reason I recommend we keep this issue open but take no action for now: manually rerun failures when we see them, and keep monitoring in case a fix lands upstream or the failures become frequent enough that we decide to act.
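For reference, here is a minimal sketch of what an automated stopgap could look like if we ever decide to act. This is hypothetical: the invocation below is an assumption and may not match our actual CI workflow, and I have not verified that pytest-rerunfailures catches the "node down" failures produced by a crashed worker.

# Current style (assumed): distribute tests across all CPUs with pytest-xdist.
pytest -n auto mitiq/zne/tests/test_inference.py

# Possible mitigation: install pytest-rerunfailures and retry failing tests
# a couple of times before marking the run red.
pip install pytest-rerunfailures
pytest -n auto --reruns 2 mitiq/zne/tests/test_inference.py

Note that pytest-xdist's own --max-worker-restart option only limits how many crashed workers get replaced (the "replacing crashed worker gw2" line above); as far as I can tell it does not re-run the test that was running at the time of the crash, which is exactly the gap here.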

WDYT @Misty-W @andreamari

natestemen avatar Nov 22 '22 19:11 natestemen

Good finds @natestemen! I agree to keep this open for now and monitor once or twice per milestone cycle. I'll go ahead and remove it from Milestone 0.21 since it's unlikely to be solved by the time that milestone closes. If/when we see a solution that needs action on our side, we can add it to whichever milestone is active then.

Misty-W avatar Nov 22 '22 20:11 Misty-W

Thanks @natestemen! I agree with you and @Misty-W: no action is probably the best strategy right now.

andreamari avatar Nov 23 '22 10:11 andreamari

New instance: https://github.com/unitaryfund/mitiq/actions/runs/3619690066/jobs/6101030193#step:5:3704

andreamari avatar Dec 06 '22 16:12 andreamari