
Make tests deterministic from seed

Open • wrongu opened this issue 2 years ago • 7 comments

Add conftest.py and use it to set a random seed for all pytest tests.

This PR makes the pytest suite deterministic and solves the problem of tests randomly failing due to unlucky RNG draws. There are pros and cons to this, of course; I find that it tends to help in CI settings.
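
For reference, a minimal sketch of what such a conftest.py could look like (the fixture name and seed value here are illustrative, not necessarily what the PR uses):

```python
# conftest.py -- sketch only; fixture name and seed value are illustrative
import random

import numpy as np
import pytest


@pytest.fixture(autouse=True)
def _seed_rngs():
    # Seed the stdlib and NumPy global RNGs before every test so the
    # whole suite produces the same draws on every run and CI machine.
    random.seed(20230526)
    np.random.seed(20230526)
    yield
```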

wrongu avatar May 26 '23 20:05 wrongu

We need to work on this issue to make CI less of a hassle, but I'm not convinced that fixing a seed is the approach we should take.

WardBrian avatar May 26 '23 20:05 WardBrian

Codecov Report

Merging #31 (ccee362) into main (f845628) will not change coverage. The diff coverage is n/a.

@@           Coverage Diff           @@
##             main      #31   +/-   ##
=======================================
  Coverage   95.76%   95.76%           
=======================================
  Files           9        9           
  Lines         260      260           
=======================================
  Hits          249      249           
  Misses         11       11           


codecov-commenter avatar May 26 '23 20:05 codecov-commenter

This could alternatively be configured to run the suite twice, once with a fixed seed and once without. Or n times with n different seeds, or any further combination; see the sketch below for one way to do that.

Ultimately up to you, of course. Just a suggestion.
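
As a sketch of the n-seeds variant (the seed list and fixture name here are arbitrary, just for illustration), a parametrized autouse fixture would run every test once per seed, plus once unseeded:

```python
# conftest.py variant -- sketch only; runs each test under several seeds
import random

import numpy as np
import pytest


@pytest.fixture(autouse=True, params=[None, 1, 2, 3])
def rng_seed(request):
    # A None param leaves the RNGs unseeded; integer params give fixed seeds.
    seed = request.param
    if seed is not None:
        random.seed(seed)
        np.random.seed(seed)
    return seed
```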

wrongu avatar May 26 '23 21:05 wrongu

I didn't see a related issue this solves, but the PR indicates it's failures for some RNG seeds. So the obvious question is why there are failures. Is it just too strict a test interval, like testing that 95% intervals cover the true values, which causes failures 5% of the time? In that case, we can (a) widen the test interval, (b) do repeated tests (if we do 20 tests, we expect a Binomial(20, 0.05) number of failures and can test that expectation), or (c) fix seeds.

The downside of (a) is that it reduces test power, the downside of (b) is a lot more computation, and the downside of (c) is brittle tests. I don't think there's an optimal solution, so we just need to pick something.
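
For concreteness, option (b) might look something like the following sketch, where a toy interval check stands in for a real sampler test (scipy is used here only for the binomial quantile):

```python
# Sketch of option (b): repeat a stochastic check and compare the failure
# count against its Binomial expectation. The toy check is illustrative only.
import numpy as np
from scipy.stats import binom


def interval_covers_truth(rng: np.random.Generator) -> bool:
    # Does a 95% interval around the sample mean cover the true mean (0.0)?
    true_mean, n = 0.0, 100
    draws = rng.normal(true_mean, 1.0, size=n)
    half_width = 1.96 * draws.std(ddof=1) / np.sqrt(n)
    return abs(draws.mean() - true_mean) <= half_width


def test_coverage_rate():
    n_reps, miss_rate = 20, 0.05
    rng = np.random.default_rng()  # deliberately unseeded
    misses = sum(not interval_covers_truth(rng) for _ in range(n_reps))
    # Tolerate up to the 99.9th percentile of Binomial(20, 0.05) misses, so the
    # meta-test itself fails by bad luck only about 0.1% of the time.
    assert misses <= binom.ppf(0.999, n_reps, miss_rate)
```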

bob-carpenter avatar May 30 '23 18:05 bob-carpenter

Yes, the tests for e.g. HMC have tolerances tight enough that they still fail on some runs.

Our current test suite runs in ~12 seconds, so we have a lot of headroom to run more iterations/tests, but we need to keep an eye on this as the number of algorithms increases. At the moment, for the purposes of reviewing, I've been doing a sort of hand-wavey equivalent of (b) by checking that only one or two test runs failed out of the 12 we do in our matrix of Python versions and OSes.

WardBrian avatar May 30 '23 18:05 WardBrian

I recently stumbled on https://github.com/pytest-dev/pytest-rerunfailures, which could be another approach we use (maybe only as a stopgap).
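
For example (test name hypothetical), it can be enabled per test with a marker, or globally via the `--reruns` command-line flag:

```python
# Sketch of applying pytest-rerunfailures to a known-flaky test.
import pytest


@pytest.mark.flaky(reruns=2)  # retry this test up to twice before reporting failure
def test_hmc_posterior_mean():
    ...  # placeholder for the existing stochastic assertion
```

Running `pytest --reruns 2` would instead apply the same retry policy to the whole suite.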

WardBrian avatar Nov 16 '23 15:11 WardBrian

I see the appeal of pytest-rerunfailures, especially as a stopgap.

Brian's "hand-wavey" strategy seems good to me, especially if we just couch it in stats-speak. Modify Bob's suggestion (b) to test for 90% coverage (or some such number not too much smaller than 95%), e.g. at least 18 of 20 repeated tests pass, instead of boolean testing the expectation. I think this would allow more wiggle room while still protecting test power.

Then use pytest-rerunfailures for maybe one or two reruns, in case we didn't build in enough wiggle room and find ourselves frustrated by the tests.

roualdes avatar Nov 16 '23 17:11 roualdes