Review and fix flaky tests
Bug description
Since #534, the random number generation convention across the entire codebase has been redesigned, and by default all tests have been changed to not use fixed seeds but to draw on internal sources of entropy instead, unless they are specifically testing for seed consistency. Some of the tests have been exposed as flaky as a result, and this requires either fixing the seeds to a value that always works (not ideal) or improving the test design to reduce the flakiness to an acceptable fraction, e.g. 10^-3.
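As a minimal sketch of that convention (assuming NumPy's Generator API; this is illustrative, not the repo's actual seeding helper): draw a fresh seed from OS entropy on every run, but surface it so any failure can be reproduced afterwards.

import numpy as np

# Illustrative only: fresh entropy each run, logged for reproducibility.
seed = np.random.SeedSequence().entropy   # pulled from the OS each run
print(f"seed for this run: {seed}")       # lands in the pytest log
rng = np.random.default_rng(seed)
print(rng.normal(size=3))                 # differs from run to run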
A specific test that seems to be flaky is test_trainer_makes_progress; specifically, the assertion at
https://github.com/HumanCompatibleAI/imitation/blob/312b091d550c11076d40d056d5c699fee965c2df/tests/algorithms/test_dagger.py#L302
fails fairly frequently when developing locally.
Steps to reproduce
run
$ pytest -n 10 tests/algorithms/test_dagger.py::test_trainer_makes_progress --flake-finder --flake-runs=100
to get 38% flakiness:
=========== test session starts ===========
platform darwin -- Python 3.8.13, pytest-7.1.2, pluggy-1.0.0
rootdir: /Users/rocamonde/git/HumanCompatibleAI/imitation, configfile: setup.cfg
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, pytest_notebook-0.8.0, hypothesis-6.54.4, flakefinder-1.0.0, cov-3.0.0
gw0 [800] / gw1 [800] / gw2 [800] / gw3 [800] / gw4 [800] / gw5 [800] / gw6 [800] / gw7 [800] / gw8 [800] / gw9 [800]
sssss.......FF.ss......ss.ss.ss.ssF......F...F....ss....ss.ss...ss...ss............ss.s.sss...........ss.ss.......ss.ss........ss........ss.ssF.sss...s....s...sF....ss........s.....F..s.Fs.......s..s...............s.s..s..s...s.s...s.s.F.s.........s...F.s.s.F..s...s.............s.........s..s....s..s... [ 38%]
Environment
- Operating system and version: macOS 12.4 Monterey
- Python version: Python 3.8.18
38% is a lot; I'd expect to see errors happening frequently in CI. Does this happen even on the master branch?
- I think the "s" is not a failure but a skip. It's strange that it randomly skips the test, but I guess it checks whether the random parametrization is invalid. You're right that the 38% figure includes the skips. Removing those, I get 12/304 (3.9%) failures. Still bad, but not as bad as I originally stated.
- Yes, that's on the master branch.
Ah, 3.9% failures is believable; we wouldn't have noticed. Agreed we should fix this, and we probably don't want it to be randomly skipped either...
Is this related to tests failing on master? I notice that in the past few days there are usually a couple of unit tests failing on Linux, and right now it looks like a different test is failing than was failing yesterday (IIRC).
Yes, AFAIK this is the reason. We made almost all tests that previously had fixed seeds now be random (the exceptions are those that literally test whether fixing the seeds leads to the same results), so there is some probability they will fail. Ways to fix this are:
- Increase the number of draws to bring the failure probability exponentially down to some acceptable level (preferred; see the worked example after this list)
- Fix the seed for tests that fail too often, if it would take too much compute to do the above (discouraged; probably better to rewrite the test)
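To illustrate the first option with a hedged, self-contained example (the distribution, threshold, and counts here are assumptions for illustration, not values from the test suite): if a test asserts that a sample mean exceeds a threshold, averaging over more draws shrinks the noise, so a spurious threshold violation becomes exponentially less likely.

import numpy as np

# Hedged illustration: the assertion "sample mean > threshold" fails
# spuriously less and less often as more draws are averaged.
rng = np.random.default_rng()
true_mean, threshold, trials = 1.0, 0.5, 100_000
for n_draws in (1, 4, 16, 64):
    means = rng.normal(true_mean, 1.0, size=(trials, n_draws)).mean(axis=1)
    rate = (means < threshold).mean()
    print(f"n_draws={n_draws:3d}: spurious failure rate ~ {rate:.1e}")
# Typical output: ~3e-1, ~2e-1, ~2e-2, ~3e-5 -- i.e. under the 1e-3
# target from the bug description once enough draws are averaged.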
I plan to do some flakiness benchmarking this week, but if anyone has more time they are more than welcome to take a look sooner.
Flakiness benchmarking would be great -- thanks for looking into this @Rocamonde!
Agreed, let's try to avoid fixing the seed. In some cases we can instead reduce the threshold required for the test to pass, e.g. the extent to which progress must improve, and IMO that's better than fixing a seed.
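A minimal sketch of that trade-off (the reward distributions and the margin are hypothetical, not the actual values in test_trainer_makes_progress):

import numpy as np

# Hypothetical numbers: noisy pre/post-training reward estimates where
# training genuinely improves the mean reward by 1.0.
rng = np.random.default_rng()
trials = 200_000
pre = rng.normal(0.0, 0.5, size=trials)    # reward before training
post = rng.normal(1.0, 0.5, size=trials)   # reward after training
strict_fail = (post <= pre).mean()         # assert post > pre
margin_fail = (post <= pre - 1.5).mean()   # assert post > pre - margin
print(f"strict: {strict_fail:.1e}, with margin: {margin_fail:.1e}")
# Roughly 8e-2 vs 2e-4: the margin trades some sensitivity (a trainer
# that merely fails to improve would usually pass) for stability, while
# still usually failing if training makes the policy substantially worse.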
@Rocamonde did you happen to get around to doing any flakiness benchmarking? I'm going to try and address some of the flaky tests over the next few days, just wanted to see if you had any suggestions for where to start.
I know he made a start in https://github.com/HumanCompatibleAI/imitation/pull/584 but I think those were somewhat confounded (e.g. resource constraints from parallelism causing not usually flaky tests to also fail).
Sorry, I never got around to finishing this! Please take a look at my draft PR and feel free to ask any questions.
A quick 80/20 fix for now would be: go to CircleCI, check the flakiest tests, confirm that they are flaky by running them locally (it could be that they break often when we change code but are not actually flaky, all else being equal), then try to fix them. If other tests have not shown up as flaky by now, they probably aren't.
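For that local confirmation step, one option (a suggestion, not something prescribed in the thread) is to rerun the flake finder from the reproduction above without the xdist -n flag, to help rule out the parallelism confound mentioned earlier:

$ pytest tests/algorithms/test_dagger.py::test_trainer_makes_progress --flake-finder --flake-runs=100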
However, my PR was more ambitious and tried to detect any flaky tests from a clean slate. This is useful as a CI feature to make sure people don't add flaky tests in the future, but fixing the current ones should probably be the priority if getting the parallelism to work is not easily doable.
I believe the outstanding flaky tests have been addressed. Please open a new issue for any specific flakiness discovered.