Test examples sometimes time out
The current behavior
It appears that test_run_example_notebooks sometimes times out during CI checks.
Expected behavior
The tests should not be timing out.
This should be fairly easy to solve by simply increasing the timeout from 120 to 240.
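For concreteness, here is a minimal sketch of what that change might look like, assuming the test executes each notebook with nbclient and passes the limit as a per-cell timeout; the constant and helper names are illustrative, not the repo's actual code:

```python
# Hypothetical sketch, assuming the example-notebook test runs notebooks via
# nbclient with a per-cell timeout; names are illustrative, not imitation's code.
import nbformat
from nbclient import NotebookClient

NOTEBOOK_TIMEOUT_SECONDS = 240  # proposed bump from the current 120


def run_notebook(path: str) -> None:
    """Execute a notebook, raising if any single cell exceeds the timeout."""
    nb = nbformat.read(path, as_version=4)
    NotebookClient(nb, timeout=NOTEBOOK_TIMEOUT_SECONDS).execute()
```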
I'd like to understand why the tests are timing out in the first place. 120s is a long time for a unit test to be running! If the notebooks really do take that long to run, we should consider making them faster (e.g. reducing the default number of timesteps). If they're much faster locally, then are we running tests with too much contention (should we use a larger CircleCI executor? back off on parallelism?), or is there something else causing them to be slow?
I'd rather not just bump the timeout to 240s (4m!), but we can do it if the tests taking that long really is working-as-intended; it definitely beats flaky tests.
- In particular, do we know which cells are taking so long to run? (One way to check this is sketched after this list.)
- In general, it seems like CI tests take way longer to run on macOS, suggesting something's going on there. Example from a recent run I did of the CNN reward function branch:
  - unit-test-linux: 4 min 41 sec (3 min 10 sec running tests)
  - unit-test-windows: 7 min 21 sec (1 min 27 sec restoring cache, 47 sec installing python and binary dependencies, 4 min 16 sec running tests)
  - unit-test-macos: 19 min 36 sec (8 min 48 sec installing macos packages, 9 min 25 sec running tests)
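One low-effort way to answer the first question (which cells are slow) would be to run an example notebook under papermill, which records a per-cell duration in the executed notebook's metadata. A rough sketch, assuming papermill is installed; the notebook path is just a placeholder:

```python
# Rough sketch: report the slowest cells of one example notebook using papermill,
# which stores per-cell execution times in cell.metadata["papermill"]["duration"].
import papermill as pm

executed = pm.execute_notebook(
    "examples/quickstart.ipynb",  # placeholder path, not a real file in the repo
    "/tmp/executed.ipynb",
    kernel_name="python3",
)

durations = [
    (cell.metadata.get("papermill", {}).get("duration") or 0.0, idx)
    for idx, cell in enumerate(executed.cells)
    if cell.cell_type == "code"
]
for seconds, idx in sorted(durations, reverse=True)[:5]:
    print(f"cell {idx}: {seconds:.1f}s")
```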
I'm getting 2534 passed, 588 skipped, 4697 warnings in 126.70s (0:02:06) on my M1 Max when limiting the number of cores to 4 (I think that's how many the medium macOS executor has), vs ~9 minutes on CircleCI.
- Am I skipping too many tests locally, such that the skipped ones account for most of the runtime on macOS? (I don't have some optional packages installed locally.)
- I cannot (AFAIK, in a simple way) limit the RAM that gets used when running pytest. Is there a way to measure the peak RAM that a parent process and its children are using? (One approach is sketched below.) If the RAM fills up, having to swap to disk is probably quite slow. I would be surprised if this were the reason, but maybe that's what's happening?
(test_examples is taking 32.89s).
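On the peak-RAM question: one approach that should work on both Linux and macOS is to launch pytest via psutil and poll the whole process tree, summing RSS over the parent and its xdist workers. This only gives a sampled peak, not a true high-water mark, and psutil plus the pytest arguments below are assumptions on my part:

```python
# Rough sketch: sampled peak RSS across a pytest run and all of its children.
# Assumes psutil is installed; the pytest arguments are illustrative.
import time
import psutil

proc = psutil.Popen(["pytest", "tests/test_examples.py", "-n", "4"])
peak_bytes = 0
while proc.poll() is None:
    rss = proc.memory_info().rss
    for child in proc.children(recursive=True):
        try:
            rss += child.memory_info().rss
        except psutil.NoSuchProcess:
            pass  # a worker exited between listing and sampling
    peak_bytes = max(peak_bytes, rss)
    time.sleep(0.5)

print(f"sampled peak RSS (parent + children): {peak_bytes / 2**30:.2f} GiB")
```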
To be clear, the linked example is from a branch where I'm testing some Atari environments; I expect CI tests on master to be a bit quicker (but I'd be surprised if the difference went away). I'll edit the comment above to make that clearer.
Given the macOS executor is medium and the others are x-large, it's not that shocking it takes longer. Unfortunately macOS executors only go up to large, and that size isn't available to us without upgrading to an annual plan. So we might either have to put up with a longer run time on it, or run multiple macOS executors in parallel and shard the tests across them (fairly easy to set up).
@Rocamonde: We have a memory monitor script running in CircleCI on Linux. It would not be difficult to port it to Mac (it just needs some slightly different flags to ps, I think). You can probably limit RAM using ulimit; I can't find documentation on this for macOS specifically, but this (old) Super User issue suggests ulimit -m will work the same as on Linux/BSD.
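For the limiting side, the programmatic equivalent of ulimit -m from Python is resource.setrlimit with RLIMIT_RSS. Whether macOS actually enforces this limit is exactly the open question above, so the sketch below is an experiment rather than a known-good fix; the byte budget and the idea of calling it from conftest.py are my assumptions:

```python
# Heavily hedged sketch: cap resident set size for this process and anything it
# spawns (rlimits are inherited). macOS may or may not enforce RLIMIT_RSS.
import resource

GIB = 2**30


def limit_rss(max_bytes: int = 4 * GIB) -> None:
    _soft, hard = resource.getrlimit(resource.RLIMIT_RSS)
    resource.setrlimit(resource.RLIMIT_RSS, (max_bytes, hard))


# e.g. call limit_rss() early in conftest.py before the expensive tests start
```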
Your test output lines up with that of https://app.circleci.com/pipelines/github/HumanCompatibleAI/imitation/2230/workflows/2bafd8ad-ed9b-4f5a-b823-26f5ff5aa415/jobs/5984 I believe?
========= 2534 passed, 588 skipped, 4211 warnings in 488.78s (0:08:08) =========
The M1 Max is quite possibly just faster than whatever the CircleCI instance is running on, though 4x does seem like quite a discrepancy. Note that even with pytest -n 4 it may sneakily use more than 4 cores if some of the tests themselves do work in parallel, and most CPUs will also boost the clock frequency when only a small number of cores are active.
Just ran the flaky test 1000 times and got the following:
$ pytest tests/test_examples.py::test_run_example_notebooks --flake-finder --flake-runs=1000 -n 10 -vv
>>> [...] 1000 passed, 20311 warnings in 930.04s (0:15:30)
It seems that I'm not able to replicate the flakiness of the test locally. Of course, I am looking only at aggregates here (will check individual test times soon), but the average is already much lower than what we have on CircleCI.
Mysterious! It could be some interaction between different tests, either because of resource contention (worth checking the memory logs) or some kind of shared mutex (I can't think what, but third-party libraries sometimes do weird things; there used to be a race condition in mujoco-py that would occasionally cause things to hang indefinitely if two processes imported it at the same time before it had built the Cython extensions...).
The other option is to just test the "max_iter is too high" hypothesis: run for that many iterations and see whether it takes long enough to explain the timeouts. If so, we could just lower max_iter and see if the problem ever recurs on CI...
This seems to happen less often now, but I still occasionally see failures, e.g. in https://app.circleci.com/pipelines/github/HumanCompatibleAI/imitation/2611/workflows/79cdf535-e821-49fb-ad0f-763d7e719f3c/jobs/7836/tests#failed-test-0 from a month ago, so I'm leaving this open for now.