CI: on-the-fly data generation for regression test determinism
Generate test data on the same machine that runs regression tests.
PyTorch is not bit-for-bit reproducible across CPU architectures/families, PyTorch releases, and platforms (see https://pytorch.org/docs/stable/notes/randomness.html):
> Completely reproducible results are not guaranteed across PyTorch releases, individual commits, or different platforms. Furthermore, results may not be reproducible between CPU and GPU executions, even when using identical seeds.
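For illustration, even a fully seeded run with deterministic algorithms enabled is only reproducible within one platform and PyTorch build; a minimal sketch (the checksum is stable on one machine, but not across CPU architectures or PyTorch releases):

```python
import torch

# Seed all PyTorch RNGs and opt into deterministic algorithms where available.
torch.manual_seed(0)
torch.use_deterministic_algorithms(True)

x = torch.randn(1, 3, 224, 224)        # fixed input for this build
w = torch.randn(8, 3, 3, 3)
y = torch.nn.functional.conv2d(x, w)

# Stable for one platform + PyTorch release; may differ across CPU
# architectures, PyTorch versions, or CPU vs. GPU execution.
print(y.sum().item())
```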
This PR ensures tests are run in the same environment in which the reference data was generated, making results reproducible. The workflow:
- check out a specific reference tag or commit (currently v2.7.0)
- gather the model names from the test suite that would run on this job (the suite is split across parallel CI jobs)
- generate reference test data for those configs
- check out the commit that triggered the workflow
- run the tests, comparing outputs between the reference revision and the trigger commit (sketched below)
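A rough sketch of the job in Python (`REFERENCE_REV`, `TRIGGER_SHA`, the pytest selection, and the data-generation step are placeholders, not the actual ci.yml contents):

```python
import subprocess

REFERENCE_REV = "v2.7.0"  # hard-coded in the workflow; bump on new releases
TRIGGER_SHA = "abc1234"   # placeholder for the commit that triggered the workflow

def sh(*cmd: str) -> None:
    """Run a command and raise if it fails."""
    subprocess.run(cmd, check=True)

# 1. Generate reference data at the pinned revision.
sh("git", "checkout", REFERENCE_REV)
sh("python", "-m", "pytest", "--collect-only", "-q")  # gather tests/models for this job
# ... generate reference outputs for the collected configs ...

# 2. Switch to the triggering commit and compare against that data.
sh("git", "checkout", TRIGGER_SHA)
sh("python", "-m", "pytest", "-k", "regression")      # illustrative test selection
```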
Comparisons are now exact and run without any tolerance, so any change in behavior is detected.
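Concretely, the comparison can use bit-for-bit equality rather than a tolerance; a minimal sketch of the idea (the helper name and message are illustrative, not the suite's actual code):

```python
import torch

def assert_outputs_match(reference: torch.Tensor, current: torch.Tensor) -> None:
    # Same machine, same environment: exact equality is expected.
    assert torch.equal(reference, current), "model output changed"

    # A tolerance-based check, as was needed before, could mask small
    # behavioral changes:
    # assert torch.allclose(reference, current, atol=1e-5)
```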
Test data is no longer generated manually and no longer resides in the git history.
Testing against two arbitrary commits or tags can be triggered manually (by collaborators/repo owners) using workflow_dispatch.
If not triggered manually with specific revisions, the reference commit is hard-coded in the workflow file (currently v2.7.0 + pretrained_hf, the earliest commit compatible with this PR) and should be updated with new releases.
This also means that new configs are only tested once they are included in a release or commit at or before the configured reference revision.
~~For backward/forward compatibility, some lines should be changed once these changes are in a release; see the # TODO comments in ci.yml.~~
fixes #245
Nice, I like it.
2 points:
- Can you update the testing instructions in the README?
- Is there a way to test locally after this change?
I'll update the README. Testing still works the same locally.
I also need to make a few changes to make regression tests easier to run locally; right now it's a bit complicated, with git juggling, copying and restoring test data, etc. Should be done in a few hours.
Got some weird behaviour from git on the runner, still investigating
sorry for the close/open, buttons are too close together :cry:
@rom1504 cleaned it up, the workflow is less convoluted now. The README is updated with regression test instructions, and util_test.py can now generate test data from arbitrary commits after v2.7.0 (see the sketch below).
Would be nice if you could try this locally to check that the instructions are understandable and that it works for more than just me and the runner.
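Roughly, the local flow could look like the sketch below (the actual util_test.py interface may differ; the revision juggling is the point):

```python
import subprocess

def regenerate_reference_data(reference_rev: str = "v2.7.0") -> None:
    """Generate regression test data from an arbitrary post-v2.7.0 revision."""
    head = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    subprocess.run(["git", "checkout", reference_rev], check=True)
    try:
        # Hypothetical invocation; writes reference outputs to disk.
        subprocess.run(["python", "util_test.py"], check=True)
    finally:
        # Always return to the original commit, even if generation fails.
        subprocess.run(["git", "checkout", head], check=True)
```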
@lopho looks good, but tests are failing now. What could be wrong?
I did not consider the patch version of Python changing on the GH runner, which invalidated the cached environment. I've changed the caching to key on the full version (e.g. 3.8.15 instead of 3.8).
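For reference, the distinction the fix relies on (the actual cache key lives in the workflow file; this snippet just shows the two version strings):

```python
import platform

full = platform.python_version()        # e.g. "3.8.15" — changes when the runner updates
minor = ".".join(full.split(".")[:2])   # e.g. "3.8"    — stable across patch releases

# Keying the cache on the full version invalidates the cached environment
# whenever the GitHub runner bumps the Python patch release.
cache_key = f"venv-{full}"              # hypothetical key format
```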
I think it's fine now.
Indeed, let's go