CI: on-the-fly data generation for regression test determinism
Generate test data on the same machine that runs regression tests.
PyTorch is not bit-for-bit reproducible across CPU architectures/families, PyTorch releases, and platforms (see https://pytorch.org/docs/stable/notes/randomness.html):
> Completely reproducible results are not guaranteed across PyTorch releases, individual commits, or different platforms. Furthermore, results may not be reproducible between CPU and GPU executions, even when using identical seeds.
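For illustration, even a fully seeded run with deterministic algorithms enabled is only reproducible within one platform and PyTorch build; a minimal sketch (the checksum is stable on one machine, but not across CPU architectures or PyTorch releases):

```python
import torch

# Seed all PyTorch RNGs and opt into deterministic algorithms where available.
torch.manual_seed(0)
torch.use_deterministic_algorithms(True)

x = torch.randn(1, 3, 224, 224)        # fixed input for this build
w = torch.randn(8, 3, 3, 3)
y = torch.nn.functional.conv2d(x, w)

# Stable for one platform + PyTorch release; may differ across CPU
# architectures, PyTorch versions, or CPU vs. GPU execution.
print(y.sum().item())
```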
This PR ensures tests are run in the same environment in which the reference data was generated, making results reproducible. The workflow:
- check out a specific reference tag or commit (currently v2.7.0)
- gather the model names from the test suite that would run on this job (the suite is split across parallel CI jobs)
- generate reference test data for those configs
- check out the commit that triggered the workflow
- run the tests, comparing outputs between the reference revision and the trigger commit (sketched below)
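A rough sketch of the job in Python (`REFERENCE_REV`, `TRIGGER_SHA`, the pytest selection, and the data-generation step are placeholders, not the actual ci.yml contents):

```python
import subprocess

REFERENCE_REV = "v2.7.0"  # hard-coded in the workflow; bump on new releases
TRIGGER_SHA = "abc1234"   # placeholder for the commit that triggered the workflow

def sh(*cmd: str) -> None:
    """Run a command and raise if it fails."""
    subprocess.run(cmd, check=True)

# 1. Generate reference data at the pinned revision.
sh("git", "checkout", REFERENCE_REV)
sh("python", "-m", "pytest", "--collect-only", "-q")  # gather tests/models for this job
# ... generate reference outputs for the collected configs ...

# 2. Switch to the triggering commit and compare against that data.
sh("git", "checkout", TRIGGER_SHA)
sh("python", "-m", "pytest", "-k", "regression")      # illustrative test selection
```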
Comparisons are now exact and run without any tolerance, so any change in behavior is detected.
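Concretely, the comparison can use bit-for-bit equality rather than a tolerance; a minimal sketch of the idea (the helper name and message are illustrative, not the suite's actual code):

```python
import torch

def assert_outputs_match(reference: torch.Tensor, current: torch.Tensor) -> None:
    # Same machine, same environment: exact equality is expected.
    assert torch.equal(reference, current), "model output changed"

    # A tolerance-based check, as was needed before, could mask small
    # behavioral changes:
    # assert torch.allclose(reference, current, atol=1e-5)
```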
Test data is no longer generated manually and no longer resides in the git history.
Testing against two arbitrary commits or tags can be triggered manually (by collaborators/repo owners) using workflow_dispatch.
If not triggered manually with specific revisions, the reference commit is hard-coded in the workflow file (currently v2.7.0 + pretrained_hf, the earliest commit compatible with this PR) and should be updated with new releases.
This also means that new configs are only tested once they are included in a release or commit at or before the configured reference revision.
~~For backward/forward compatibility, some lines should be changed once these changes are in a release; see the # TODO comments in ci.yml.~~
fixes #245
Nice, I like it.
2 points:
- Can you update the testing instructions in the README?
- Is there a way to test locally after this change?
I'll update the README. Testing still works the same locally.
I also need to make a few changes to make regression tests easier to run locally; right now it's a bit complicated, with git juggling, copying and restoring test data, etc. Should be done in a few hours.
Got some weird behaviour from git on the runner, still investigating
sorry for the close/open, buttons are too close together :cry:
@rom1504 cleaned it up, the workflow is less convoluted now. The README is updated with regression test instructions, and util_test.py can now generate test data from arbitrary commits after v2.7.0 (see the sketch below).
Would be nice if you could try this locally to check that the instructions are understandable and that it works for more than just me and the runner.
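Roughly, the local flow could look like the sketch below (the actual util_test.py interface may differ; the revision juggling is the point):

```python
import subprocess

def regenerate_reference_data(reference_rev: str = "v2.7.0") -> None:
    """Generate regression test data from an arbitrary post-v2.7.0 revision."""
    head = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    subprocess.run(["git", "checkout", reference_rev], check=True)
    try:
        # Hypothetical invocation; writes reference outputs to disk.
        subprocess.run(["python", "util_test.py"], check=True)
    finally:
        # Always return to the original commit, even if generation fails.
        subprocess.run(["git", "checkout", head], check=True)
```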
@lopho looks good, but tests are failing now. What could be wrong?
I did not consider the patch version of Python changing on the GH runner, which invalidated the cached environment. I've changed the caching to key on the full version (e.g. 3.8.15 instead of 3.8).
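For reference, the distinction the fix relies on (the actual cache key lives in the workflow file; this snippet just shows the two version strings):

```python
import platform

full = platform.python_version()        # e.g. "3.8.15" — changes when the runner updates
minor = ".".join(full.split(".")[:2])   # e.g. "3.8"    — stable across patch releases

# Keying the cache on the full version invalidates the cached environment
# whenever the GitHub runner bumps the Python patch release.
cache_key = f"venv-{full}"              # hypothetical key format
```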
I think it's fine now.
Indeed, let's go