torchtune
[regression test] make nightlies and stable independent from each other
Context
What is the purpose of this PR? Is it to
- [ ] add a new feature
- [x] fix a bug
- [ ] update tests and/or documentation
- [ ] other (please add here)
Currently, when the nightly regression tests break, the stable regression tests get cancelled as well (https://github.com/pytorch/torchtune/actions/runs/10412450496/job/28838240610).
Changelog
The GPT gods told me that if I add {{matrix}} to the concurrency group, then one flow failing won't affect the other.
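For illustration, a minimal sketch of what adding the matrix value to a job-level concurrency group could look like. The job name, the `torch-version` matrix key, and the group prefix are all hypothetical and not taken from this PR's diff; whether this alone prevents the cancellation is not confirmed here.

```yaml
# Hypothetical workflow fragment: job name, matrix key, and group prefix are assumptions.
jobs:
  regression_test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        torch-version: ["stable", "nightly"]
    # Job-level concurrency can read the matrix context, so each matrix
    # entry (stable, nightly) ends up in its own concurrency group.
    concurrency:
      group: regression-test-${{ github.workflow }}-${{ github.ref }}-${{ matrix.torch-version }}
      cancel-in-progress: true
    steps:
      - run: echo "Running regression tests against ${{ matrix.torch-version }}"
```

Note that the `matrix` context is only available in `jobs.<job_id>.concurrency`, not in a workflow-level `concurrency` block.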
Test plan
CI
:link: Helpful Links
:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1356
- :page_facing_up: Preview Python docs built from this PR
Note: Links to docs will display an error until the docs builds have been completed.
:white_check_mark: No Failures
As of commit 86b6300faa36e6d4d6ae64a7b84c6c14d34976b9 with merge base 67f6a06c5fa0183eacda3dfe0bcd41c4a4c9f480:
:green_heart: Looks good so far! There are no failures yet. :green_heart:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 72.20%. Comparing base (67f6a06) to head (86b6300).
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1356       +/-   ##
===========================================
+ Coverage   27.41%   72.20%   +44.78%
===========================================
  Files         269      269
  Lines       12598    12598
===========================================
+ Hits         3454     9096     +5642
+ Misses       9144     3502     -5642
Hi @felipemello1, try adding fail-fast: false under strategy, alongside the matrix.
Here is the example from the builder repo: https://github.com/pytorch/builder/blob/main/.github/workflows/validate-linux-binaries.yml#L133

    linux:
      needs: generate-linux-matrix
      strategy:
        matrix: ${{ fromJson(needs.generate-linux-matrix.outputs.matrix) }}
        fail-fast: false
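Applied to a stable/nightly matrix, the same setting might look roughly like the sketch below. The job name and the `torch-version` key are hypothetical, not the actual torchtune workflow. By default `fail-fast` is true, which means a failure in any matrix job cancels the in-progress and queued jobs of that matrix; setting it to false lets each entry run to completion on its own.

```yaml
# Hypothetical workflow fragment: job name and matrix key are assumptions.
jobs:
  regression_test:
    runs-on: ubuntu-latest
    strategy:
      # Keep the remaining matrix jobs running when one of them fails.
      fail-fast: false
      matrix:
        torch-version: ["stable", "nightly"]
    steps:
      - run: echo "Running regression tests against ${{ matrix.torch-version }}"
```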
I think relying on CI for testing here won't be sufficient. You may need to hack in a trigger on pull request to get this to actually be tested in CI on your PR (then remove it when ready to land). Note that you will also need to hack the test itself to not actually load the checkpoint from S3, since your fork will not have the requisite permissions (happy to provide a pointer here if necessary). Also, did you try Andrey's suggestion?
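A rough sketch of the temporary trigger, assuming the workflow normally runs on a schedule and on manual dispatch (the cron expression is a placeholder, not the real one):

```yaml
# Hypothetical trigger block: the existing triggers and cron value are assumptions.
on:
  schedule:
    - cron: "0 0 * * *"
  workflow_dispatch:
  # Temporary: lets the workflow run on this PR. Remove before landing.
  pull_request:
```

This only covers the trigger; the S3 checkpoint download would still need to be stubbed out separately for runs from a fork.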
closed in favor of https://github.com/pytorch/torchtune/pull/1413#event-14007317860