
[regression test] make nightlies and stable independent from each other

Open felipemello1 opened this issue 1 year ago • 4 comments

Context

What is the purpose of this PR? Is it to

  • [ ] add a new feature
  • [x] fix a bug
  • [ ] update tests and/or documentation
  • [ ] other (please add here)

Currently, when the nightly regression tests break, the stable regression tests are cancelled as well (https://github.com/pytorch/torchtune/actions/runs/10412450496/job/28838240610).

Changelog

The GPT gods told me that if I add {{matrix}} to the group, then one flow failing won't affect the other.
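For illustration, here is a minimal sketch of what adding the matrix to the group might look like, assuming "the group" means a job-level concurrency group in the GitHub Actions workflow. The job name, the matrix key torch-version, and its values are hypothetical and are not taken from the torchtune workflow.

```yaml
# Hypothetical workflow fragment; all names are illustrative assumptions.
jobs:
  regression_test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        torch-version: ["stable", "nightly"]
    concurrency:
      # Including a matrix value in the group key gives each matrix entry
      # its own concurrency group, so a newer run only supersedes the
      # older run of the same variant.
      group: ${{ github.workflow }}-${{ github.ref }}-${{ matrix.torch-version }}
      cancel-in-progress: true
    steps:
      - run: echo "Testing against ${{ matrix.torch-version }}"
```

Note that a concurrency group only decides which in-progress runs get superseded. Whether a failure in one matrix job cancels its siblings is governed by the strategy's fail-fast setting.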

Test plan

CI

felipemello1, Aug 16 '24 19:08

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1356

Note: Links to docs will display an error until the docs builds have been completed.

:white_check_mark: No Failures

As of commit 86b6300faa36e6d4d6ae64a7b84c6c14d34976b9 with merge base 67f6a06c5fa0183eacda3dfe0bcd41c4a4c9f480: :green_heart: Looks good so far! There are no failures yet. :green_heart:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot], Aug 16 '24 19:08

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 72.20%. Comparing base (67f6a06) to head (86b6300).

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1356       +/-   ##
===========================================
+ Coverage   27.41%   72.20%   +44.78%     
===========================================
  Files         269      269               
  Lines       12598    12598               
===========================================
+ Hits         3454     9096     +5642     
+ Misses       9144     3502     -5642     

:umbrella: View full report in Codecov by Sentry.

codecov-commenter, Aug 16 '24 19:08

Hi @felipemello1, try adding fail-fast: false under strategy, next to the matrix. Here is an example from the builder repo: https://github.com/pytorch/builder/blob/main/.github/workflows/validate-linux-binaries.yml#L133

    linux:
      needs: generate-linux-matrix
      strategy:
        matrix: ${{ fromJson(needs.generate-linux-matrix.outputs.matrix) }}
        fail-fast: false
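
For context, fail-fast defaults to true, in which case GitHub cancels all in-progress and queued jobs in the matrix as soon as any one of them fails. Below is a minimal sketch of how the setting might be applied to a two-variant regression job. The workflow name, job name, matrix key, and step are assumptions made for illustration and do not reflect the actual torchtune or builder configuration.

```yaml
# Hypothetical workflow; all names are illustrative assumptions.
name: Regression Test (sketch)
on:
  workflow_dispatch:
jobs:
  regression_test:
    runs-on: ubuntu-latest
    strategy:
      # With fail-fast disabled, a failing "nightly" job no longer
      # cancels the "stable" job (and vice versa).
      fail-fast: false
      matrix:
        torch-version: ["stable", "nightly"]
    steps:
      - uses: actions/checkout@v4
      - name: Run regression tests (placeholder command)
        run: echo "Running tests against ${{ matrix.torch-version }}"
```

With this setting each matrix job reports its own result, so a broken nightly shows up as one failed job next to a passing stable job instead of one failure and one cancellation.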

atalman, Aug 17 '24 02:08

I think relying on CI for testing here won't be sufficient. You may need to hack in a trigger on pull request to get this to actually be tested in CI on your PR (then remove it when ready to land). Note that you will also need to hack the test itself to not actually load the checkpoint from S3, since your fork will not have the requisite permissions (happy to provide a pointer here if necessary). Also, did you try Andrey's suggestion?
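
A rough sketch of the temporary trigger described above, assuming the regression workflow normally runs on a schedule and on manual dispatch. The cron expression and the existing triggers are hypothetical, not copied from the torchtune workflow.

```yaml
# Hypothetical trigger block; the schedule is an illustrative assumption.
on:
  schedule:
    - cron: "0 9 * * *"
  workflow_dispatch:
  # TEMPORARY: also run on pull requests so the workflow change is
  # exercised on this PR. Remove before landing.
  pull_request:
```

The S3 checkpoint workaround is not sketched here, since it depends on how the test loads its checkpoint.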

ebsmothers, Aug 20 '24 16:08

Closed in favor of https://github.com/pytorch/torchtune/pull/1413#event-14007317860

felipemello1, Aug 26 '24 14:08