pytorch-lightning
Added support for flushing Comet experiment data to Comet after saving a checkpoint.
Hey, this is the PR from Comet's SDK engineer.
What does this PR do?
- Added support for flushing Comet experiment data to Comet after saving a checkpoint. This behavior is configurable through the `flush_every` parameter of the `CometLogger`.
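As a rough illustration of the behavior described above, here is a minimal, self-contained sketch. It does not use the real `CometLogger` or the Comet SDK: `FakeExperiment` and `SketchLogger` are hypothetical stand-ins. The description does not say which values `flush_every` accepts, so the sketch assumes an integer N meaning "flush after every N-th saved checkpoint", with `None` disabling the flush. The actual semantics in this PR may differ.

```python
from typing import Optional


class FakeExperiment:
    """Hypothetical stand-in for a Comet experiment; it only counts flush calls."""

    def __init__(self) -> None:
        self.flush_calls = 0

    def flush(self) -> None:
        self.flush_calls += 1


class SketchLogger:
    """Illustrative logger, NOT the real CometLogger."""

    def __init__(self, experiment: FakeExperiment, flush_every: Optional[int] = None) -> None:
        # Assumption: flush_every is a positive integer, or None to disable flushing.
        if flush_every is not None and flush_every < 1:
            raise ValueError("flush_every must be a positive integer or None")
        self.experiment = experiment
        self.flush_every = flush_every
        self._checkpoints_seen = 0

    def after_save_checkpoint(self, checkpoint_callback: object = None) -> None:
        # Called once per saved checkpoint; flush on every N-th call.
        self._checkpoints_seen += 1
        if self.flush_every is None:
            return
        if self._checkpoints_seen % self.flush_every == 0:
            self.experiment.flush()
```

With `flush_every=2` in this sketch, five checkpoint saves trigger a flush after the second and fourth save only.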
Fixes #20681
Before submitting
- Was this discussed/agreed via a GitHub issue? (not for typos and docs)
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Did you make sure your PR does only one thing, instead of bundling different changes together?
- Did you make sure to update the documentation with your changes? (if necessary)
- Did you write any new necessary tests? (not for typos and docs)
- [x] Did you verify new and existing tests pass locally with your changes?
- Did you list all the breaking changes introduced by this pull request?
- Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)
PR review
Anyone in the community is welcome to review the PR. Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:
Reviewer checklist
- [ ] Is this pull request ready for review? (if not, please submit in draft mode)
- [ ] Check that all items from Before submitting are resolved
- [ ] Make sure the title is self-explanatory and the description concisely explains the PR
- [ ] Add labels and milestones (and optionally projects) to the PR so it can be classified
📚 Documentation preview 📚: https://pytorch-lightning--20680.org.readthedocs.build/en/20680/
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 79%. Comparing base (df5dee6) to head (25c63f7).
:exclamation: There is a different number of reports uploaded between BASE (df5dee6) and HEAD (25c63f7). Click for more details.
HEAD has 599 uploads less than BASE
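| Flag | BASE (df5dee6) | HEAD (25c63f7) |
|---|---|---|
| cpu | 161 | 27 |
| python | 18 | 3 |
| lightning_fabric | 37 | 0 |
| pytest | 83 | 0 |
| python3.10 | 36 | 6 |
| lightning | 91 | 15 |
| python3.11 | 36 | 6 |
| python3.12.7 | 54 | 9 |
| python3.12 | 17 | 3 |
| gpu | 3 | 0 |
| pytorch2.1 | 18 | 6 |
| pytorch_lightning | 36 | 12 |
| pytest-full | 81 | 27 |
| pytorch2.3 | 9 | 3 |
| pytorch2.2.2 | 9 | 3 |
| pytorch2.5 | 9 | 3 |
| pytorch2.4.1 | 9 | 3 |
| pytorch2.7 | 9 | 3 |
| pytorch2.5.1 | 9 | 3 |
| pytorch2.6 | 9 | 3 |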
Additional details and impacted files
@@ Coverage Diff @@
## master #20680 +/- ##
=========================================
- Coverage 87% 79% -9%
=========================================
Files 268 265 -3
Lines 23449 23400 -49
=========================================
- Hits 20501 18392 -2109
- Misses 2948 5008 +2060
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://lightning.ai/docs/pytorch/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Discord. Thank you for your contributions.
Dear maintainers, please advise how to make this PR better. Thank you!
Hi @Borda! I'm also an SDK engineer at Comet. Can we please get a review for this PR? We have users waiting for these changes to be released.
> Can we please get a review for this PR? We have users waiting for these changes to be released.
hello, please reach out to @williamFalcon
Hi, I don't know why, but calling comet_expr.flush() in after_save_checkpoint with DDP causes `EOFError: Ran out of input` (raised from the ModelCheckpoint.file_exists method, which calls strategy.broadcast). Without the flush, my code works fine. The error occurs deterministically, depending on the setup (such as batch size). It occurs on RANK 1 and is raised after after_save_checkpoint, which is rank-zero-only and so returns immediately on that rank. Do you face the same problem?
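For readers unfamiliar with the "rank-zero-only return" mentioned above, here is a minimal sketch of the pattern. It is not Lightning's implementation: `rank_zero_only_sketch` and `run_hook_on` are hypothetical names, and the rank is passed in explicitly instead of being read from the distributed environment. It only shows that the hook body (here, a pretend flush) runs on rank 0 while other ranks skip it and move straight on to the next step. It does not attempt to diagnose the `EOFError`.

```python
import functools
from typing import Any, Callable, List, Optional


def rank_zero_only_sketch(rank: int) -> Callable:
    """Hypothetical decorator factory: run the wrapped function on rank 0 only."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Optional[Any]]:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Optional[Any]:
            if rank == 0:
                return fn(*args, **kwargs)
            return None  # non-zero ranks skip the body and return at once

        return wrapper

    return decorator


def run_hook_on(rank: int, events: List[str]) -> None:
    """Simulate one process: call the hook, then reach the next collective call."""

    @rank_zero_only_sketch(rank)
    def after_save_checkpoint() -> None:
        events.append(f"rank {rank}: flush")  # stands in for the experiment flush

    after_save_checkpoint()
    events.append(f"rank {rank}: next collective call")
```

In this sketch, rank 1 records only the "next collective call" event, because it never enters the hook body.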