pytorch-lightning: Added support for flushing Comet experiment data to Comet after saving a checkpoint.

Hey, this PR is from an SDK engineer at Comet.

What does this PR do?

Added support for flushing Comet experiment data to Comet after saving a checkpoint. This behavior is configurable through the flush_every parameter of CometLogger.
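As a rough illustration of the idea, here is a minimal sketch of a counter-based flush policy. It is not the CometLogger implementation: FakeExperiment and CheckpointFlusher are hypothetical names, and the sketch assumes flush_every means "flush after every N-th saved checkpoint", which the description above does not spell out.

```python
class FakeExperiment:
    """Hypothetical stand-in for a Comet experiment; it only counts flush calls."""

    def __init__(self):
        self.flush_calls = 0

    def flush(self):
        self.flush_calls += 1
        return True


class CheckpointFlusher:
    """Hypothetical helper, not part of Lightning or the Comet SDK.

    Assumption: flush_every=N flushes after every N-th saved checkpoint,
    and flush_every=None disables flushing.
    """

    def __init__(self, experiment, flush_every=1):
        if flush_every is not None and flush_every < 1:
            raise ValueError("flush_every must be a positive integer or None")
        self.experiment = experiment
        self.flush_every = flush_every
        self._saved = 0

    def after_save_checkpoint(self):
        """Record one saved checkpoint; return True if a flush was triggered."""
        self._saved += 1
        if self.flush_every is None:
            return False
        if self._saved % self.flush_every == 0:
            self.experiment.flush()
            return True
        return False


experiment = FakeExperiment()
flusher = CheckpointFlusher(experiment, flush_every=2)
# Simulate five checkpoint saves and record which ones triggered a flush.
results = [flusher.after_save_checkpoint() for _ in range(5)]
```

With flush_every=2 in this sketch, only the second and fourth saves trigger a flush; a larger value trades upload freshness for fewer blocking calls.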

Fixes #20681

Before submitting

[ ] Was this discussed/agreed via a GitHub issue? (not for typos and docs)
[x] Did you read the contributor guideline, Pull Request section?
[x] Did you make sure your PR does only one thing, instead of bundling different changes together?
[ ] Did you make sure to update the documentation with your changes? (if necessary)
[ ] Did you write any new necessary tests? (not for typos and docs)
[x] Did you verify new and existing tests pass locally with your changes?
[ ] Did you list all the breaking changes introduced by this pull request?
[ ] Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR. Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist

[ ] Is this pull request ready for review? (if not, please submit in draft mode)
[ ] Check that all items from Before submitting are resolved
[ ] Make sure the title is self-explanatory and the description concisely explains the PR
[ ] Add labels and milestones (and optionally projects) to the PR so it can be classified

📚 Documentation preview 📚: https://pytorch-lightning--20680.org.readthedocs.build/en/20680/

Mar 27 '25 17:03 yaricom

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 79%. Comparing base (df5dee6) to head (25c63f7).

:exclamation: There is a different number of reports uploaded between BASE (df5dee6) and HEAD (25c63f7).

HEAD has 599 uploads less than BASE

Flag	BASE (df5dee6)	HEAD (25c63f7)
cpu	161	27
python	18	3
lightning_fabric	37	0
pytest	83	0
python3.10	36	6
lightning	91	15
python3.11	36	6
python3.12.7	54	9
python3.12	17	3
gpu	3	0
pytorch2.1	18	6
pytorch_lightning	36	12
pytest-full	81	27
pytorch2.3	9	3
pytorch2.2.2	9	3
pytorch2.5	9	3
pytorch2.4.1	9	3
pytorch2.7	9	3
pytorch2.5.1	9	3
pytorch2.6	9	3

Additional details and impacted files

@@            Coverage Diff            @@
##           master   #20680     +/-   ##
=========================================
- Coverage      87%      79%     -9%     
=========================================
  Files         268      265      -3     
  Lines       23449    23400     -49     
=========================================
- Hits        20501    18392   -2109     
- Misses       2948     5008   +2060

Mar 27 '25 23:03 codecov[bot]

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://lightning.ai/docs/pytorch/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Discord. Thank you for your contributions.

Apr 16 '25 05:04 stale[bot]

Dear maintainers, please advise how to make this PR better. Thank you!

Apr 16 '25 06:04 yaricom

Hi @Borda! I'm also an SDK engineer from Comet. Can we please get a review for this PR? We have users waiting for these changes to be released.

Jun 03 '25 12:06 alexkuzmik

> Can we please get a review for this PR? We have users waiting for these changes to be released.

Hello, please reach out to @williamFalcon.

Jun 03 '25 14:06 Borda

Hi, I don't know why, but calling comet_expr.flush() in after_save_checkpoint with DDP causes EOFError: Ran out of input (raised from the ModelCheckpoint.file_exists method, which calls strategy.broadcast). When I do not flush, my code works well. The error occurs deterministically, depending on the setup (such as batch size). It occurs on rank 1 and is raised after after_save_checkpoint, which is rank-zero-only and simply returns on that rank. Do you face the same problem?

Aug 23 '25 16:08 hieubnt235
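For readers unfamiliar with the rank-zero-only pattern mentioned in the report above, here is a minimal pure-Python sketch. It does not use Lightning or torch.distributed, and it neither reproduces nor diagnoses the EOFError. rank_zero_only_sketch and make_hook are hypothetical names, and the rank is passed in explicitly instead of being read from the process environment.

```python
import functools


def rank_zero_only_sketch(rank):
    """Build a decorator that runs the wrapped function only when rank == 0.

    Hypothetical stand-in for a rank-zero-only guard; other ranks return None.
    """

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if rank == 0:
                return fn(*args, **kwargs)
            return None  # non-zero ranks skip the body and return immediately

        return wrapper

    return decorator


def make_hook(rank, log):
    """Create a guarded after_save_checkpoint hook for the given simulated rank."""

    @rank_zero_only_sketch(rank)
    def after_save_checkpoint():
        # Stand-in for the (possibly slow) flush call made on rank 0.
        log.append(f"rank {rank}: flush")
        return "flushed"

    return after_save_checkpoint


log = []
# Simulate two processes (rank 0 and rank 1) calling the same hook.
outcomes = [make_hook(rank, log)() for rank in range(2)]
```

In this sketch only rank 0 performs the flush, while rank 1 returns at once. In a real DDP run, that means the ranks may reach the next collective call at different times; whether that is related to the reported error is not established here.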
