
PR: Fix Duplicate Metric Logging in MLFlowLogger to Prevent MLflow Database Errors

Open KAVYANSHTYAGI opened this issue 6 months ago • 3 comments

What does this PR do?

This PR fixes a long-standing issue in PyTorch Lightning’s MLFlowLogger where logging the same metric (same name and step) more than once in a run causes a unique constraint violation on certain MLflow backends (e.g., PostgreSQL). With this change, MLFlowLogger tracks (metric, step) pairs and skips duplicate metric logs within a run, preventing database errors and improving robustness.

This change also updates the class docstring to document this new behavior and adds a unit test to verify that duplicate metric logs are ignored as expected.

Fixes #20865

Motivation and Context

Some MLflow tracking servers (such as those backed by PostgreSQL) enforce a unique constraint on metrics.

If the same metric (with identical name and step) is logged more than once, MLflow returns an error and metric logging fails, potentially halting training.
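The failure mode can be illustrated without a tracking server. The sketch below is only an analogy: it uses an in-memory SQLite database and a hypothetical, simplified `metrics` table (not MLflow's actual schema) to show how a unique constraint rejects a second row with the same run, name, and step. The table layout and the `insert_metric` helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, simplified table for illustration only; NOT MLflow's real schema.
conn.execute(
    "CREATE TABLE metrics ("
    "run_id TEXT, name TEXT, step INTEGER, value REAL, "
    "UNIQUE (run_id, name, step))"
)


def insert_metric(run_id: str, name: str, step: int, value: float) -> bool:
    """Insert one metric row; return False if the unique constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO metrics VALUES (?, ?, ?, ?)",
                (run_id, name, step, value),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = insert_metric("run-1", "loss", 1, 0.5)   # accepted
second = insert_metric("run-1", "loss", 1, 0.5)  # same (run, name, step): rejected
```

In this toy setup the second insert is rejected by the database, which mirrors the kind of error described above when the backend enforces uniqueness.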

This situation often arises when users call .log() in multiple hooks or callbacks.

The deduplication logic ensures only the first log of a metric per (name, step) is recorded per run.
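A minimal sketch of this kind of deduplication, assuming a set of already-seen (name, step) pairs is kept for the lifetime of a run. This is not the code from the PR: `DedupMetricLogger` and its `send` callback are hypothetical stand-ins for the logger and for whatever call writes to the tracking backend.

```python
from typing import Callable, Dict, List, Set, Tuple


class DedupMetricLogger:
    """Hypothetical wrapper that forwards each (name, step) pair at most once."""

    def __init__(self, send: Callable[[str, float, int], None]) -> None:
        self._send = send  # stand-in for the call that writes to the backend
        self._seen: Set[Tuple[str, int]] = set()

    def log_metrics(self, metrics: Dict[str, float], step: int) -> int:
        """Forward unseen metrics; return how many were skipped as duplicates."""
        skipped = 0
        for name, value in metrics.items():
            key = (name, step)
            if key in self._seen:
                skipped += 1  # already logged for this step: do not resend
                continue
            self._seen.add(key)
            self._send(name, value, step)
        return skipped


recorded: List[Tuple[str, float, int]] = []
logger = DedupMetricLogger(lambda name, value, step: recorded.append((name, value, step)))

logger.log_metrics({"loss": 0.5}, step=1)
logger.log_metrics({"loss": 0.4}, step=1)  # duplicate (name, step): skipped
logger.log_metrics({"loss": 0.3}, step=2)  # new step: forwarded
```

With this approach the first value logged for a given (name, step) wins and later values for the same pair are dropped. The set grows with the number of distinct pairs logged during the run.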

Dependencies

No new dependencies are introduced.

Does your PR introduce any breaking changes?

No breaking changes. Existing behavior is preserved, except that duplicate metric logs are now skipped instead of being sent to MLflow (users may see a log message when a duplicate is skipped).

Other Checklist Items

Documentation updated: yes (see the class docstring in MLFlowLogger)

New test added for deduplication: yes

Fun fact: This change will help Lightning users avoid subtle training failures, especially with remote or production MLflow tracking servers!


📚 Documentation preview 📚: https://pytorch-lightning--20871.org.readthedocs.build/en/20871/

KAVYANSHTYAGI • Jun 02 '25 13:06

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://lightning.ai/docs/pytorch/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Discord. Thank you for your contributions.

stale[bot] • Jul 19 '25 05:07

This may also fix #20902.

SkafteNicki • Aug 13 '25 10:08

@KAVYANSHTYAGI could you please check the failing tests?

SkafteNicki • Aug 14 '25 04:08