MlflowException when logging checkpoints with MLFlowLogger

Open leike0813 opened this issue 6 months ago • 9 comments

Bug description

When using MLFlowLogger with log_model=True, an error occurs during training when attempting to log checkpoints:

mlflow.exceptions.MlflowException: Invalid artifact path: 'model/checkpoints/epoch=0-step=151'. Names may be treated as files in certain cases, and must not resolve to other names when treated as such. This name would resolve to 'model/checkpoints/epoch=0-step=151'

Environment

  • Python 3.9
  • PyTorch Lightning 2.5.2
  • MLflow 3.1.0

Error Analysis

Error Origin:

The error is raised at site-packages/mlflow/store/artifact/artifact_repo.py, line 462, in verify_artifact_path:

def verify_artifact_path(artifact_path):
    if artifact_path and path_not_unique(artifact_path):
        raise MlflowException(
            f"Invalid artifact path: '{artifact_path}'. {bad_path_message(artifact_path)}"
        )

Validation Failure:

The path_not_unique function (from site-packages/mlflow/utils/validation.py, line 164) fails validation:

def path_not_unique(name):
    norm = posixpath.normpath(name)
    return norm != name or norm == "." or norm.startswith("..") or norm.startswith("/")

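Specifically, norm != name evaluates to True because name is a Path object, not a string. posixpath.normpath always returns a str, and a str never compares equal to a Path, so the check fails even though the path text itself is unchanged. This is why the error message reports that the name "would resolve to" an identical-looking value.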
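
The comparison can be checked in isolation using only the standard library. This is a minimal sketch: path_not_unique is copied from the snippet above, and PurePosixPath stands in for the Path object that Lightning builds so that the result does not depend on the operating system.

```python
import posixpath
from pathlib import PurePosixPath


def path_not_unique(name):
    # Same logic as the MLflow helper quoted above.
    norm = posixpath.normpath(name)
    return norm != name or norm == "." or norm.startswith("..") or norm.startswith("/")


artifact_str = "model/checkpoints/epoch=0-step=151"
artifact_path = PurePosixPath(artifact_str)  # stand-in for Lightning's Path object

# normpath() accepts path-like objects but always returns a str.
assert isinstance(posixpath.normpath(artifact_path), str)

# A str and a path object never compare equal, even when the text is identical.
assert posixpath.normpath(artifact_path) != artifact_path

print(path_not_unique(artifact_str))   # False: the plain string passes validation
print(path_not_unique(artifact_path))  # True: the path object is rejected
```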

Root Cause:

In Lightning 2.5.2 (site-packages/lightning/pytorch/loggers/mlflow.py, line 366), artifact_path is constructed as a Path object:

artifact_path = Path(self._checkpoint_path_prefix) / Path(p).stem  # Returns Path object

This Path object is passed to MLflow's log_artifact(), ultimately triggering the validation error.

Historical Context:

In Lightning 2.2.4, the same location used a string (no error):

artifact_path = f"model/checkpoints/{Path(p).stem}"  # Returns string

In MLflow 2.12.2, the path_not_unique logic was identical, confirming the issue stems from Lightning’s Path usage.

Proposed Fix

Modify the Lightning code to explicitly convert artifact_path to a POSIX string:

artifact_path = (Path(self._checkpoint_path_prefix) / Path(p).stem).as_posix()  # Convert to string

After applying this change, the error no longer occurs.
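
The effect of the proposed change can be illustrated outside Lightning. In this sketch, build_artifact_path is a hypothetical helper, not a function in Lightning or MLflow, and the prefix and checkpoint file name are example values taken from the error message above.

```python
from pathlib import Path


def build_artifact_path(prefix, checkpoint_file):
    # Hypothetical helper mirroring the proposed one-line fix:
    # join the prefix with the checkpoint's stem, then convert to a POSIX string.
    return (Path(prefix) / Path(checkpoint_file).stem).as_posix()


result = build_artifact_path("model/checkpoints", "/tmp/run/epoch=0-step=151.ckpt")

print(type(result).__name__)  # str
print(result)                 # model/checkpoints/epoch=0-step=151
```

Because the result is a plain str with forward slashes, the norm != name comparison in path_not_unique no longer trips on a type mismatch.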

Recommendation

Update the mlflow.py logger in PyTorch Lightning to ensure artifact_path is passed as a string (not Path). This aligns with MLflow’s API expectations and resolves the path normalization issue.

What version are you seeing the problem on?

v2.5

Reproduced in studio

No response

How to reproduce the bug

Just use MLFlowLogger with log_model=True as the logger to perform training.

Error messages and logs

Traceback (most recent call last):
  File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 48, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 599, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 1025, in _run
    call._call_teardown_hook(self)
  File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 148, in _call_teardown_hook
    logger.finalize("success")
  File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/lightning_utilities/core/rank_zero.py", line 41, in wrapped_fn
    return fn(*args, **kwargs)
  File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/lightning/pytorch/loggers/mlflow.py", line 289, in finalize
    self._scan_and_log_checkpoints(self._checkpoint_callback)
  File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/lightning/pytorch/loggers/mlflow.py", line 369, in _scan_and_log_checkpoints
    self.experiment.log_artifact(self._run_id, p, artifact_path)
  File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/mlflow/tracking/client.py", line 2433, in log_artifact
    self._tracking_client.log_artifact(run_id, local_path, artifact_path)
  File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/mlflow/tracking/_tracking_service/client.py", line 639, in log_artifact
    artifact_repo.log_artifact(local_path, artifact_path)
  File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/mlflow/store/artifact/local_artifact_repo.py", line 33, in log_artifact
    verify_artifact_path(artifact_path)
  File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/mlflow/store/artifact/artifact_repo.py", line 464, in verify_artifact_path
    raise MlflowException(
mlflow.exceptions.MlflowException: Invalid artifact path: 'model/checkpoints/epoch=0-step=151'. Names may be treated as files in certain cases, and must not resolve to other names when treated as such. This name would resolve to 'model/checkpoints/epoch=0-step=151'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 68, in _call_and_handle_interrupt
    _interrupt(trainer, exception)
  File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 82, in _interrupt
    logger.finalize("failed")
  File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/lightning_utilities/core/rank_zero.py", line 41, in wrapped_fn
    return fn(*args, **kwargs)
  File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/lightning/pytorch/loggers/mlflow.py", line 289, in finalize
    self._scan_and_log_checkpoints(self._checkpoint_callback)
  File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/lightning/pytorch/loggers/mlflow.py", line 369, in _scan_and_log_checkpoints
    self.experiment.log_artifact(self._run_id, p, artifact_path)
  File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/mlflow/tracking/client.py", line 2433, in log_artifact
    self._tracking_client.log_artifact(run_id, local_path, artifact_path)
  File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/mlflow/tracking/_tracking_service/client.py", line 639, in log_artifact
    artifact_repo.log_artifact(local_path, artifact_path)
  File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/mlflow/store/artifact/local_artifact_repo.py", line 33, in log_artifact
    verify_artifact_path(artifact_path)
  File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/mlflow/store/artifact/artifact_repo.py", line 464, in verify_artifact_path
    raise MlflowException(
mlflow.exceptions.MlflowException: Invalid artifact path: 'model/checkpoints/epoch=0-step=151'. Names may be treated as files in certain cases, and must not resolve to other names when treated as such. This name would resolve to 'model/checkpoints/epoch=0-step=151'

More info

No response

leike0813 avatar Jun 24 '25 03:06 leike0813

It looks like the revert commits for the changes introduced in #20538 were made, but it seems these two commits didn’t make it into the latest release.

cc: @Borda

bhimrazy avatar Jun 24 '25 05:06 bhimrazy

Please keep an eye out for the release of v2.5.2.post0, which will include the fix commits. In the meantime, you can either use a previous version that doesn’t have this issue or install the latest code from source.

bhimrazy avatar Jun 27 '25 06:06 bhimrazy

@bhimrazy It looks like the fix suggested by @leike0813 was implemented in #20669, but was subsequently reverted. The current release v2.5.3 is still problematic.

https://github.com/Lightning-AI/pytorch-lightning/blob/2.5.3/src/lightning/pytorch/loggers/mlflow.py#L365-L369

yxtay avatar Aug 14 '25 06:08 yxtay

@bhimrazy This is still an issue in 2.5.5

patrontheo avatar Oct 27 '25 09:10 patrontheo

Hi @yxtay, we will soon make a release and aim to fix it.

cc: @Borda @SkafteNicki

deependujha avatar Oct 29 '25 08:10 deependujha

@deependujha, this is still a problem in 2.5.6.

https://github.com/Lightning-AI/pytorch-lightning/blob/2.5.6/src/lightning/pytorch/loggers/mlflow.py#L365-L369

yxtay avatar Nov 06 '25 02:11 yxtay

This is indeed still an issue in 2.5.6 :/

leoyala avatar Nov 13 '25 11:11 leoyala

Hi @leoyala, we're expecting to make the 2.6.0 release next week.

deependujha avatar Nov 20 '25 15:11 deependujha

@deependujha, I could not see the fix in the 2.6.0 release notes. Is there a PR ready?

Northo avatar Dec 01 '25 08:12 Northo

Hi @Northo, thanks for the flag. It may have been missed in the 2.6.0 release notes.

Could you confirm whether you still see the issue on the latest 2.6.0 build? If yes, please share a minimal, self-contained repro (small code snippet + exact steps and environment).

deependujha avatar Dec 11 '25 10:12 deependujha