MlflowException when logging checkpoints with MLFlowLogger
Bug description
When using MLFlowLogger with log_model=True, an error occurs during training when attempting to log checkpoints:
mlflow.exceptions.MlflowException: Invalid artifact path: 'model/checkpoints/epoch=0-step=151'. Names may be treated as files in certain cases, and must not resolve to other names when treated as such. This name would resolve to 'model/checkpoints/epoch=0-step=151'
Environment
- Python 3.9
- PyTorch Lightning 2.5.2
- MLflow 3.1.0
Error Analysis
Error Origin:
The error is raised at site-packages/mlflow/store/artifact/artifact_repo.py, line 462, in verify_artifact_path:
def verify_artifact_path(artifact_path):
    if artifact_path and path_not_unique(artifact_path):
        raise MlflowException(
            f"Invalid artifact path: '{artifact_path}'. {bad_path_message(artifact_path)}"
        )
Validation Failure:
The path_not_unique function (from site-packages/mlflow/utils/validation.py, line 164) fails validation:
def path_not_unique(name):
    norm = posixpath.normpath(name)
    return norm != name or norm == "." or norm.startswith("..") or norm.startswith("/")
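Specifically, posixpath.normpath returns a str, so when name is a Path object (not a string), norm != name always evaluates to True: a str never compares equal to a Path, even when both print identically. This is why the exception is raised although the message shows the same path twice.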
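The comparison can be checked in isolation with only the standard library. The sketch below copies path_not_unique from the snippet quoted above (it is not imported from MLflow) and uses PurePosixPath as a stand-in for the Path object that Lightning builds, so the result does not depend on the operating system. The variable names are illustrative only.

```python
import posixpath
from pathlib import PurePosixPath


def path_not_unique(name):
    # Copy of the MLflow helper quoted in this issue, for illustration only.
    norm = posixpath.normpath(name)
    return norm != name or norm == "." or norm.startswith("..") or norm.startswith("/")


# Stand-in for the value Lightning 2.5.2 passes (a Path object).
as_path = PurePosixPath("model/checkpoints") / "epoch=0-step=151"
# The same location expressed as a plain string.
as_str = as_path.as_posix()

# normpath accepts path-like objects but always returns a str.
norm = posixpath.normpath(as_path)
same_text = norm == str(as_path)  # identical when printed
equal_objects = norm == as_path   # but str and Path never compare equal

path_rejected = path_not_unique(as_path)
str_rejected = path_not_unique(as_str)

print(f"same text: {same_text}, equal objects: {equal_objects}")
print(f"Path rejected: {path_rejected}, str rejected: {str_rejected}")
```

The Path input is rejected purely because of the type mismatch in norm != name, while the equivalent string passes validation.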
Root Cause:
In Lightning 2.5.2 (site-packages/lightning/pytorch/loggers/mlflow.py, line 366), artifact_path is constructed as a Path object:
artifact_path = Path(self._checkpoint_path_prefix) / Path(p).stem # Returns Path object
This Path object is passed to MLflow's log_artifact(), ultimately triggering the validation error.
Historical Context:
In Lightning 2.2.4, the same location used a string (no error):
artifact_path = f"model/checkpoints/{Path(p).stem}" # Returns string
In MLflow 2.12.2, the path_not_unique logic was identical, confirming the issue stems from Lightning’s Path usage.
Proposed Fix
Modify the Lightning code to explicitly convert artifact_path to a POSIX string:
artifact_path = (Path(self._checkpoint_path_prefix) / Path(p).stem).as_posix() # Convert to string
After applying this change, the error no longer occurs.
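As a rough illustration of why as_posix() is a reasonable choice here rather than str(), the sketch below builds the artifact path with a hypothetical helper, checkpoint_artifact_path (not part of Lightning or MLflow), and compares the two conversions using the pure path classes from pathlib. On Windows, str() on a path yields backslashes, whereas as_posix() always yields forward slashes.

```python
from pathlib import PurePosixPath, PureWindowsPath


def checkpoint_artifact_path(prefix: str, checkpoint_file: str) -> str:
    """Hypothetical helper: build the artifact path as a POSIX-style string."""
    # Drop the directory and the .ckpt suffix, keep only the checkpoint stem.
    stem = PurePosixPath(checkpoint_file).stem
    return (PurePosixPath(prefix) / stem).as_posix()


artifact_path = checkpoint_artifact_path(
    "model/checkpoints", "/tmp/ckpts/epoch=0-step=151.ckpt"
)
print(artifact_path)

# The same join on a Windows-flavoured path shows the difference
# between str() and as_posix().
win_path = PureWindowsPath("model/checkpoints") / "epoch=0-step=151"
win_str = str(win_path)         # uses backslashes as separators
win_posix = win_path.as_posix()  # uses forward slashes
print(win_str, win_posix)
```

Because the helper returns a plain str with forward slashes, the result stays unchanged under posixpath.normpath and compares equal to its normalized form.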
Recommendation
Update the mlflow.py logger in PyTorch Lightning to ensure artifact_path is passed as a string (not Path). This aligns with MLflow’s API expectations and resolves the path normalization issue.
What version are you seeing the problem on?
v2.5
Reproduced in studio
No response
How to reproduce the bug
Use MLFlowLogger with log_model=True as the Trainer's logger and run training.
Error messages and logs
Traceback (most recent call last):
File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 48, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 599, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 1025, in _run
call._call_teardown_hook(self)
File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 148, in _call_teardown_hook
logger.finalize("success")
File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/lightning_utilities/core/rank_zero.py", line 41, in wrapped_fn
return fn(*args, **kwargs)
File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/lightning/pytorch/loggers/mlflow.py", line 289, in finalize
self._scan_and_log_checkpoints(self._checkpoint_callback)
File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/lightning/pytorch/loggers/mlflow.py", line 369, in _scan_and_log_checkpoints
self.experiment.log_artifact(self._run_id, p, artifact_path)
File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/mlflow/tracking/client.py", line 2433, in log_artifact
self._tracking_client.log_artifact(run_id, local_path, artifact_path)
File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/mlflow/tracking/_tracking_service/client.py", line 639, in log_artifact
artifact_repo.log_artifact(local_path, artifact_path)
File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/mlflow/store/artifact/local_artifact_repo.py", line 33, in log_artifact
verify_artifact_path(artifact_path)
File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/mlflow/store/artifact/artifact_repo.py", line 464, in verify_artifact_path
raise MlflowException(
mlflow.exceptions.MlflowException: Invalid artifact path: 'model/checkpoints/epoch=0-step=151'. Names may be treated as files in certain cases, and must not resolve to other names when treated as such. This name would resolve to 'model/checkpoints/epoch=0-step=151'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 68, in _call_and_handle_interrupt
_interrupt(trainer, exception)
File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 82, in _interrupt
logger.finalize("failed")
File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/lightning_utilities/core/rank_zero.py", line 41, in wrapped_fn
return fn(*args, **kwargs)
File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/lightning/pytorch/loggers/mlflow.py", line 289, in finalize
self._scan_and_log_checkpoints(self._checkpoint_callback)
File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/lightning/pytorch/loggers/mlflow.py", line 369, in _scan_and_log_checkpoints
self.experiment.log_artifact(self._run_id, p, artifact_path)
File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/mlflow/tracking/client.py", line 2433, in log_artifact
self._tracking_client.log_artifact(run_id, local_path, artifact_path)
File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/mlflow/tracking/_tracking_service/client.py", line 639, in log_artifact
artifact_repo.log_artifact(local_path, artifact_path)
File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/mlflow/store/artifact/local_artifact_repo.py", line 33, in log_artifact
verify_artifact_path(artifact_path)
File "/home/joshua/miniforge3/envs/PyTorch250/lib/python3.9/site-packages/mlflow/store/artifact/artifact_repo.py", line 464, in verify_artifact_path
raise MlflowException(
mlflow.exceptions.MlflowException: Invalid artifact path: 'model/checkpoints/epoch=0-step=151'. Names may be treated as files in certain cases, and must not resolve to other names when treated as such. This name would resolve to 'model/checkpoints/epoch=0-step=151'
More info
No response
It looks like the revert commits for the changes introduced in #20538 were made, but it seems these two commits didn’t make it into the latest release:
cc: @Borda
Please keep an eye out for the release of v2.5.2.post0, which will include the fix commits. In the meantime, you can either use a previous version that doesn’t have this issue or install the latest code from source.
@bhimrazy It looks like the fix suggested by @leike0813 was implemented in #20669, but was subsequently reverted. The current release v2.5.3 is still problematic.
https://github.com/Lightning-AI/pytorch-lightning/blob/2.5.3/src/lightning/pytorch/loggers/mlflow.py#L365-L369
@bhimrazy This is still an issue in 2.5.5
Hi @yxtay, we will soon make a release and aim to fix it.
cc: @Borda @SkafteNicki
@deependujha, this is still a problem in 2.5.6.
https://github.com/Lightning-AI/pytorch-lightning/blob/2.5.6/src/lightning/pytorch/loggers/mlflow.py#L365-L369
This is indeed still an issue in 2.5.6 :/
Hi @leoyala, we're expecting to make 2.6.0 release next week.
@deependujha, I could not see the fix in the 2.6.0 release notes. Is there a PR ready?
Hi @Northo, thanks for the flag. It may have been missed in the 2.6.0 release notes.
Could you confirm whether you still see the issue on the latest 2.6.0 build? If yes, please share a minimal, self-contained repro (small code snippet + exact steps and environment).