Add a monkey patch to make MLflow logging functions not throw after failed retries
Customer request: make it so that MLflow logging functions failing doesn't crash the program. For now this is done by making MLflow functions fail "silently" (the failure gets printed to the log).
For this I'm basically patching every function in the mlflow package to add a try/catch.
The try/catch only swallows exceptions at the topmost call into the mlflow package; nested calls let them bubble up, so all error handling inside mlflow should still run correctly.
I'm also patching all functions in the mlflow.MlflowClient class.
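Roughly, the wrapping looks like the sketch below. This is only an illustration of the approach, not the code in this PR: the thread-local depth counter is one assumed way of detecting the "topmost" call, and the helper names (patch_mlflow_sketch, _silencing_wrapper) are made up.

import functools
import inspect
import threading
import types

import mlflow

# Thread-local call depth: only the outermost mlflow call swallows the exception,
# nested calls re-raise so mlflow's own error handling still runs.
_depth = threading.local()

def _silencing_wrapper(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        depth = getattr(_depth, "value", 0)
        _depth.value = depth + 1
        try:
            return func(*args, **kwargs)
        except Exception as e:
            if depth > 0:
                raise  # not the topmost call, let mlflow handle it
            print(f"mlflow call {func.__name__} failed: {e}")
            return None
        finally:
            _depth.value = depth
    return wrapper

def patch_mlflow_sketch():
    # Wrap module-level mlflow functions.
    # (Context-manager-returning functions like start_run are handled differently, see below.)
    for name, obj in list(vars(mlflow).items()):
        if isinstance(obj, types.FunctionType) and not name.startswith("_"):
            setattr(mlflow, name, _silencing_wrapper(obj))
    # Wrap MlflowClient methods
    for name, obj in list(vars(mlflow.MlflowClient).items()):
        if inspect.isfunction(obj) and not name.startswith("_"):
            setattr(mlflow.MlflowClient, name, _silencing_wrapper(obj))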
Functions that return a context manager are retried for 15 minutes instead, because they have to return something (see the retry sketch after the snippet below). Maybe I could return a mock context manager that no-ops instead.
with mlflow.start_run():
    ...  # Code here
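The retry behaviour for those amounts to something like this sketch (again illustrative; _retry_for and the retry interval are assumptions, only the 15-minute total is from the patch):

import time

def _retry_for(func, timeout_seconds=15 * 60, interval_seconds=10):
    # Keep retrying a context-manager-returning call (e.g. mlflow.start_run)
    # until the deadline, since it has to return a real object.
    deadline = time.monotonic() + timeout_seconds
    while True:
        try:
            return func()
        except Exception:
            if time.monotonic() >= deadline:
                raise  # give up after ~15 minutes and surface the error
            time.sleep(interval_seconds)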
I don't think this is a good approach in general, because, for example, silencing an error in a function that's supposed to return a value leads to that function returning nothing. This could break something further down the line in the user's code, for example:
experiment = mlflow.get_experiment_by_name("My Experiment")  # This fails and silently returns None
mlflow.delete_experiment(experiment.experiment_id)  # Now here there's no experiment, so this raises AttributeError
IMO, it would be better if users wrapped their code in a try/except MlflowException
and achieved the same result without the potential for weird subtle breakages like this.
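For example, something along these lines (just an illustration of the suggested pattern; the metric name is arbitrary):

import mlflow
from mlflow.exceptions import MlflowException

try:
    mlflow.log_metric("some_metric", 1.0)
except MlflowException as e:
    # The user decides explicitly what a failed log call means for their run
    print(f"Metric logging failed, continuing: {e}")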
Example code that this was tested with:
import time

import mlflow
from dagshub.mlflow import patch_mlflow

patch_mlflow()

with mlflow.start_run():
    mlflow.log_metric("logged_metric_1", 1)
    time.sleep(10)  # Turn off server here
    mlflow.log_metric("unlogged_metric_1", 2)
    time.sleep(15)  # Turn server back on here
    mlflow.log_metric("logged_metric_2", 3)
The result:
The running script printed out the exception for the unlogged metric call and kept running.
Interesting, I didn't have the various bookkeeping functions like start_run in mind when I asked for this.
For those it makes more sense to throw an error, since they're more likely to be O(1) operations and less likely to fail in the middle of a days-long training run.
I was thinking more of the log_* methods, which are called continuously while monitoring a long-lived process.
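If that's the scope, a narrower variant along these lines might fit better (purely a sketch of the suggestion; patch_log_functions_only is not an existing function and the log_ name filter is an assumption):

import functools
import types

import mlflow

def patch_log_functions_only():
    # Wrap only the log_* family; bookkeeping calls like start_run keep raising.
    def silence(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as e:
                print(f"mlflow.{func.__name__} failed, continuing: {e}")
                return None
        return wrapper
    for name, obj in list(vars(mlflow).items()):
        if isinstance(obj, types.FunctionType) and name.startswith("log_"):
            setattr(mlflow, name, silence(obj))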