
Add a monkey patch to make MLflow logging functions not throw after failed retries

Open kbolashev opened this issue 10 months ago • 2 comments

Customer request: make it so that failures in MLflow logging functions don't crash the program. For now, this is done by making the MLflow functions fail "silently" (the failure gets printed out in the log).

To do this, I'm basically patching every function in the mlflow package to add a try/catch. Exceptions still bubble up until the topmost call into the mlflow package and are only swallowed there, so all of the handling inside mlflow itself should keep running correctly.
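A rough sketch of that idea (illustrative only, not the actual dagshub.mlflow implementation; the thread-local depth counter is just one way to swallow the error only at the topmost call into the package):

import functools
import inspect
import logging
import threading

import mlflow

logger = logging.getLogger(__name__)

# Thread-local depth counter: exceptions raised by nested mlflow calls are
# re-raised so mlflow's own error handling still runs; they are only swallowed
# at the topmost call into the patched package.
_state = threading.local()


def _silencing(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        depth = getattr(_state, "depth", 0)
        _state.depth = depth + 1
        try:
            return func(*args, **kwargs)
        except Exception:
            if depth == 0:
                # Topmost call: log the failure instead of crashing the program.
                # In practice you would probably narrow this to MlflowException.
                logger.exception("MLflow call %s failed, continuing", func.__name__)
                return None
            raise
        finally:
            _state.depth = depth
    return wrapper


def patch_module_functions():
    # Wrap every public module-level function exposed by the mlflow package.
    for name, attr in list(vars(mlflow).items()):
        if inspect.isfunction(attr) and not name.startswith("_"):
            setattr(mlflow, name, _silencing(attr))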

I'm also patching all of the functions in the mlflow.MlflowClient class. The functions that return a context manager I'm instead retrying for 15 minutes, because they have to return something. Maybe I could return a mock context manager that no-ops, so that code like this still works:

with mlflow.start_run():
    ...  # Code here

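A minimal sketch of what that no-op fallback could look like (illustrative only; _NoOpRun and safe_start_run are made-up names, not part of the DagsHub client):

import logging

import mlflow

logger = logging.getLogger(__name__)


class _NoOpRun:
    """Stand-in returned when mlflow.start_run() keeps failing.

    It supports the `with` protocol and absorbs attribute access, so user
    code like `with mlflow.start_run() as run:` doesn't crash.
    """

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        return False  # don't suppress exceptions from the user's own code

    def __getattr__(self, name):
        # Any attribute lookup (run.info, run.data, ...) returns another no-op.
        return _NoOpRun()

    def __call__(self, *args, **kwargs):
        return _NoOpRun()


def safe_start_run(*args, **kwargs):
    try:
        return mlflow.start_run(*args, **kwargs)
    except Exception:
        logger.exception("mlflow.start_run failed, returning a no-op run")
        return _NoOpRun()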
I don't think this is a good approach in general, because, for example, silencing an error in a function that's supposed to return a value means the function returns nothing. That can break something further down in the user's code, for example:

experiment = mlflow.get_experiment_by_name("My Experiment")  # This fails silently and returns None
mlflow.delete_experiment(experiment.experiment_id)  # Now there's no experiment here

IMO, it would be better if users wrapped their code in a try/except MlflowException themselves; they'd get the same result without the potential for weird, subtle breakages like this.
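For comparison, the explicit user-side version of the snippet above might look something like this (a sketch, assuming the user just wants to skip the operation when the MLflow call fails):

import mlflow
from mlflow.exceptions import MlflowException

try:
    experiment = mlflow.get_experiment_by_name("My Experiment")
    if experiment is not None:
        mlflow.delete_experiment(experiment.experiment_id)
except MlflowException as e:
    # The failure is handled explicitly here, instead of the getter silently
    # returning None and the next line crashing on .experiment_id.
    print(f"MLflow call failed: {e}")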

Example code that this was tested with:

import time

import mlflow
from dagshub.mlflow import patch_mlflow

patch_mlflow()

with mlflow.start_run():
    mlflow.log_metric("logged_metric_1", 1)
    time.sleep(10) # Turn off server here
    mlflow.log_metric("unlogged_metric_1", 2)
    time.sleep(15) # Turn server back on here
    mlflow.log_metric("logged_metric_2", 3)

The result is in the attached screenshot.

The running script printed out the exception for the unlogged metric call.

kbolashev commented on May 02 '24

Interesting, I didn't have the various bookkeeping functions like start_run in mind when I asked for this. For those, it makes more sense to throw an error, since they're more likely to be O(1) operations and less likely to fail in the middle of a days-long training run. I was thinking more of the log_* methods, which continuously monitor a long-lived process.
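A narrower patch along those lines might wrap only the log_*-style functions and leave the bookkeeping functions untouched (a rough sketch, not what patch_mlflow currently does):

import functools
import logging

import mlflow

logger = logging.getLogger(__name__)

# Only the fire-and-forget logging calls; start_run, create_experiment, etc.
# are left alone so they still raise loudly.
_LOG_FUNCTIONS = [
    "log_metric", "log_metrics", "log_param", "log_params",
    "log_artifact", "log_artifacts", "set_tag", "set_tags",
]


def _swallow_errors(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            logger.exception("Ignoring failure in mlflow.%s", func.__name__)
            return None
    return wrapper


def patch_logging_only():
    for name in _LOG_FUNCTIONS:
        original = getattr(mlflow, name, None)
        if callable(original):
            setattr(mlflow, name, _swallow_errors(original))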

guysmoilov commented on May 05 '24