Allow configuration of max attempts for a task

Open dhpikolo opened this issue 10 months ago • 2 comments

Currently, a task can be attempted at most 6 times in total, which caps @retry at times=4. It would be beneficial to make this limit configurable.

In our use case, we are integrating Argo retries with Metaflow’s retried Argo workflows. A configurable environment variable would let us set a limit on how many times a user can retry an Argo workflow.

That said, beyond our specific use case, adding this configuration flexibility would be generally useful.

Current Behaviour

from metaflow import (
    FlowSpec,
    Parameter,
    card,
    project,
    step,
    retry,
)


@project(name="dummy_project")
class HelloWorld(FlowSpec):
    force_error = Parameter("force-error", type=bool, default=False)

    @card
    @step
    def start(self):
        print("something")
        # Define the artifact that the end step prints.
        self.my_var = "something"
        self.next(self.end)

    @card
    @retry(times=10)
    @step
    def end(self):
        if self.force_error:
            raise Exception("Testing errors in metaflow")
        print(f"the data artifact is: {self.my_var}")


if __name__ == "__main__":
    HelloWorld()

  • Running the above flow locally with python hello_world.py run fails with the following error:
Metaflow 2.14.0 executing HelloWorld for user:j.kollipara
Project: dummy_project, Branch: user.j.kollipara
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!
    Flow failed:
    The maximum number of retries is @retry(times=4).

error: Recipe `_poetry-run` failed with exit code 1

Source code of the above error: https://github.com/Netflix/metaflow/blob/5c960eaff1ae486f503b37177f03cc1419b5571d/metaflow/plugins/retry_decorator.py#L30-L37
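
To make the arithmetic behind that message concrete, here is a minimal sketch of the kind of check involved. It is not Metaflow's actual code: the names check_retry_times, RetryLimitError, and RESERVED_ATTEMPTS are hypothetical. The reserve of two attempts is an assumption inferred from the gap between the 6-attempt limit and the times=4 reported in the error.

```python
# Hypothetical stand-ins for illustration; not Metaflow's actual implementation.
MAX_ATTEMPTS = 6       # total attempts per task, as stated in this issue
RESERVED_ATTEMPTS = 2  # assumption: attempts not available to @retry


class RetryLimitError(Exception):
    """Raised when @retry(times=N) exceeds the attempt budget."""


def check_retry_times(times: int, max_attempts: int = MAX_ATTEMPTS) -> None:
    """Reject retry counts that would push a task past max_attempts."""
    if times + RESERVED_ATTEMPTS > max_attempts:
        raise RetryLimitError(
            "The maximum number of retries is @retry(times=%d)."
            % (max_attempts - RESERVED_ATTEMPTS)
        )


check_retry_times(4)  # within the budget, no exception

try:
    check_retry_times(10)  # the value used in the flow above
except RetryLimitError as exc:
    message = str(exc)
    print(message)
```

Under these assumptions, times=10 would need a budget of 12 attempts, which is why the flow above is rejected at validation time before any step runs.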

Proposed Behaviour

Setting METAFLOW_MAX_ATTEMPTS=12 would allow users to run the above flow.
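
As a rough sketch of how such a setting could be resolved, assuming the limit is read from the environment with a default of 6: the helper names resolve_max_attempts and max_retry_times are hypothetical, and the two reserved attempts are an assumption inferred from the error message above, not a confirmed detail of the PR.

```python
import os
from typing import Mapping

DEFAULT_MAX_ATTEMPTS = 6  # current limit described in this issue
RESERVED_ATTEMPTS = 2     # assumption: attempts not available to @retry


def resolve_max_attempts(environ: Mapping[str, str] = os.environ) -> int:
    """Read METAFLOW_MAX_ATTEMPTS, falling back to the default when unset."""
    raw = environ.get("METAFLOW_MAX_ATTEMPTS")
    if raw is None:
        return DEFAULT_MAX_ATTEMPTS
    value = int(raw)  # raises ValueError for non-integer input
    if value < RESERVED_ATTEMPTS:
        raise ValueError(
            "METAFLOW_MAX_ATTEMPTS must be at least %d" % RESERVED_ATTEMPTS
        )
    return value


def max_retry_times(environ: Mapping[str, str] = os.environ) -> int:
    """Largest value of @retry(times=N) that fits in the attempt budget."""
    return resolve_max_attempts(environ) - RESERVED_ATTEMPTS


# Plain dicts stand in for the process environment here.
default_limit = max_retry_times({})
raised_limit = max_retry_times({"METAFLOW_MAX_ATTEMPTS": "12"})
print(default_limit, raised_limit)
```

With a budget of 12, the retry ceiling becomes 10 in this sketch, which matches the @retry(times=10) used in the example flow.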

dhpikolo, Feb 18 '25 16:02

I have already put up a PR with the proposed change; let me know what you think of it.

  • https://github.com/Netflix/metaflow/pull/2279

dhpikolo, Feb 18 '25 16:02

I created a new PR, since the old one was based on the development branch.

dhpikolo, Feb 19 '25 13:02