metaflow icon indicating copy to clipboard operation
metaflow copied to clipboard

[WIP] feature: support argo retry

Open dhpikolo opened this issue 10 months ago • 1 comments

This PR, modifies the step cli command to infer retry_count from flow_datastore class. Looks like this class, holds all the information about underlying datastore and run artifacts.

  • Adds a flow_datastore class method that infers latest done attempt of a task.
  • Remove MF_ATTEMPT from mflog.save_logs python script and infer attempt from the datastore.

Resolves: https://github.com/Netflix/metaflow/issues/2278

dhpikolo avatar Feb 19 '25 09:02 dhpikolo

A known limitation for this solution would be:

  • Retrying more than 6 times (or in current case, hitting argo retry button twice) times results in metaflow exception

Screenshot 2025-02-14 at 14 18 23

We are planning to address this limitation by making MAX_ATTEMPTS configurable via envvar.

  • https://github.com/Netflix/metaflow/pull/2279

dhpikolo avatar Feb 19 '25 09:02 dhpikolo