Add multiple retry strategies to improve workflow success rate

Open shuangkun opened this issue 1 year ago • 4 comments

Summary

Add multiple retry strategies to improve the workflow success rate, such as increasing memory, increasing disk size, or changing the node type.

Use Cases

Users want some failures to be resolved automatically so that they do not have to troubleshoot them by hand.

When a user's task fails and the cause of the failure can be identified, the workflow can sometimes self-heal and then run to completion.

For example:

  1. Increase the memory resource when the task runs out of memory (OOM). https://github.com/argoproj/argo-workflows/discussions/12482
  2. Increase the disk resource when the disk is full. This is common in serverless pods.
  3. Change the node type when the requested type is out of stock, for example from spot to pay-as-you-go. This is common in the cloud, especially for GPU machines.
  4. Increase the CPU limit when the task is heavily throttled. This is common on large physical machines.
  5. Other possible scenarios.
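
The scenarios above can be sketched as a mapping from a failure reason to an adjusted task specification for the next attempt. This is only an illustration of the idea, not an Argo Workflows API: `TaskSpec`, `adjust_for_retry`, the doubling factor, and the failure labels `DiskFull`, `NoStock`, and `CPUThrottled` are all hypothetical names chosen for this example (`OOMKilled` mirrors the Kubernetes container termination reason).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical resource description for one task attempt."""
    memory_mi: int
    disk_gi: int
    cpu_millis: int
    node_type: str  # assumed values: "spot" or "pay-as-you-go"


def adjust_for_retry(spec: TaskSpec, failure: str) -> TaskSpec:
    """Return the spec to use for the next attempt, given why the last one failed.

    The failure labels and the doubling policy are assumptions for illustration.
    """
    if failure == "OOMKilled":
        # Scenario 1: give the task more memory.
        return replace(spec, memory_mi=spec.memory_mi * 2)
    if failure == "DiskFull":
        # Scenario 2: give the task more disk.
        return replace(spec, disk_gi=spec.disk_gi * 2)
    if failure == "NoStock" and spec.node_type == "spot":
        # Scenario 3: fall back from spot to pay-as-you-go nodes.
        return replace(spec, node_type="pay-as-you-go")
    if failure == "CPUThrottled":
        # Scenario 4: raise the CPU limit.
        return replace(spec, cpu_millis=spec.cpu_millis * 2)
    # Scenario 5 / unknown failures: retry with the spec unchanged.
    return spec


first = TaskSpec(memory_mi=512, disk_gi=10, cpu_millis=500, node_type="spot")
second = adjust_for_retry(first, "OOMKilled")
```

Here `second` requests twice the memory of `first` and leaves the other fields untouched, so each failure reason changes only the resource that caused it.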

Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritize the proposals with the most 👍.

shuangkun avatar Feb 28 '24 08:02 shuangkun

Could https://github.com/argoproj/argo-workflows/issues/10364 be a sufficient solution to this proposal?

tczhao avatar Mar 04 '24 03:03 tczhao

Yes, I think that with #10362 and #10364 there really isn't a need for this. Increasing resource limits is already possible with the combination of those two.

Using an existing field would be better than adding sprawl to the spec to cover individual use-cases.

agilgur5 avatar Mar 05 '24 02:03 agilgur5

This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.

github-actions[bot] avatar Mar 23 '24 02:03 github-actions[bot]

This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.

github-actions[bot] avatar May 10 '24 02:05 github-actions[bot]

This issue has been closed due to inactivity and lack of information. If you still encounter this issue, please add the requested information and re-open.

github-actions[bot] avatar May 25 '24 02:05 github-actions[bot]