Add multiple retry strategies to improve workflow success rate

Open shuangkun opened this issue 1 year ago • 4 comments

Summary

Add multiple retry strategies to improve the workflow success rate, such as increasing memory, increasing disk size, or changing the node type.

Use Cases

Users want some failures to be resolved automatically so that they do not have to troubleshoot them.

When a user's task fails and the cause of the failure can be identified, the workflow could self-heal by retrying with adjusted settings so that it runs to completion.

For example:

Increase the memory resource on OOM. https://github.com/argoproj/argo-workflows/discussions/12482
Increase the disk resource when the disk is full. This is common in serverless pods.
Change the node type when there is no stock, for example from spot to pay-as-you-go. This is common in the cloud, especially for GPU machines.
Increase the CPU limit when there is heavy throttling. This is common on large physical machines.
Other possible scenarios.
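
To make the idea concrete, here is a minimal sketch of how a failure reason could be mapped to an adjustment for the next attempt. Everything in it is an assumption for illustration only: the `PodResources` type, the failure-reason strings, the doubling factors, and the `next_attempt` helper are hypothetical and are not part of the Argo Workflows API or spec.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict


@dataclass(frozen=True)
class PodResources:
    """Hypothetical view of the settings a retry strategy might change."""
    memory_mi: int
    disk_gi: int
    cpu_millis: int
    node_type: str


# Hypothetical failure reasons mapped to an adjustment for the next attempt.
# The reason names and the doubling factors are illustrative assumptions.
ADJUSTMENTS: Dict[str, Callable[[PodResources], PodResources]] = {
    "OOMKilled": lambda r: replace(r, memory_mi=r.memory_mi * 2),
    "DiskFull": lambda r: replace(r, disk_gi=r.disk_gi * 2),
    "NoStock": lambda r: replace(r, node_type="pay-as-you-go"),
    "CPUThrottled": lambda r: replace(r, cpu_millis=r.cpu_millis * 2),
}


def next_attempt(resources: PodResources, failure_reason: str) -> PodResources:
    """Return the settings for the next retry; unknown reasons change nothing."""
    adjust = ADJUSTMENTS.get(failure_reason)
    return adjust(resources) if adjust else resources


first = PodResources(memory_mi=512, disk_gi=10, cpu_millis=500, node_type="spot")
after_oom = next_attempt(first, "OOMKilled")     # memory doubled, rest unchanged
after_stock = next_attempt(first, "NoStock")     # spot replaced by pay-as-you-go
```

Each strategy only touches the one setting related to the observed failure, so an unrecognized failure falls back to a plain retry with the original settings.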

Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritize the proposals with the most 👍.

Feb 28 '24 08:02 shuangkun

Could https://github.com/argoproj/argo-workflows/issues/10364 be a sufficient solution to this proposal?

Mar 04 '24 03:03 tczhao

Yes, I think with #10362 and #10364 there really isn't a need for this. Increasing resource limits is already possible with those combinations.

Using an existing field would be better than adding sprawl to the spec to cover individual use-cases.

Mar 05 '24 02:03 agilgur5

This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.

Mar 23 '24 02:03 github-actions[bot]

This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.

May 10 '24 02:05 github-actions[bot]

This issue has been closed due to inactivity and lack of information. If you still encounter this issue, please add the requested information and re-open.

May 25 '24 02:05 github-actions[bot]