Add multiple retry strategies to improve workflow success rate
Summary
Add multiple retry strategies to improve the workflow success rate, such as increasing memory, increasing disk size, or changing the node type.
Use Cases
Users want some problems to be resolved automatically so that they do not have to troubleshoot them by hand.
When a task submitted by a user fails and the cause of the failure can be observed, the workflow can sometimes self-heal by retrying with adjusted settings, so that it runs to completion.
For example:
- Increase the memory resource when a pod encounters OOM. https://github.com/argoproj/argo-workflows/discussions/12482
- Increase the disk resource when the disk is full. Common in serverless pods.
- Change the node type when there is no stock, for example from spot to pay-as-you-go. Common in the cloud, especially for GPU machines.
- Increase the CPU limit when heavy throttling occurs. Common on large physical machines.
- Other possible scenarios.
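The scenarios above could be sketched roughly as follows. This is only an illustration of the idea, not Argo Workflows code: the `PodResources` type, the `adjust_for_retry` function, the failure labels, and the doubling factor are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PodResources:
    """Hypothetical view of the settings a retry could adjust."""
    memory_mib: int
    disk_gib: int
    cpu_millicores: int
    node_type: str  # illustrative values: "spot" or "pay-as-you-go"


def adjust_for_retry(res: PodResources, failure: str) -> PodResources:
    """Return the resources to use for the next attempt.

    The failure labels are illustrative only; a real controller would
    have to derive them from pod status, events, or metrics.
    """
    if failure == "oom":
        # Out of memory: double the memory for the next attempt.
        return replace(res, memory_mib=res.memory_mib * 2)
    if failure == "disk-full":
        # Disk full: double the disk size.
        return replace(res, disk_gib=res.disk_gib * 2)
    if failure == "no-stock" and res.node_type == "spot":
        # No spot capacity: fall back to pay-as-you-go nodes.
        return replace(res, node_type="pay-as-you-go")
    if failure == "cpu-throttled":
        # Heavy throttling: double the CPU limit.
        return replace(res, cpu_millicores=res.cpu_millicores * 2)
    # Unknown failure: retry with unchanged settings.
    return res


first_attempt = PodResources(memory_mib=512, disk_gib=10,
                             cpu_millicores=500, node_type="spot")
second_attempt = adjust_for_retry(first_attempt, "oom")
print(second_attempt.memory_mib)
```

Each strategy changes exactly one setting per failed attempt, so a sequence of retries can combine several adjustments if different failures occur in turn.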
Message from the maintainers:
Love this enhancement proposal? Give it a 👍. We prioritize the proposals with the most 👍.
Could https://github.com/argoproj/argo-workflows/issues/10364 be a sufficient solution to this proposal?
Yes, I think that with #10362 and #10364 there really isn't a need for this. Increasing resource limits is already possible with that combination.
Using an existing field would be better than adding sprawl to the spec to cover individual use-cases.
This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.
This issue has been closed due to inactivity and lack of information. If you still encounter this issue, please add the requested information and re-open.