argo-workflows
feat: handle interruptions as a transient error
Motivation
I have been setting this pattern in my TRANSIENT_ERROR_PATTERN to automatically retry pods that were removed for one reason or another. This PR adds it to upstream argo-workflows. I think it would be useful for users, especially those running workflows on preemptible or spot instances.
Modifications
Added the pattern below as a transient error:
(OOMKilled|imminent node shutdown|node was low on resource|pod deleted|node is shutting down|Deadline exceeded)
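To illustrate what the pattern above catches, here is a minimal sketch that compiles it with Go's standard `regexp` package and checks a few messages against it. The helper name `matchesTransientPattern` and the sample messages are hypothetical and only for illustration; this is not the argo-workflows implementation, and how the controller actually applies the pattern may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// Pattern copied verbatim from this PR description.
const transientErrorPattern = `(OOMKilled|imminent node shutdown|node was low on resource|pod deleted|node is shutting down|Deadline exceeded)`

// Compiled once at startup; MustCompile panics if the pattern is invalid.
var transientRe = regexp.MustCompile(transientErrorPattern)

// matchesTransientPattern is a hypothetical helper (not part of argo-workflows).
// It reports whether msg contains any alternative of the pattern.
// Matching is case-sensitive because the pattern carries no (?i) flag.
func matchesTransientPattern(msg string) bool {
	return transientRe.MatchString(msg)
}

func main() {
	// Illustrative messages only; real messages depend on the cluster.
	samples := []string{
		"pod deleted",
		"Pod was terminated in response to imminent node shutdown.",
		"The node was low on resource: memory.",
		"OOMKilled (exit code 137)",
		"exit code 1",
	}
	for _, msg := range samples {
		fmt.Printf("%q transient=%t\n", msg, matchesTransientPattern(msg))
	}
}
```

Because the match is an unanchored substring search, any message that merely contains one of the alternatives counts as transient, so the alternatives should stay specific enough not to catch genuine application failures.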
Verification
This pattern has been used as the TRANSIENT_ERROR_PATTERN at Forbes.
Documentation
This file is already referenced in the documentation.
/retest