screwdriver
Conditionally retry a step on failure
What happened:
Sometimes a step fails because of external dependencies. There is no way to retry the command under different circumstances without starting a new build or making code changes.
What you expected to happen:
Provide an option to conditionally retry failed steps.
- The Screwdriver workflow should provide a step config to specify that a step must be retried.
- Retry should support setting different environment variables.
- Optionally provide a condition that determines whether a retry should happen.
- For Screwdriver-provided setup and teardown steps, cluster admins should be able to define the retry condition.
- Users can optionally specify the `retry` condition without the ability to override the `command`.
- Provide a means to easily add retry configuration to multiple steps.
For example:

    steps:
      sd-setup-scm:
        command: git clone foo bar....
        retry: # object below or just `true`
          condition: $GIT_SHALLOW_CLONE == true # optional
          maxRetry: 3 # optional, default 1
          interval: 3 # optional, default 0 (second)
          environment: # optional
            GIT_SHALLOW_CLONE: false
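
To make the intended semantics concrete, here is a rough Python sketch of how an executor might interpret such a `retry` block. This is not Screwdriver's implementation. The function name `run_with_retry` is hypothetical, and the sketch assumes that `condition` is a shell test whose exit code 0 means "retry", evaluated against the environment of the attempt that just failed, and that `environment` values are applied as overrides on every retry.

```python
import os
import subprocess
import time


def run_with_retry(command, retry=None, env=None):
    """Run a shell command and re-run it on failure per a hypothetical `retry` config.

    `retry` is None/False (no retry), True (all defaults), or a dict using the
    keys proposed above: condition, maxRetry, interval, environment.
    """
    env = dict(os.environ if env is None else env)
    code = subprocess.run(command, shell=True, env=env).returncode
    if retry is None or retry is False:
        return code
    if retry is True:
        retry = {}  # `retry: true` means "use the defaults"

    condition = retry.get("condition")    # assumed: shell test, exit 0 = retry
    max_retry = retry.get("maxRetry", 1)  # default 1, as in the proposal
    interval = retry.get("interval", 0)   # seconds, default 0
    # Assumed: override values are strings (or are rendered with str()).
    overrides = {k: str(v) for k, v in retry.get("environment", {}).items()}

    attempts = 0
    while code != 0 and attempts < max_retry:
        if condition is not None:
            # Evaluate the condition in the environment of the failed attempt.
            check = subprocess.run(condition, shell=True, env=env)
            if check.returncode != 0:
                break
        time.sleep(interval)
        env = {**env, **overrides}  # apply the retry-only environment
        code = subprocess.run(command, shell=True, env=env).returncode
        attempts += 1
    return code
```

With the example config, a failed shallow clone would be retried once with `GIT_SHALLOW_CLONE` set to `false`. If that attempt also failed, the condition would no longer hold, so the loop would stop before reaching `maxRetry`.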
How to reproduce it:
N/A
This is also related to https://github.com/screwdriver-cd/screwdriver/issues/1208. When making model changes, we should keep both features in mind.
I really want this feature 👍 How about adding some useful keys and changing indentation?

    steps:
      sd-setup-scm:
        command: git clone foo bar....
        retry: # object below or just `true`
          condition: $GIT_SHALLOW_CLONE == true # optional
          maxRetry: 3 # optional, default 1
          interval: 3 # optional, default 0 (second)
          environment: # optional
            GIT_SHALLOW_CLONE: false

Adding retry options under the `retry` object makes sense.
Another capability a user asked for in relation to this issue is the option to restart from a previous job.
Also, it would be ideal if `condition` could be a regex matcher (or something similar) applied to the log output. For example, scanning the output for `.*dial tcp: i/o timeout.*` (and being able to set that restart config GLOBALLY in our template) would resolve more than 50% of our spurious failures.
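
As a rough illustration of the log-matching idea, here is a Python sketch. It is not an existing Screwdriver feature. The names `RETRYABLE_PATTERNS`, `should_retry`, and `run_with_log_retry` are hypothetical, and the shared pattern list stands in for config that would be set once in a template.

```python
import re
import subprocess

# Hypothetical globally shared patterns, e.g. defined once in a template.
RETRYABLE_PATTERNS = [re.compile(r".*dial tcp: i/o timeout.*")]


def should_retry(log_output, patterns=RETRYABLE_PATTERNS):
    """Return True if any line of the step's log matches a retryable pattern."""
    return any(
        pattern.search(line)
        for line in log_output.splitlines()
        for pattern in patterns
    )


def run_and_capture(command):
    """Run a shell command, returning (exit code, combined stdout/stderr)."""
    proc = subprocess.run(
        command,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return proc.returncode, proc.stdout


def run_with_log_retry(command, patterns=RETRYABLE_PATTERNS, max_retry=1):
    """Re-run a failed command only while its log matches a retryable pattern.

    Returns (final exit code, number of retries performed).
    """
    code, output = run_and_capture(command)
    retries = 0
    while code != 0 and retries < max_retry and should_retry(output, patterns):
        code, output = run_and_capture(command)
        retries += 1
    return code, retries
```

Matching on the log rather than on the exit code alone keeps genuine failures, such as a compile error, from being retried, while transient network errors get another attempt.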
Any update on this feature?