Conditionally retry a step on failure

Open jithine opened this issue 5 years ago • 6 comments

What happened:

Sometimes a step fails because of an external dependency. There is no way to retry the command under different conditions without starting a new build or changing code.

What you expected to happen:

Provide an option to conditionally retry failed steps.

  1. The Screwdriver workflow should provide a step config specifying that the step must be retried.
  2. Retry should support setting different environment variables.
  3. Optionally provide a condition that determines whether the retry should happen.
  4. For Screwdriver-provided setup and teardown steps, cluster admins should be able to define the retry condition.
    1. Users can optionally specify a retry condition, without the ability to override the command.
  5. Provide a means to easily add retry configuration to multiple steps.

For example

steps:
  sd-setup-scm:
    command: git clone foo bar....
    retry: # object below or just `true`
      condition: $GIT_SHALLOW_CLONE == true # optional
      maxRetry: 3 # optional, default 1
      interval: 3 # optional, default 0 (second)
      environment: # optional
        GIT_SHALLOW_CLONE: false
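
The semantics proposed above could be sketched roughly as follows. This is only an illustration in Python, not Screwdriver code: `RetryConfig` and `run_with_retry` are hypothetical names, and the sketch assumes that `condition` is evaluated against the environment of the attempt that just failed and that `environment` overrides are applied before each retry.

```python
import os
import subprocess
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class RetryConfig:
    """Hypothetical mirror of the proposed `retry` object."""
    max_retry: int = 1      # maxRetry: optional, default 1
    interval: float = 0     # interval in seconds: optional, default 0
    environment: Dict[str, str] = field(default_factory=dict)  # optional overrides
    # condition: optional predicate over the environment of the failed attempt
    condition: Optional[Callable[[Dict[str, str]], bool]] = None


def run_with_retry(command: str, retry: RetryConfig,
                   env: Optional[Dict[str, str]] = None) -> int:
    """Run `command` once, then retry on failure; return the last exit code."""
    # Work on a copy so the caller's environment is never mutated.
    env = dict(os.environ if env is None else env)
    code = subprocess.run(command, shell=True, env=env).returncode
    retries = 0
    while code != 0 and retries < retry.max_retry:
        # Assumption: skip the retry when the condition does not hold.
        if retry.condition is not None and not retry.condition(env):
            break
        time.sleep(retry.interval)
        env.update(retry.environment)  # apply retry-only environment variables
        code = subprocess.run(command, shell=True, env=env).returncode
        retries += 1
    return code
```

Under these assumptions, the example config retries at most once in practice: the first retry sets `GIT_SHALLOW_CLONE` to `false`, so the condition no longer holds for any later attempt.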

How to reproduce it:

N/A

jithine, May 22 '19 16:05

This is also related to https://github.com/screwdriver-cd/screwdriver/issues/1208. When making model changes, we should keep both features in mind.

jithine, May 22 '19 16:05

I really want this feature 👍 How about adding some useful keys and changing indentation?

steps:
  sd-setup-scm:
    command: git clone foo bar....
    retry: # object below or just `true`
      condition: $GIT_SHALLOW_CLONE == true # optional
      maxRetry: 3 # optional, default 1
      interval: 3 # optional, default 0 (second)
      environment: # optional
        GIT_SHALLOW_CLONE: false

catto, May 28 '19 02:05

Adding the retry options under a retry object makes sense.

jithine, May 28 '19 19:05

A user also asked for a related ability: optionally specifying that the restart should begin from a previous job.

tkyi, Dec 18 '19 23:12

Also, it would be ideal if the condition could be a regex matcher (or something similar) applied to the log output. For example, scanning the output for .*dial tcp: i/o timeout.* (and being able to set that restart config GLOBALLY in our template) would resolve more than 50% of our spurious failures.
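
To make the idea concrete, a log-based condition might look something like the sketch below. It is written in Python purely for illustration; `should_retry` is a hypothetical helper rather than anything Screwdriver provides, and only the pattern itself comes from the example above.

```python
import re
from typing import Iterable, Pattern

# Pattern from the comment above; everything else here is hypothetical.
TRANSIENT_FAILURE = re.compile(r".*dial tcp: i/o timeout.*")


def should_retry(log_lines: Iterable[str],
                 pattern: Pattern[str] = TRANSIENT_FAILURE) -> bool:
    """Return True if any line of the failed step's output matches the pattern."""
    return any(pattern.search(line) for line in log_lines)
```

A template-level setting could then supply the pattern once, so every step that inherits from the template gets the same retry rule.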

rm-you, Mar 31 '20 18:03

Any update on this feature?

jkusa, Nov 04 '21 18:11