
Config to automatically re-trigger failed periodics

smg247 opened this issue • 6 comments

OpenShift has certain infra-related periodics that run on a daily (or similar) frequency. They run that often only because the jobs are sometimes flaky and a subsequent run will pass. The frequency could be reduced to weekly if there were a guarantee that the job would be retried a number of times when it fails.

A new config option could be added to support automatically re-triggering a periodic ProwJob only in the case that it fails. It would accept the number of times to retry and the interval at which to trigger each re-run. Something like the following would retrigger a failed job 3 times, 6 hours apart:

retrigger-failed-run:
  attempts: 3
  interval: 6h
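
For illustration, the stanza could sit directly on a periodic definition, roughly like the sketch below (the job name, schedule, and pod spec are placeholders, and the exact placement of the new field would be an implementation decision):

periodics:
- name: periodic-example-infra-check        # placeholder job name
  cron: "0 0 * * 0"                         # reduced to weekly
  retrigger-failed-run:                     # proposed field from this issue
    attempts: 3
    interval: 6h
  spec:
    containers:
    - image: example.invalid/infra-check:latest   # placeholder image
      command:
      - ./check.sh                                # placeholder command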

Implementation details: I think that it may be possible for plank to handle the retriggers.

smg247 avatar Sep 05 '24 14:09 smg247

/cc @stbenjam

smg247 avatar Sep 05 '24 14:09 smg247

Thanks, this is great!

The only question I have is whether, when the job fails, we should unconditionally run it 3 more times, or only run it until it succeeds. The former would help offset the bad signal and give us more confidence in the job's reliability.

Maybe configurable?

retrigger-failed-run:
  strategy: until_success | run_all
  attempts: 3
  interval: 6h

/cc @deads2k

stbenjam avatar Sep 05 '24 16:09 stbenjam

Configurable would be good.

  1. Sometimes we definitely want three more runs.
  2. Sometimes we want to run up to three more times, stopping at the first success.

deads2k avatar Sep 05 '24 18:09 deads2k

Retest until success seems universally useful, but I'm not a fan of having logic where a single passing job is good but a single failing job results in multiple passing jobs later.

I think that it may be possible for plank to handle the retriggers

Almost certainly not plank. Plank consumes ProwJobs; it should not create them (unless you'd do the re-runs as additional Pods for a single ProwJob, for which we would need to rethink big parts of e.g. artifact reporting). I believe this belongs in horologium, especially with the interval: 6h config. We'd probably need some horologium-specific annotations on ProwJobs to recognize their position in a retest series and prevent each subsequent failure from causing a new round of retests.
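
For example, something like the following annotations could carry that state (the keys here are made up just to sketch the idea, not existing Prow API):

apiVersion: prow.k8s.io/v1
kind: ProwJob
metadata:
  annotations:
    # hypothetical horologium bookkeeping
    horologium.prow.k8s.io/retrigger-attempt: "2"                        # position in the retest series
    horologium.prow.k8s.io/retrigger-of: "<uid-of-original-failed-run>"  # run that started the series
spec:
  type: periodic
  job: some-weekly-periodic   # placeholder periodic name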

There are more fun interactions to resolve, like how do the retests interact with standard interval-triggered periodics? Would they delay them?

petr-muller avatar Sep 06 '24 09:09 petr-muller

Retest until success seems universally useful, but I'm not a fan of having logic where a single passing job is good but a single failing job results in multiple passing jobs later.

For infrequently run jobs, we still want to be able to detect subtler regressions. If a developer makes an existing test go from 99% to 50%, we'll eventually get a failure on weekly runs -- that's our first hint that something is wrong, but we need more than 1 additional attempt to confirm. We'd get it eventually, but it could take a month+. The unconditional attempts are a signal booster.
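
Rough numbers to illustrate (assuming independent runs and a regression from ~99% to a 50% pass rate):

single weekly run:                    chance of seeing a failure that week  = 0.50
failed run + 3 unconditional reruns:  chance at least one rerun also fails  = 1 - 0.5^3 ≈ 0.88,
                                      plus three fresh data points on the new pass rate

With until_success, a lucky first rerun ends the series, so on average we'd collect fewer samples from a 50%-reliable job.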

stbenjam avatar Sep 06 '24 13:09 stbenjam

Okay, that makes sense, I see the value now :+1: It helps to amplify subtle decreases in reliability while saving resources because jobs that we think are solid may not need to run as often.

petr-muller avatar Sep 06 '24 13:09 petr-muller

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 24 '24 13:12 k8s-triage-robot

/remove-lifecycle stale

smg247 avatar Jan 02 '25 12:01 smg247

/assign @jmguzik

jmguzik avatar Jan 15 '25 07:01 jmguzik

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 15 '25 08:04 k8s-triage-robot

This was implemented in #358

/close

petr-muller avatar Apr 15 '25 16:04 petr-muller

@petr-muller: Closing this issue.

In response to this:

This was implemented in #358

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Apr 15 '25 16:04 k8s-ci-robot