Kubernetes CI Policy: remove egregiously perma-failing jobs

Open spiffxp opened this issue 5 years ago • 17 comments

Part of https://github.com/kubernetes/test-infra/issues/18551

Why this is important:

  • jobs that have been failing for hundreds of days are a drain on community resources
  • the fact that they've been failing this long means we've been getting by without their signal, so it's probably more economical to cut our losses than to make diving saves

http://storage.googleapis.com/k8s-metrics/failures-latest.json provides a list of jobs that have been failing continuously based on results stored in GCS. Note that not everything stored in GCS comes from prow.k8s.io; we allow for federated test results via https://github.com/kubernetes/test-infra/blob/master/kettle/buckets.yaml

Good candidates for removal include:

  • failing > 365 days
  • runs on prow.k8s.io but is testing out-of-support releases

Make sure to include either @spiffxp or @BenTheElder on PRs for these. Not all of these are clear-cut removals, and we may want to try to find a job owner or some other way to mitigate.

We should close this issue once we settle on a formal definition of "egregious" and verify that we've handled everything that meets it. We should then feed whatever we've learned here into a policy for maintaining job health going forward (which is basically the end goal of https://github.com/kubernetes/test-infra/issues/18599 as well).

spiffxp avatar Aug 01 '20 03:08 spiffxp

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle stale

fejta-bot avatar Oct 30 '20 04:10 fejta-bot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle rotten

fejta-bot avatar Nov 29 '20 05:11 fejta-bot

/remove-lifecycle rotten

BenTheElder avatar Dec 01 '20 00:12 BenTheElder

We still have egregiously perma-failing jobs. For example, here are the top 3 from http://storage.googleapis.com/k8s-metrics/failures-latest.json:

  "ci-kubernetes-node-kubelet-serial": {
    "failing_days": 1098
  },
  "ci-kubernetes-e2enode-ubuntu2-k8sstable3-gkespec": {
    "failing_days": 1021
  },
  "ci-kubernetes-e2e-gci-gce-statefulset": {
    "failing_days": 969
  },
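
To make the "failing > 365 days" criterion concrete, here is a minimal sketch that filters a results file for removal candidates. It assumes the full file has the same shape as the excerpt above (job name mapped to an object with a `failing_days` field); the `removal_candidates` helper and the inline `SAMPLE` are illustrative and not part of test-infra.

```python
import json

# Sample in the shape of the excerpt above. Assumption: the real
# failures-latest.json maps job name -> {"failing_days": N, ...}.
SAMPLE = """
{
  "ci-kubernetes-node-kubelet-serial": {"failing_days": 1098},
  "ci-kubernetes-e2enode-ubuntu2-k8sstable3-gkespec": {"failing_days": 1021},
  "ci-kubernetes-e2e-gci-gce-statefulset": {"failing_days": 969}
}
"""


def removal_candidates(failures, min_days=365):
    """Return (job, failing_days) pairs failing for more than min_days, worst first."""
    hits = [
        (job, info["failing_days"])
        for job, info in failures.items()
        if info.get("failing_days", 0) > min_days
    ]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


failures = json.loads(SAMPLE)
for job, days in removal_candidates(failures):
    print(f"{days:5d}  {job}")
```

Raising or lowering `min_days` is one way to experiment with different definitions of "egregious" before settling on one.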

spiffxp avatar Jan 08 '21 21:01 spiffxp

https://github.com/kubernetes/test-infra/pull/21141 removed one

Need to refresh where we're at here.

spiffxp avatar Mar 04 '21 00:03 spiffxp

Jobs that fail 100% of Up or Test are good candidates - https://storage.googleapis.com/k8s-gubernator/triage/index.html?test=%5E(Up%7CTest)%24
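
For readers unfamiliar with the percent-encoded `test` parameter in that link, the sketch below decodes it and shows which test names the resulting pattern selects. It uses Python's `re` module purely for illustration; how the triage page itself applies the pattern is an assumption here.

```python
import re
from urllib.parse import parse_qs, urlsplit

TRIAGE_URL = (
    "https://storage.googleapis.com/k8s-gubernator/triage/index.html"
    "?test=%5E(Up%7CTest)%24"
)

# parse_qs percent-decodes the query value, giving the raw regex text.
test_filter = parse_qs(urlsplit(TRIAGE_URL).query)["test"][0]
pattern = re.compile(test_filter)


def matches(test_name):
    """True if the name is selected by the filter (Python regex semantics)."""
    return pattern.search(test_name) is not None


print(test_filter)
for name in ("Up", "Test", "TearDown", "Test something else"):
    print(name, matches(name))
```

The anchors in the pattern mean only tests named exactly `Up` or `Test` are selected, not tests whose names merely contain those words.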

liggitt avatar Mar 06 '21 17:03 liggitt

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

fejta-bot avatar Jun 04 '21 17:06 fejta-bot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

fejta-bot avatar Jul 04 '21 17:07 fejta-bot

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot avatar Aug 03 '21 18:08 k8s-triage-robot

@k8s-triage-robot: Closing this issue.

k8s-ci-robot avatar Aug 03 '21 18:08 k8s-ci-robot

/reopen

/remove-lifecycle rotten

dims avatar Aug 03 '21 18:08 dims

@dims: Reopened this issue.

k8s-ci-robot avatar Aug 03 '21 18:08 k8s-ci-robot

/milestone v1.23

spiffxp avatar Aug 10 '21 16:08 spiffxp

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Nov 08 '21 17:11 k8s-triage-robot

/remove-lifecycle stale

/lifecycle frozen

These jobs aren't going anywhere and this has to be dealt with someday.

BenTheElder avatar Nov 08 '21 18:11 BenTheElder

xref: https://github.com/kubernetes/kubernetes/issues/109521

dims avatar Apr 18 '22 18:04 dims

/assign

dims avatar Apr 18 '22 19:04 dims