Kubernetes CI Policy: remove egregiously perma-failing jobs
Part of https://github.com/kubernetes/test-infra/issues/18551
Why this is important:
- jobs that have been failing for hundreds of days are a drain on community resources
- the fact that they've been failing this long means we've been getting by without their signal; it's probably more economical to cut our losses than to attempt diving saves
http://storage.googleapis.com/k8s-metrics/failures-latest.json provides a list of jobs that have been failing continuously, based on results stored in GCS. Note that not everything stored in GCS comes from prow.k8s.io; we allow for federated test results via https://github.com/kubernetes/test-infra/blob/master/kettle/buckets.yaml
Good candidates for removal include:
- jobs that have been failing for > 365 days (a rough filter for this is sketched below)
- jobs that run on prow.k8s.io but test out-of-support releases
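For illustration, here's a minimal sketch of pulling candidates against the first criterion. The flat job-name-to-failing_days shape is inferred from the excerpts later in this thread, and the 365-day cutoff and the Go harness are assumptions, not an existing tool:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"sort"
)

// failureEntry mirrors the per-job objects seen in failures-latest.json,
// e.g. "ci-kubernetes-node-kubelet-serial": {"failing_days": 1098}.
type failureEntry struct {
	FailingDays int `json:"failing_days"`
}

func main() {
	// Fetch the latest continuous-failure metrics from GCS.
	resp, err := http.Get("https://storage.googleapis.com/k8s-metrics/failures-latest.json")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	failures := map[string]failureEntry{}
	if err := json.NewDecoder(resp.Body).Decode(&failures); err != nil {
		panic(err)
	}

	// Keep only jobs past the (assumed) 365-day "egregious" threshold.
	type candidate struct {
		name string
		days int
	}
	var candidates []candidate
	for name, entry := range failures {
		if entry.FailingDays > 365 {
			candidates = append(candidates, candidate{name, entry.FailingDays})
		}
	}

	// Worst offenders first.
	sort.Slice(candidates, func(i, j int) bool {
		return candidates[i].days > candidates[j].days
	})
	for _, c := range candidates {
		fmt.Printf("%s: failing for %d days\n", c.name, c.days)
	}
}
```

Anything a filter like this prints would still need the manual review described below before actually being removed.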
Make sure to include either @spiffxp or @BenTheElder on PRs for these. Not all of these are clear-cut removals; we may want to make an effort to find a job owner or otherwise find a way to mitigate.
We should close this issue once we decide on a formal definition of "egregious" and verify that we've handled everything that meets it. We should then feed whatever we've learned here into a policy for maintaining job health going forward (which is basically the end goal of https://github.com/kubernetes/test-infra/issues/18599 as well).
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten
/remove-lifecycle rotten
We still have egregiously perma-failing jobs. For example, the top 3 from http://storage.googleapis.com/k8s-metrics/failures-latest.json:

```json
"ci-kubernetes-node-kubelet-serial": {
  "failing_days": 1098
},
"ci-kubernetes-e2enode-ubuntu2-k8sstable3-gkespec": {
  "failing_days": 1021
},
"ci-kubernetes-e2e-gci-gce-statefulset": {
  "failing_days": 969
},
```
https://github.com/kubernetes/test-infra/pull/21141 removed one.
Need to refresh where we're at here.
Jobs whose Up or Test steps fail 100% of the time are good candidates: https://storage.googleapis.com/k8s-gubernator/triage/index.html?test=%5E(Up%7CTest)%24 (the sketch below unpacks what that query encodes).
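To make that link less opaque, here is a minimal sketch of how its query string is built, assuming nothing beyond the Go standard library; the reading of Up as cluster bring-up and Test as the e2e run reflects the synthetic step results kubetest records:

```go
package main

import (
	"fmt"
	"net/url"
)

func main() {
	// The triage UI filters failure clusters by a regex over test names,
	// passed in the "test" query parameter. ^(Up|Test)$ matches only the
	// synthetic "Up" (cluster bring-up) and "Test" (e2e run) results.
	base := "https://storage.googleapis.com/k8s-gubernator/triage/index.html"
	params := url.Values{}
	params.Set("test", "^(Up|Test)$")
	fmt.Println(base + "?" + params.Encode())
	// Prints the same query as the link above, with the regex
	// URL-encoded: test=%5E%28Up%7CTest%29%24
}
```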
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-contributor-experience at kubernetes/community. /close
@k8s-triage-robot: Closing this issue.
/reopen /remove-lifecycle rotten
@dims: Reopened this issue.
/milestone v1.23
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/lifecycle frozen
These jobs aren't going anywhere, and this has to be dealt with someday.
xref: https://github.com/kubernetes/kubernetes/issues/109521
/assign