Kubernetes CI Policy: remove egregiously perma-failing jobs
Part of https://github.com/kubernetes/test-infra/issues/18551
Why this is important:
- jobs that have been failing for hundreds of days are a drain on community resources
- the fact that they've been failing this long means we've been getting by without their signal; it's probably more economical to cut our losses than to attempt diving saves
http://storage.googleapis.com/k8s-metrics/failures-latest.json provides a list of jobs that have been failing continuously, based on results stored in GCS. Note that not everything stored in GCS comes from prow.k8s.io; we allow for federated test results via https://github.com/kubernetes/test-infra/blob/master/kettle/buckets.yaml
Good candidates for removal include:
- jobs that have been failing for > 365 days (a rough filter for this is sketched below)
- jobs that run on prow.k8s.io but test out-of-support releases
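For illustration, here's a minimal sketch of pulling candidates against the first criterion. The flat job-name-to-failing_days shape is inferred from the excerpts later in this thread, and the 365-day cutoff and the Go harness are assumptions, not an existing tool:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"sort"
)

// failureEntry mirrors the per-job objects seen in failures-latest.json,
// e.g. "ci-kubernetes-node-kubelet-serial": {"failing_days": 1098}.
type failureEntry struct {
	FailingDays int `json:"failing_days"`
}

func main() {
	// Fetch the latest continuous-failure metrics from GCS.
	resp, err := http.Get("https://storage.googleapis.com/k8s-metrics/failures-latest.json")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	failures := map[string]failureEntry{}
	if err := json.NewDecoder(resp.Body).Decode(&failures); err != nil {
		panic(err)
	}

	// Keep only jobs past the (assumed) 365-day "egregious" threshold.
	type candidate struct {
		name string
		days int
	}
	var candidates []candidate
	for name, entry := range failures {
		if entry.FailingDays > 365 {
			candidates = append(candidates, candidate{name, entry.FailingDays})
		}
	}

	// Worst offenders first.
	sort.Slice(candidates, func(i, j int) bool {
		return candidates[i].days > candidates[j].days
	})
	for _, c := range candidates {
		fmt.Printf("%s: failing for %d days\n", c.name, c.days)
	}
}
```

Anything a filter like this prints would still need the manual review described below before actually being removed.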
Make sure to include either @spiffxp or @BenTheElder on PRs for these. Not all of these are clear-cut removals; we may want to make an effort to find a job owner or otherwise find a way to mitigate.
We should close this issue once we decide on a formal definition of "egregious" and verify that we've handled everything that meets it. We should then feed whatever we've learned here into a policy for maintaining job health going forward (which is basically the end goal of https://github.com/kubernetes/test-infra/issues/18599 as well).
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten
/remove-lifecycle rotten
We still have egregiously perma-failing jobs. For example, the top 3 from http://storage.googleapis.com/k8s-metrics/failures-latest.json:

```json
"ci-kubernetes-node-kubelet-serial": {
  "failing_days": 1098
},
"ci-kubernetes-e2enode-ubuntu2-k8sstable3-gkespec": {
  "failing_days": 1021
},
"ci-kubernetes-e2e-gci-gce-statefulset": {
  "failing_days": 969
},
```
https://github.com/kubernetes/test-infra/pull/21141 removed one.
Need to refresh where we're at here.
Jobs whose Up or Test steps fail 100% of the time are good candidates: https://storage.googleapis.com/k8s-gubernator/triage/index.html?test=%5E(Up%7CTest)%24 (the sketch below unpacks what that query encodes).
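To make that link less opaque, here is a minimal sketch of how its query string is built, assuming nothing beyond the Go standard library; the reading of Up as cluster bring-up and Test as the e2e run reflects the synthetic step results kubetest records:

```go
package main

import (
	"fmt"
	"net/url"
)

func main() {
	// The triage UI filters failure clusters by a regex over test names,
	// passed in the "test" query parameter. ^(Up|Test)$ matches only the
	// synthetic "Up" (cluster bring-up) and "Test" (e2e run) results.
	base := "https://storage.googleapis.com/k8s-gubernator/triage/index.html"
	params := url.Values{}
	params.Set("test", "^(Up|Test)$")
	fmt.Println(base + "?" + params.Encode())
	// Prints the same query as the link above, with the regex
	// URL-encoded: test=%5E%28Up%7CTest%29%24
}
```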
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-contributor-experience at kubernetes/community. /close
@k8s-triage-robot: Closing this issue.
/reopen /remove-lifecycle rotten
@dims: Reopened this issue.
/milestone v1.23
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/lifecycle frozen
These jobs aren't going anywhere, and this has to be dealt with someday.
xref: https://github.com/kubernetes/kubernetes/issues/109521
/assign