Cleanup usage of kubernetes-release-pull in kubernetes presubmits
What should be cleaned up or changed:
We stage builds to gs://kubernetes-release-pull in almost every presubmit job.
But from what I can tell, nothing is actually consuming those builds, since the jobs also use `--extract=local`.
Uploading the release tars in every presubmit is non-trivial overhead, and we should remove all the non-required usages.
Provide any links for context:
- https://cs.k8s.io/?q=kubernetes-release-pull&i=nope&files=&repos=
- https://github.com/kubernetes/test-infra/blob/c4628a3a1c2e4c149f669348f351fe78ebd1f258/kubetest/extract_k8s.go#L449-L466
- Random GCE provider job: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1293275406807339008#1:build-log.txt%3A903
/cc @spiffxp @BenTheElder @MushuEE
EDIT(@spiffxp): I made a list of the offending jobs going off the criteria `--extract=local` and `--stage=gs://kubernetes-release-pull/*`
- if the job triggers for a single branch, it's labeled as `job@branch`
- if the job triggers for all branches, it's labeled as `job`
- there are no presubmits that trigger for N branches (where all > N > 1)
- there are no periodics or postsubmits that touch gs://kubernetes-release-pull
- this picks up some `--provider=aws` jobs (kops); it remains to be seen whether they need `--stage` or not
The jobs to be fixed are:
- [ ] pull-kubernetes-e2e-containerd-gce
- [ ] [email protected]
- [ ] [email protected]
- [ ] [email protected]
- [ ] [email protected]
- [ ] pull-kubernetes-e2e-gce
- [ ] pull-kubernetes-e2e-gce-100-performance
- [ ] [email protected]
- [ ] [email protected]
- [ ] [email protected]
- [ ] [email protected]
- [ ] pull-kubernetes-e2e-gce-alpha-features@master
- [ ] pull-kubernetes-e2e-gce-big-performance@master
- [ ] pull-kubernetes-e2e-gce-canary
- [ ] [email protected]
- [ ] [email protected]
- [ ] [email protected]
- [ ] [email protected]
- [ ] pull-kubernetes-e2e-gce-correctness
- [ ] pull-kubernetes-e2e-gce-csi-serial@master
- [ ] pull-kubernetes-e2e-gce-device-plugin-gpu
- [ ] [email protected]
- [ ] [email protected]
- [ ] [email protected]
- [ ] [email protected]
- [ ] pull-kubernetes-e2e-gce-iscsi-serial@master
- [ ] pull-kubernetes-e2e-gce-iscsi@master
- [ ] pull-kubernetes-e2e-gce-large-performance@master
- [ ] pull-kubernetes-e2e-gce-network-proxy-grpc
- [ ] pull-kubernetes-e2e-gce-network-proxy-http-connect
- [ ] pull-kubernetes-e2e-gce-storage-disruptive@master
- [ ] pull-kubernetes-e2e-gce-storage-slow@master
- [ ] pull-kubernetes-e2e-gce-storage-snapshot@master
- [ ] pull-kubernetes-e2e-gce-ubuntu
- [ ] pull-kubernetes-e2e-gce-ubuntu-containerd
- [ ] pull-kubernetes-e2e-gce-ubuntu-containerd-canary
- [ ] [email protected]
- [ ] [email protected]
- [ ] [email protected]
- [ ] [email protected]
- [ ] [email protected]
- [ ] [email protected]
- [ ] [email protected]
- [ ] [email protected]
- [ ] [email protected]
- [ ] [email protected]
- [ ] [email protected]
- [ ] [email protected]
- [ ] [email protected]
- [ ] [email protected]
- [ ] pull-kubernetes-e2e-gci-gce-autoscaling
- [ ] pull-kubernetes-e2e-gci-gce-ingress@master
- [ ] pull-kubernetes-e2e-gci-gce-ipvs
- [ ] pull-kubernetes-e2e-kops-aws
- [ ] [email protected]
- [ ] [email protected]
- [ ] [email protected]
- [ ] [email protected]
- [ ] pull-kubernetes-e2e-ubuntu-gce-network-policies@master
- [ ] pull-kubernetes-e2e-windows-gce@master
- [ ] pull-kubernetes-kubemark-e2e-gce-big
- [ ] [email protected]
- [ ] [email protected]
- [ ] [email protected]
- [ ] [email protected]
- [ ] pull-kubernetes-kubemark-e2e-gce-scale@master
- [ ] pull-release-cluster-up
we should test this in a canary just because this stuff is old and brittle and I can't remember why we were doing this anymore 🙃
update: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/92316/pull-kubernetes-e2e-gce-no-stage/1294345203125063682/#1:build-log.txt%3A355
The local path (https://github.com/kubernetes/test-infra/blob/c4628a3a1c2e4c149f669348f351fe78ebd1f258/kubetest/extract_k8s.go#L450)
comes from https://github.com/kubernetes/release/blob/8d6bd15010efeec44018e4847860d464d2682d97/lib/releaselib.sh#L1245-L1247,
which shouldn't be needed since we already have the tars under `bazel-bin/build/release-tars` (though without the hashes):
https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/92316/pull-kubernetes-e2e-gce-no-stage/1294345203125063682/#1:build-log.txt%3A338
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten
still worth doing?
/remove-lifecycle rotten I think so. The other option is to continue as-is, meaning jobs that use this bucket need to switch to use k8s-release-pull as they migrate to k8s-infra.
Sadly, it looks like those GCS links have been garbage-collected.
It seems like one of the steps performed as part of `--stage` is copying the artifacts from the bazel output path to the make output path `_output/gcs-stage` and then uploading them to GCS.
Our presubmit jobs are configured with `--extract=local` instead of `--extract=bazel` while using `--build=bazel`,
so they were relying on the artifacts being in the make output path.
https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-cloud-provider/gcp/gcp-gce.yaml#L48
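For context, a minimal sketch of the flag combination described above, assuming a typical presubmit entry (the job shape here is illustrative, not the literal config from the file linked above):

```yaml
# Illustrative sketch only: build with bazel, stage to the shared
# kubernetes-release-pull bucket, then extract "locally" from the make
# output path that --stage happens to populate.
- name: pull-kubernetes-e2e-gce
  spec:
    containers:
    - args:
      - --provider=gce
      - --build=bazel
      - --stage=gs://kubernetes-release-pull/ci/pull-kubernetes-e2e-gce
      - --extract=local
```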
testing out in the canary job: https://github.com/kubernetes/test-infra/pull/20427
/milestone v1.21 /sig testing /wg-k8s-infra
We have a successful run at https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce-no-stage/1352340847076577280
Not sure why the total test duration is higher compared to https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1351850610516824064,
but we at least saved 154 seconds of stage time (which should be the only delta here):
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/92316/pull-kubernetes-e2e-gce-no-stage/1352340847076577280/artifacts/junit_runner.xml
as compared to
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/97894/pull-kubernetes-e2e-gce/1351850610516824064/artifacts/junit_runner.xml
and 1.84 GiB of unnecessary GCS uploads:
```console
$ gsutil du -sh gs://kubernetes-release-pull/ci/pull-kubernetes-e2e-gce/v1.18.16-rc.0.3+9f5c61d324a62b
1.84 GiB    gs://kubernetes-release-pull/ci/pull-kubernetes-e2e-gce/v1.18.16-rc.0.3+9f5c61d324a62b
```
/priority important-soon
/assign @amwat @spiffxp
Assigning to us for now. If we think this is eligible for /help or don't have time to do it ourselves, we can write up how to proceed (either way). Next steps:
- All jobs in question: https://cs.k8s.io/?q=kubernetes-release-pull&i=nope&files=&repos= i.e. those with `--stage=gs://kubernetes-release-pull.*`
- Affected jobs: those with `--provider=gce --extract=local --build=bazel`; these should be fixed to use `--extract=bazel` and no `--stage` (see the sketch after this list)
- Short of having a canary job for every affected job, a reasonable compromise with less disruption would be to start with the non-blocking jobs, e.g. https://testgrid.k8s.io/presubmits-kubernetes-nonblocking#pull-kubernetes-e2e-gce-alpha-features, and once they are monitored and healthy, fix the blocking jobs.
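For illustration, a rough sketch of the per-job edit this proposes; the surrounding args vary per job, and this approach was later revised (see the follow-up comments below):

```yaml
# Sketch only: the change as proposed at this point (later revised).
# Before (typical affected presubmit kubetest args):
#   - --build=bazel
#   - --stage=gs://kubernetes-release-pull/ci/<job-name>
#   - --extract=local
# After: extract straight from the bazel build output and drop --stage.
args:
- --provider=gce
- --build=bazel
- --extract=bazel
```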
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
/remove-lifecycle rotten /milestone v1.22
Updated the description with a list of jobs to update as generated by the test I'm adding in https://github.com/kubernetes/test-infra/pull/22890
```console
go test -v -count=1 \
  ./config/tests/jobs/ \
  -run TestKubernetesPresubmitsShouldNotUseKubernetesReleasePullBucket \
  | grep "should not" \
  | cut -d: -f4 \
  | sort | uniq \
  | sed -e 's/^/- [ ] /'
```
Opened https://github.com/kubernetes/test-infra/pull/22892 to go after the jobs where it will be most obvious if this somehow ends up having an unforeseen negative impact.
https://github.com/kubernetes/test-infra/pull/22892 broke things, ref: https://github.com/kubernetes/test-infra/pull/22892#issuecomment-880245096
The analysis in #18789 (comment) depended on bazel doing the local staging, but now that there's no bazel, we seem to still be relying on just the local-staging part of `--stage` for `--extract=local`.
Reverted in https://github.com/kubernetes/test-infra/pull/22894
/milestone v1.23 Let's just wait until v1.22 goes out the door before we bother with this again
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/lifecycle frozen this is a requirement for smooth migration /help
@BenTheElder: This request has been marked as needing help from a contributor.
Guidelines
Please ensure that the issue body includes answers to the following questions:
- Why are we solving this issue?
- To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
- Does this issue have zero to low barrier of entry?
- How can the assignee reach out to you for help?
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.
In response to this:
/lifecycle frozen this is a requirement for smooth migration /help
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Based on https://github.com/kubernetes/test-infra/pull/22892#issuecomment-880245096 (and the changes being reverted), how should we proceed with this? I started working through this and realized I was re-doing @spiffxp's changes :smile:
Hi, I am interested in working on this issue, but I have some questions:
- Seems like the list of jobs to be fixed is outdated
- Please help me understand the fix we need to follow here. I don't think the fix @amwat mentioned here applies anymore, because we are no longer using `bazel` to build. So, as discussed in https://github.com/kubernetes/test-infra/pull/24238#issuecomment-961504149, can we now remove `--extract` and `--stage`? Please feel free to correct me if I am wrong.
cc @spiffxp @BenTheElder @ameukam
Sorry, a couple of the people you pinged don't work on this anymore and I'm kinda buried.
I've lost context on this one.
I'm not sure we ever got no-stage working? It's hard to follow at this point.
https://github.com/kubernetes/test-infra/pull/28176 renamed the test job, testing in https://github.com/kubernetes/kubernetes/pull/126563
It does; it will stage to a generated bucket under the rented boskos project (which the boskos janitors should clean up, if they don't already), so we can carefully start dropping these, I think ... very belatedly.
beginning bulk migration in https://github.com/kubernetes/test-infra/pull/33259, starting with a subset of optional, non-blocking, not always_run jobs
We have to drop both `--extract=local` and `--stage` at the same time. We don't need to locally extract what we just built; it's running fine and uploading to a bucket under the boskos project.
You can see sample runs in https://github.com/kubernetes/kubernetes/pull/126563
Inspect these logs:
- https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/126563/pull-kubernetes-e2e-gce-cos-no-stage/1820909457454927872
- https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/126563/pull-kubernetes-e2e-gce-pull-through-cache/1821254221408768000
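For clarity, a hedged sketch of the per-job edit described above (the remaining args shown are illustrative placeholders, not a full job config; each job keeps its other flags as-is):

```yaml
# Sketch of the final change: remove both flags together.
# <job-name> is a placeholder, not a real path.
args:
- --provider=gce
# removed together, per the comment above:
# - --extract=local
# - --stage=gs://kubernetes-release-pull/ci/<job-name>
```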