
Cleanup usage of kubernetes-release-pull in kubernetes presubmits

amwat opened this issue 5 years ago • 26 comments

What should be cleaned up or changed: We stage builds to gs://kubernetes-release-pull in almost every presubmit job, but from what I can tell nothing actually consumes those builds, since the jobs also use --extract=local. Uploading the release tars in every presubmit is non-trivial overhead, and we should remove all of the unneeded usages.

Provide any links for context: https://cs.k8s.io/?q=kubernetes-release-pull&i=nope&files=&repos= https://github.com/kubernetes/test-infra/blob/c4628a3a1c2e4c149f669348f351fe78ebd1f258/kubetest/extract_k8s.go#L449-L466 Random GCE provider job: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1293275406807339008#1:build-log.txt%3A903
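
For illustration, the flag combination in question looks roughly like this (a hedged sketch; exact flags and the bucket suffix vary per job, see the code search link above):

# sketch of the flag combination this issue wants to remove (per-job flags vary;
# the bucket path below is illustrative)
# --stage uploads the freshly built release tars to GCS on every run, while
# --extract=local means the job consumes the locally built tars anyway
kubetest --build=bazel \
  --stage=gs://kubernetes-release-pull/ci/pull-kubernetes-e2e-gce \
  --extract=local \
  --provider=gce \
  --test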

/cc @spiffxp @BenTheElder @MushuEE


EDIT(@spiffxp): I made a list of the offending jobs based on the criteria --extract=local and --stage=gs://kubernetes-release-pull/* (a rough sketch for regenerating such a list follows the criteria below):

  • if the job triggers for a single branch it's labeled as job@branch
  • if the job triggers for all branches it's labeled as job
  • there are no presubmits that trigger for N branches (where all > N > 1)
  • there are no periodics or postsubmits that touch gs://kubernetes-release-pull
  • this picks up some --provider=aws jobs (kops), it remains to be seen whether they need --stage or not
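
A rough sketch of how such a list could be regenerated from a test-infra checkout (hedged: this only finds files containing both strings, so a job-by-job pass and the @branch labeling above are still manual):

# find job config files that contain both offending flags
# (assumes a test-infra checkout, run from the repo root)
grep -rl -e '--stage=gs://kubernetes-release-pull' config/jobs/kubernetes/ \
  | xargs grep -l -e '--extract=local' \
  | sort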

The jobs to be fixed are:

amwat avatar Aug 11 '20 22:08 amwat

we should test this in a canary just because this stuff is old and brittle and I can't remember why we were doing this anymore 🙃

BenTheElder avatar Aug 13 '20 05:08 BenTheElder

update: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/92316/pull-kubernetes-e2e-gce-no-stage/1294345203125063682/#1:build-log.txt%3A355

the local path https://github.com/kubernetes/test-infra/blob/c4628a3a1c2e4c149f669348f351fe78ebd1f258/kubetest/extract_k8s.go#L450

comes from https://github.com/kubernetes/release/blob/8d6bd15010efeec44018e4847860d464d2682d97/lib/releaselib.sh#L1245-L1247

which shouldn't be needed since we already have them under bazel-bin/build/release-tars (but without the hashes) https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/92316/pull-kubernetes-e2e-gce-no-stage/1294345203125063682/#1:build-log.txt%3A338
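
In other words, the tars --stage re-copies are already present in the bazel output tree, just without the accompanying hash files (a hedged sketch; exact file names vary by version):

# after --build=bazel the release tars are already in the bazel output tree ...
ls bazel-bin/build/release-tars/*.tar.gz
# ... just without the hash files the staging step would generate alongside them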

amwat avatar Aug 14 '20 19:08 amwat

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Nov 26 '20 10:11 fejta-bot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot avatar Dec 26 '20 10:12 fejta-bot

still worth doing?

BenTheElder avatar Jan 06 '21 19:01 BenTheElder

/remove-lifecycle rotten I think so. The other option is to continue as-is, meaning jobs that use this bucket need to switch to use k8s-release-pull as they migrate to k8s-infra.

spiffxp avatar Jan 08 '21 22:01 spiffxp

Sadly, it looks like those GCS links have been GC'd. It seems that one of the steps involved in --stage is copying the artifacts from the bazel output path to the make output path (_output/gcs-stage) and then uploading them to GCS.

And our presubmit jobs are configured with --extract=local instead of --extract=bazel while using --build=bazel, so they were relying on the tars being in the make output path: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-cloud-provider/gcp/gcp-gce.yaml#L48
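
Roughly, what --stage does locally per run looks like this (a simplified, hedged sketch based on the releaselib.sh link in the earlier comment; exact commands, paths, and the bucket layout are illustrative):

# copy the already-built tars from the bazel output tree into the "make" layout
mkdir -p _output/gcs-stage
cp bazel-bin/build/release-tars/*.tar.gz _output/gcs-stage/
# hash files get generated alongside the tars, then everything is uploaded;
# JOB_NAME is the prow-provided job name, the version component is illustrative
gsutil -m cp -r _output/gcs-stage/* "gs://kubernetes-release-pull/ci/${JOB_NAME}/${BUILD_VERSION}/"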

testing out in the canary job: https://github.com/kubernetes/test-infra/pull/20427

amwat avatar Jan 08 '21 23:01 amwat

/milestone v1.21 /sig testing /wg-k8s-infra

spiffxp avatar Jan 21 '21 19:01 spiffxp

We have a successful run at https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce-no-stage/1352340847076577280

Not sure why the total test duration is higher compared to https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1351850610516824064,

but we at least saved 154 seconds of stage time (which should be the only delta here)

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/92316/pull-kubernetes-e2e-gce-no-stage/1352340847076577280/artifacts/junit_runner.xml

as compared to

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/97894/pull-kubernetes-e2e-gce/1351850610516824064/artifacts/junit_runner.xml

and 1.84 GiB of unnecessary GCS uploads

$ gsutil du -sh gs://kubernetes-release-pull/ci/pull-kubernetes-e2e-gce/v1.18.16-rc.0.3+9f5c61d324a62b
1.84 GiB     gs://kubernetes-release-pull/ci/pull-kubernetes-e2e-gce/v1.18.16-rc.0.3+9f5c61d324a62b
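
For a rough sense of the aggregate waste, the same measurement can be widened (hedged sketch; these are just larger-scope variants of the command above and can be slow on a bucket this size):

# total bytes staged under a single job's prefix
gsutil du -sh gs://kubernetes-release-pull/ci/pull-kubernetes-e2e-gce/
# or across the whole bucket
gsutil du -sh gs://kubernetes-release-pull/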

amwat avatar Jan 21 '21 21:01 amwat

/priority important-soon

spiffxp avatar Jan 22 '21 23:01 spiffxp

/assign @amwat @spiffxp Assigning to us for now. If we think this is eligible for /help or don't have time to do it ourselves, we can write up how to proceed.

spiffxp avatar Jan 26 '21 23:01 spiffxp

(Either way) Next steps: all jobs in question are listed at https://cs.k8s.io/?q=kubernetes-release-pull&i=nope&files=&repos= i.e. those with --stage=gs://kubernetes-release-pull.*

Affected jobs: those with --provider=gce --extract=local --build=bazel should be fixed to use --extract=bazel and no --stage.

Short of having a canary job for every affected job, a reasonable and less disruptive compromise would be to start with the jobs that are non-blocking, e.g. https://testgrid.k8s.io/presubmits-kubernetes-nonblocking#pull-kubernetes-e2e-gce-alpha-features, and once those have been monitored and are healthy, fix the blocking jobs.
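
Sketched as the kubetest invocation change for the --build=bazel jobs (hedged; the real changes live in the presubmit YAML under config/jobs/, and the flags shown here are illustrative):

# before (hedged sketch; exact flags vary per job)
kubetest --build=bazel \
  --stage=gs://kubernetes-release-pull/ci/pull-kubernetes-e2e-gce \
  --extract=local \
  --provider=gce --test
# proposed: extract straight from the bazel output and drop --stage entirely
kubetest --build=bazel \
  --extract=bazel \
  --provider=gce --test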

amwat avatar Jan 27 '21 03:01 amwat

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot avatar Apr 27 '21 04:04 fejta-bot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten

fejta-bot avatar May 27 '21 05:05 fejta-bot

/remove-lifecycle rotten /milestone v1.22

spiffxp avatar Jun 09 '21 19:06 spiffxp

Updated the description with a list of jobs to update as generated by the test I'm adding in https://github.com/kubernetes/test-infra/pull/22890

go test -v -count=1 \
  ./config/tests/jobs/ \
  -run TestKubernetesPresubmitsShouldNotUseKubernetesReleasePullBucket \
| grep "should not" \
| cut -d: -f4 \
| sort | uniq \
| sed -e 's/^/- [ ]/'

spiffxp avatar Jul 14 '21 19:07 spiffxp

Opened https://github.com/kubernetes/test-infra/pull/22892 to go after the jobs where it will be most obvious if this somehow ends up having an unforeseen negative impact

spiffxp avatar Jul 14 '21 20:07 spiffxp

https://github.com/kubernetes/test-infra/pull/22892 broke things, ref: https://github.com/kubernetes/test-infra/pull/22892#issuecomment-880245096

The analysis in #18789 (comment) depended on bazel doing the local staging, but now that bazel is gone we seem to still be relying on the local-staging part of --stage for --extract=local.
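
A hedged sketch of the coupling that broke this attempt, assuming --extract=local reads the locally staged directory mentioned earlier (e.g. _output/gcs-stage; that path is an assumption here):

# --stage populates the local staging dir (and uploads it to GCS);
# --extract=local consumes the tars from that same local dir, so removing
# --stage alone leaves --extract=local with nothing to extract
ls _output/gcs-stage/ 2>/dev/null || echo "nothing staged locally; --extract=local would fail"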

Reverted in https://github.com/kubernetes/test-infra/pull/22894

spiffxp avatar Jul 15 '21 15:07 spiffxp

/milestone v1.23 Let's just wait until v1.22 goes out the door before we bother with this again

spiffxp avatar Jul 27 '21 03:07 spiffxp

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 28 '21 21:12 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jan 27 '22 21:01 k8s-triage-robot

/lifecycle frozen this is a requirement for smooth migration /help

BenTheElder avatar Jan 27 '22 22:01 BenTheElder

@BenTheElder: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this:

/lifecycle frozen this is a requirement for smooth migration /help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jan 27 '22 22:01 k8s-ci-robot

Based on https://github.com/kubernetes/test-infra/pull/22892#issuecomment-880245096 (and the changes being reverted), how should we proceed with this? I started working through this and realized I was re-doing @spiffxp's changes :smile:

jbpratt avatar Oct 23 '22 13:10 jbpratt

Hi, I am interested in working on this issue, but I have some questions:

  1. It seems like the list of jobs to be fixed is outdated.
  2. Please help me understand the fix we need here. I don't think the fix @amwat mentioned applies anymore, because we are no longer using bazel to build. So, as discussed in https://github.com/kubernetes/test-infra/pull/24238#issuecomment-961504149, can we now remove --extract and --stage? Please feel free to correct me if I am wrong.

cc @spiffxp @BenTheElder @ameukam

SD-13 avatar Dec 09 '23 20:12 SD-13

Sorry, a couple of the people you pinged don't work on this anymore, and I'm kinda buried.

I've lost context on this one.

BenTheElder avatar Apr 01 '24 23:04 BenTheElder

I'm not sure we ever got no-stage working? It's hard to follow at this point.

BenTheElder avatar Aug 06 '24 19:08 BenTheElder

https://github.com/kubernetes/test-infra/pull/28176 renamed the test job, testing in https://github.com/kubernetes/kubernetes/pull/126563

BenTheElder avatar Aug 06 '24 19:08 BenTheElder

It does: it will stage to a generated bucket under the rented boskos project (which the boskos janitors should clean up, if they don't already), so I think we can carefully start dropping these ... very belatedly.

BenTheElder avatar Aug 06 '24 20:08 BenTheElder

Beginning bulk migration in https://github.com/kubernetes/test-infra/pull/33259, starting with a subset of optional, non-blocking, not always_run jobs.

We have to drop both --extract=local and --stage at the same time. We don't need to locally extract what we just built; it's running fine and uploading to a bucket under the boskos project.
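
One way to sanity-check a migrated job config afterwards (a hedged sketch; the file path is just the gcp-gce.yaml linked earlier in the thread, and any migrated job file would do):

# a migrated job should no longer reference the shared bucket or --extract=local
grep -n -e 'kubernetes-release-pull' -e '--extract=local' \
  config/jobs/kubernetes/sig-cloud-provider/gcp/gcp-gce.yaml \
  || echo "no stale staging flags found"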

You can see sample runs in https://github.com/kubernetes/kubernetes/pull/126563

Inspect these logs: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/126563/pull-kubernetes-e2e-gce-cos-no-stage/1820909457454927872 https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/126563/pull-kubernetes-e2e-gce-pull-through-cache/1821254221408768000

BenTheElder avatar Aug 07 '24 20:08 BenTheElder