migrate away from `test-infra-trusted` build cluster

Open ameukam opened this issue 1 year ago • 14 comments

There are a few jobs still running on the test-infra-trusted cluster that we should either migrate to k8s-infra-prow-build-trusted or remove:

  • [ ] post-test-infra-push-git
  • [ ] post-test-infra-push-git-custom-k8s-auth
  • [ ] post-test-infra-upload-testgrid-config
  • [ ] ci-test-infra-update-slack-oncall
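One way to keep this checklist current as jobs move is to search the job configs for the build cluster name. A minimal sketch, assuming a local checkout of kubernetes/test-infra with jobs under config/jobs, and assuming each job names its build cluster on a plain `cluster: <name>` line; the helper name `jobs_on_cluster` is made up for illustration:

```shell
# Hypothetical helper: list every line under DIR that pins a job to CLUSTER.
# Assumes Prow job YAML sets the build cluster as an unquoted "cluster: <name>".
jobs_on_cluster() {
  dir=$1
  cluster=$2
  # -r: recurse, -n: show line numbers, -E: extended regex.
  # Anchoring both ends avoids matching longer names that merely contain CLUSTER.
  grep -rnE "^[[:space:]]*cluster:[[:space:]]*${cluster}[[:space:]]*\$" "$dir"
}

# Example (run from the root of a test-infra checkout):
#   jobs_on_cluster config/jobs test-infra-trusted
```

Quoted values or trailing comments on the `cluster:` line would be missed by this pattern, so treat the output as a starting point rather than a complete inventory.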

ameukam avatar Apr 11 '24 16:04 ameukam

/assign @michelle192837
/sig testing

ameukam avatar Apr 11 '24 16:04 ameukam

ci-test-infra-update-slack-oncall

no point migrating this, we'll just shut it down when prow is migrated, and people can post in #testing-ops in Slack instead.

we should actually probably proactively stop advertising @test-infra-oncall to the broader project.

post-test-infra-upload-testgrid-config

.... uhhhh this one I'm not sure, because we have to be able to publish to testgrid's config bucket .... migrating testgrid is another fun topic

The image publishing jobs we should be able to move over.

BenTheElder avatar Apr 17 '24 19:04 BenTheElder

re: ci-test-infra-update-slack-oncall: Ah, that's easier then.

re: post-test-infra-upload-testgrid-config: I think this should be doable. I have not gone through the full details, but imo thanks to config merger merging configs for TestGrid from multiple locations, we can stand up a new config upload job in community-owned infra, verify the uploaded config in the new location is the same as the old, and swap the config location used in the TestGrid instance overall.

michelle192837 avatar Apr 17 '24 20:04 michelle192837

On the K8s infra side we're going to need a bucket for this to start then, cc @upodroid @ameukam for thoughts.

post-test-infra-push-git post-test-infra-push-git-custom-k8s-auth

Not sure how these didn't wind up getting migrated yet ... looks like this is part of k8s-testimages https://github.com/kubernetes/k8s.io/issues/1523

I don't see evidence that we're actually using these images in Kubernetes and we should probably just delete them.

Prow has built-in known-hosts handling in clonerefs these days, I don't think we need these anymore.

BenTheElder avatar Jun 21 '24 01:06 BenTheElder

Sorry for the delay, I'm looking into this and some of the other unmigrated jobs today.

michelle192837 avatar Jun 21 '24 17:06 michelle192837

in #32808 the list should be clearer now, a lot of these are related to running prow so that's fine, but some are pushing images and that's concerning, we should either eliminate or migrate them.

BenTheElder avatar Jun 21 '24 17:06 BenTheElder

here's one https://github.com/kubernetes/test-infra/pull/32812

BenTheElder avatar Jun 21 '24 17:06 BenTheElder

| File Path | Job | Link |
| --- | --- | --- |
| config/jobs/kubernetes/test-infra/test-infra-periodics.yaml | job-migration-todo-report | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | ci-test-infra-autobump-prow-for-auto-deploy | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | ci-test-infra-autobump-prow | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | ci-test-infra-update-slack-oncall | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | ci-test-infra-branchprotector | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | ci-test-infra-label-sync | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | ci-test-infra-gencred-refresh-kubeconfig | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | ci-test-infra-rotate-legacy-default-build-sa-json-key | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-alpine | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-gcloud-terraform | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-git | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-git-custom-k8s-auth | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-deploy-prow | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-reconcile-hmacs | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-misc-images | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-kettle | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-bazel | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-gcb-docker-gcloud | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-test-gubernator | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-gencred | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-gencred-refresh-kubeconfig | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-upload-oncall | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-upload-testgrid-config | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-upload-boskos-config | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-cip-prow | Search Results |

SIG Contribex:

| File Path | Job | Link |
| --- | --- | --- |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-community-tempelis-apply | Search Results |

Not trusted cluster, but the other non-migrated jobs with test-infra in the name (there could be more) ...

| File Path | Job | Link |
| --- | --- | --- |
| config/jobs/kubernetes/test-infra/janitors.yaml | maintenance-pull-janitor | Search Results |
| config/jobs/kubernetes/test-infra/janitors.yaml | maintenance-ci-aws-janitor | Search Results |
| config/jobs/kubernetes/test-infra/janitors.yaml | maintenance-ci-janitor | Search Results |

BenTheElder avatar Jun 21 '24 17:06 BenTheElder

Janitor jobs: won't be migrated, will be turned down.

post-test-infra-upload-oncall, ci-test-infra-update-slack-oncall: no need, this will be obsolete.

job-migration-todo-report: will be obsolete, also this isn't working correctly and we're just manually checking in the tool output, I'll clean this one up.

ci-test-infra-rotate-legacy-default-build-sa-json-key: will be obsolete

post-test-infra-upload-boskos-config: will be obsolete, we have a different boskos config in github.com/kubernetes/k8s.io for community boskos resources

post-test-infra-cip-prow: I deleted this in #32812

post-test-infra-push.* are concerning. post-test-infra-upload-testgrid-config will need migrating

I'm guessing reconcile-hmacs needs to be considered as part of the control plane migration, and branchprotector definitely does.

BenTheElder avatar Jun 21 '24 18:06 BenTheElder

https://github.com/kubernetes/test-infra/pull/32814 will remove the job-migration-todo-report job.

ci-test-infra-label-sync should be able to migrate to k8s-infra-prow-build-trusted without waiting for the rest of prow, but we might not have the right secrets available yet.

BenTheElder avatar Jun 21 '24 18:06 BenTheElder

On the K8s infra side we're going to need a bucket for this to start then, cc @upodroid @ameukam for thoughts.

post-test-infra-push-git post-test-infra-push-git-custom-k8s-auth

Not sure how these didn't wind up getting migrated yet ... looks like this is part of k8s-testimages kubernetes/k8s.io#1523

I don't see evidence that we're actually using these images in Kubernetes and we should probably just delete them.

Prow has built-in known-hosts handling in clonerefs these days, I don't think we need these anymore.

These are used as the base images for building Prow images (https://cs.k8s.io/?q=gcr.io%2Fk8s-prow%2Fgit&i=nope&files=&excludeFiles=&repos=). I think we can replace the git image with alpine, but git-custom-k8s-auth might need to stay?
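A local equivalent of that code search can help when checking a single checkout before swapping a base image. This is only a sketch: the helper name `image_refs` is hypothetical, and it searches whatever directory it is given rather than every repo that cs.k8s.io indexes.

```shell
# Hypothetical helper: list files under DIR that mention IMAGE verbatim.
image_refs() {
  dir=$1
  image=$2
  # -F treats IMAGE as a fixed string, so "." is not a regex wildcard.
  # -l prints only file names; sort keeps the output stable between runs.
  grep -rlF -- "$image" "$dir" | sort
}

# Example (run from the root of a checkout):
#   image_refs . gcr.io/k8s-prow/git
```

Like the search query above, the string gcr.io/k8s-prow/git also matches references to git-custom-k8s-auth, so the two images need to be told apart by reading the matches.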

michelle192837 avatar Jun 21 '24 19:06 michelle192837

| Job | Link | Uses |
| --- | --- | --- |
| post-test-infra-push-alpine | Search Results | Search Results |
| post-test-infra-push-gcloud-terraform | Search Results | Search Results |
| post-test-infra-push-git | Search Results | Search Results |
| post-test-infra-push-git-custom-k8s-auth | Search Results | Search Results |
| post-test-infra-push-misc-images | Search Results | Search Results |
| post-test-infra-push-kettle | Search Results | Search Results |
| post-test-infra-push-bazel | Search Results | Search Results |
| post-test-infra-push-gcb-docker-gcloud | Search Results | Search Results |
| post-test-infra-push-test-gubernator | Search Results | Search Results |
| post-test-infra-push-gencred | Search Results | Search Results |

Several of these push images that aren't used and should be turned down (post-test-infra-push-test-gubernator, post-test-infra-push-bazel, post-test-infra-push-gcloud-terraform, post-test-infra-push-gencred).

  • Note that post-test-infra-push-gencred hasn't succeeded and pushes to k8s-testimages, which is not what jobs are using; jobs use the image pushed to k8s-prow by post-test-infra-push-misc-images.

michelle192837 avatar Jun 21 '24 19:06 michelle192837

Discussed offline: for post-test-infra-push-git and post-test-infra-push-git-custom-k8s-auth, since we'll need to migrate the latter anyways, we can migrate the former at the same time, then see if we can replace the git image base with alpine instead.

michelle192837 avatar Jun 25 '24 17:06 michelle192837

then see if we can replace the git image base with alpine instead.

we should probably use something else, we generally prefer to use e.g. debian/distroless for kubernetes base images, for licensing reasons (alpine/busybox) and alignment on patching etc.

BenTheElder avatar Jun 25 '24 22:06 BenTheElder

I'm working on tempelis https://kubernetes.slack.com/archives/C4M06S5HS/p1719431441099159 https://github.com/kubernetes/test-infra/pull/32928

BenTheElder avatar Jul 08 '24 21:07 BenTheElder

Sorry for the late response. I can confirm that git-custom-k8s-auth is used by prow to authenticate to non-GKE clusters (currently it's only EKS)

ameukam avatar Jul 10 '24 07:07 ameukam

https://github.com/kubernetes-sigs/prow/blob/main/.ko.yaml

+1 for building a unified base image for prow that has git and the kubectl auth plugins for our cloud vendors

upodroid avatar Jul 10 '24 10:07 upodroid

We can migrate that job to the community cluster and update the .ko.yaml references

upodroid avatar Jul 10 '24 10:07 upodroid

We can do something similar to the distroless-iptables image in k/release.

BenTheElder avatar Jul 10 '24 15:07 BenTheElder

tempelis will be done after #32946

BenTheElder avatar Jul 10 '24 20:07 BenTheElder

https://github.com/kubernetes/test-infra/pull/32948 will do label sync

BenTheElder avatar Jul 10 '24 23:07 BenTheElder

| File Path | Job | Link | Uses |
| --- | --- | --- | --- |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-alpine | Search Results | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-gcb-docker-gcloud | Search Results | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-git | Search Results | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-git-custom-k8s-auth | Search Results | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-kettle | Search Results | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-misc-images | Search Results | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-upload-testgrid-config | Search Results | |

With the linked PRs, we should have a canary job for all these jobs. Once these are submitted and we have new images for all of them, I'll switch the relevant uses to use the k8s-staging-test-infra images instead, and turn down the old image pushing jobs.

(The TestGrid config switch is a bit more involved but not much more. I just need to swap what config is referenced in the mergelists after verifying the new is the same as the old, and that config merger has permissions to read from the new bucket. I'll look into that now.)

michelle192837 avatar Jul 19 '24 19:07 michelle192837

After today's SIG meeting I eliminated the oncall update jobs (slack, GCS) #33083 #33084

We should probably pre-emptively migrate ci-test-infra-branchprotector to the new trusted cluster.

BenTheElder avatar Jul 23 '24 22:07 BenTheElder

migrating branch protector looks straightforward, will send a PR in a little bit.

BenTheElder avatar Jul 23 '24 22:07 BenTheElder

https://github.com/kubernetes/test-infra/pull/33098 takes care of the branch protector.

That leaves:

  • assorted image pushing (most of which are in https://github.com/kubernetes/test-infra/issues/32432#issuecomment-2240023962, but not e.g. the prow image push)
  • testgrid config upload
  • assorted boskos / janitor jobs we don't need to migrate (but when we move prow, we should make sure we clean up the boskos pools one more time afterwards; that's a fun new migration wrinkle I just thought of @michelle192837 @dims @upodroid @ameukam ... Googlers will probably have to run that for us, but we can prepare the commands/scripts ...)
  • prow auto-deploy (we'll replace this? discussed in the meeting yesterday)
  • prow kubeconfig / credential rotation jobs (shouldn't need to migrate these?)

So when we move prow we'll also have a small list of jobs to disable and we should probably prepare that.

These are the main remaining jobs, aside from the following, which are out of scope here:

  • vsphere using jobs
  • azure using jobs (in progress...)
  • scale test presubmit and related janitor jobs (TODO, low prio)
  • something else we may have missed writing it off as one of the above (e.g. recently discovered https://github.com/kubernetes/test-infra/pull/33091)

So we should definitely focus on these while Azure folks work on migrating those.

I've also noticed that we'll have to be careful updating the prow deployment specs for the new cluster, because e.g. we gave the secrets clearer names and a different path for the github token.

BenTheElder avatar Jul 24 '24 17:07 BenTheElder

IMHO, we can remove post-test-infra-upload-boskos-config. We no longer need to grow the boskos pool, and we may need to shut down the GCP projects that are part of it.

ameukam avatar Jul 26 '24 15:07 ameukam

Fixing TestGrid upload job today and cleaning up some of the image jobs/references.

michelle192837 avatar Jul 26 '24 18:07 michelle192837

IMHO, we can remove post-test-infra-upload-boskos-config. We no longer need to grow the boskos pool, and we may need to shut down the GCP projects that are part of it.

agreed, filed https://github.com/kubernetes/test-infra/pull/33121

BenTheElder avatar Jul 26 '24 19:07 BenTheElder

TestGrid upload progress:

  • Job works now! https://testgrid.k8s.io/sig-testing-maintenance#testgrid-config-update-canary
  • Verified the contents are the same:
# See https://github.com/GoogleCloudPlatform/testgrid/tree/main/config/print#config-printer for the print utility.
~/go/bin/print gs://k8s-testgrid/configs/k8s/config > k8s-testgrid-config.textproto
~/go/bin/print gs://k8s-testgrid-config/k8s/config > k8s-infra-testgrid-config.textproto

diff k8s-testgrid-config.textproto k8s-infra-testgrid-config.textproto
# This produces no diffs

(And these do have contents):

wc -l k8s-testgrid-config.textproto 
519759 k8s-testgrid-config.textproto

wc -l k8s-infra-testgrid-config.textproto 
519759 k8s-infra-testgrid-config.textproto
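
The two checks above (identical contents, and not merely two empty files) could be folded into one step for repeat runs. A rough sketch, assuming the dumps have already been written to local files as shown; `configs_match` is a hypothetical name, not part of the TestGrid tooling:

```shell
# Hypothetical helper: succeed only if both dumps are non-empty and identical.
configs_match() {
  old=$1
  new=$2
  # [ -s FILE ] is true when FILE exists and is larger than zero bytes,
  # which guards against two empty dumps "matching".
  [ -s "$old" ] && [ -s "$new" ] || return 1
  # cmp -s compares byte by byte and reports only through its exit status.
  cmp -s "$old" "$new"
}

# Example, using the file names from the commands above:
#   configs_match k8s-testgrid-config.textproto k8s-infra-testgrid-config.textproto && echo same
```

Unlike diff, this does not show where the files differ, so diff is still the tool to reach for once the check fails.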

Now following the config merger instructions at https://github.com/kubernetes/test-infra/blob/master/testgrid/merging.md#config-merger. I'll have a few PRs out for those.

michelle192837 avatar Jul 26 '24 23:07 michelle192837

Remaining from my list above:

| File Path | Job | Link | Uses |
| --- | --- | --- | --- |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-alpine | Search Results | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-git | Search Results | Search Results |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-misc-images | Search Results | Search Results |

post-test-infra-push-alpine just needs minor cleanup, then it can be deleted. post-test-infra-push-git can probably be deleted; the remaining use of it is as the base for certain Prow images. I can't switch them over immediately (integration tests fail when switching from the January image to a recent July image), but I believe switching to an image from the old location will have the same problem. post-test-infra-push-misc-images needs a fix (I think the most recent PR will fix it, but it needs a retrigger to verify that's the case), then the images need to be switched to the new location before the old job is turned down.

(And as a last bit of cleanup: move all the new image push jobs to the image-pushes dashboard and remove '-canary' from the job names.)
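
For the rename, the target names can be derived mechanically from the canary names. A small sketch, assuming every canary job simply ends in "-canary"; `strip_canary` is a hypothetical helper, and the job configs themselves would still need editing by hand or with a YAML-aware tool:

```shell
# Hypothetical helper: drop a trailing "-canary" from a job name, if present.
strip_canary() {
  # ${1%-canary} removes the suffix when it matches; names without the
  # suffix pass through unchanged.
  printf '%s\n' "${1%-canary}"
}

# Example:
#   strip_canary testgrid-config-update-canary
```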

michelle192837 avatar Jul 30 '24 17:07 michelle192837