
eks-prow-build-cluster: Decommission the canary cluster

xmudrii opened this issue on Feb 19, 2024

We have something called the EKS Prow build canary cluster. Originally, it served as a test cluster for all sorts of changes, but we have been running into issues with it lately:

  • It costs money that we're effectively giving away to AWS, and we don't get that much value from the cluster
  • We're very limited in terms of what we can test there, because it's not actually a Prow build cluster and we can't easily run Prow jobs on it
  • It's a maintenance burden, and we don't have an equivalent cluster on GCP
  • Most importantly, it makes GitOps very hard. We now want to manage more manifests with Flux, but there are places where the cluster name and the AWS account ID have to be hard-coded in manifests, which is not easy to make work nicely with Flux (see the sketch right after this list)
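
For illustration only (this is not from the issue): one way Flux can handle per-cluster values like these is variable substitution via `postBuild.substituteFrom` on a Kustomization, pulling the cluster name and account ID from a per-cluster ConfigMap instead of hard-coding them in each manifest. The resource names and variable names below (`cluster-vars`, `CLUSTER_NAME`, `AWS_ACCOUNT_ID`) are hypothetical.

```yaml
# Hypothetical sketch: a Flux Kustomization that substitutes per-cluster values
# (cluster name, AWS account ID) from a ConfigMap instead of hard-coding them.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: prow-build-manifests        # placeholder name
  namespace: flux-system
spec:
  interval: 10m
  path: ./manifests                 # placeholder path
  prune: true
  sourceRef:
    kind: GitRepository
    name: k8s-io                    # placeholder source
  postBuild:
    substituteFrom:
      - kind: ConfigMap
        name: cluster-vars          # assumed to hold CLUSTER_NAME and AWS_ACCOUNT_ID
```

Manifests would then reference `${CLUSTER_NAME}` and `${AWS_ACCOUNT_ID}` and stay identical across clusters.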

We discussed this on Slack and decided to decommission the canary cluster. There are several tasks to take care of:

### Tasks
- [ ] Destroy the cluster and the infrastructure (`terraform destroy`)
- [ ] Decommission the AWS account
- [ ] Remove canary cluster from Terraform and Makefile
- [ ] Update docs to remove references to the canary cluster

/assign @xmudrii @koksay
/sig k8s-infra
/area infra infra/aws
/milestone v1.30
/priority important-soon

xmudrii avatar Feb 19 '24 17:02 xmudrii

1.30 is out :-)

BenTheElder avatar Apr 18 '24 02:04 BenTheElder

/milestone v1.31

xmudrii avatar Apr 18 '24 10:04 xmudrii

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 17 '24 10:07 k8s-triage-robot

/remove-lifecycle stale

koksay avatar Jul 17 '24 11:07 koksay

Maybe we don't decommission it, and use this cluster to test machine types instead?

BenTheElder avatar Jul 31 '24 21:07 BenTheElder

I thought about that as well, but it takes quite a lot of time to configure that cluster as a Prow build cluster (e.g. deploying permissions, fixing some other things, getting it up to date, connecting that AWS account and the GKE account). I think it would be much easier to create a new tainted node pool and then move some jobs over by adding tolerations and nodeSelectors, then monitor those jobs (a sketch of what that could look like follows below).
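
For illustration only (nothing in the thread spells this out): moving a job onto a tainted node pool would roughly mean adding a nodeSelector and a matching toleration to the job's pod spec in the Prow job config. The job name, image, and the `node-pool: canary` label/taint key below are hypothetical.

```yaml
# Hypothetical sketch: pinning one Prow job to a tainted node pool.
# Job name, image, and the node-pool label/taint are placeholders.
presubmits:
  kubernetes/kubernetes:
    - name: pull-kubernetes-example-canary   # placeholder job
      cluster: eks-prow-build-cluster
      spec:
        nodeSelector:
          node-pool: canary                  # assumed label on the new node pool
        tolerations:
          - key: node-pool                   # assumed taint on the new node pool
            operator: Equal
            value: canary
            effect: NoSchedule
        containers:
          - image: example.com/test-image:v1 # placeholder image
            command: ["./run-tests.sh"]      # placeholder command
```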

About this issue in general: this cluster always finds a way to prove useful, so I'd say we put this issue "on hold" (or even close it). It shouldn't be expensive at all, and if it helps us test and practice different things, I guess it's worth having.

xmudrii avatar Jul 31 '24 21:07 xmudrii

I thought that as well, but it takes quite a lot of time to configure that cluster to be a Prow build cluster (e.g. to deploy permissions, to fix some other stuff, get it up to date, connect that AWS account and the GKE account).

Right, but didn't we already do this here?

I think it would be much easier to use a new tainted node pool, and then just move some jobs by adding tolerations and nodeSelectors, then monitor those jobs.

That's true, though it's going to be more complicated/confusing for users, and harder (but doable) to unit test effectively in the Prow config.

I always regret when we make job configs even more verbose :(

BenTheElder avatar Jul 31 '24 21:07 BenTheElder

Right, but didn't we already do this here?

No, unfortunately. In the very early days we decided that we don't want to run jobs on this cluster, so it was never configured as a build cluster. It's just a "normal" EKS cluster where we can test different configuration changes. Maybe we can change this once the CI migration is done.

That's true, it's going to be more complicated/confusing for users and harder (but doable) to (prow config) unit test effectively though.

If this turns out to work, it'll become a part of the default node pool, so tolerations and nodeSelectors will not be needed. I can look into writing some unit tests for this along the way.

xmudrii avatar Jul 31 '24 21:07 xmudrii

Let's not proceed with this issue at the moment, we can reconsider it in the future. /close

xmudrii avatar Aug 27 '24 13:08 xmudrii

@xmudrii: Closing this issue.

In response to this:

Let's not proceed with this issue at the moment, we can reconsider it in the future. /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Aug 27 '24 13:08 k8s-ci-robot