
eks-prow-build-cluster: Decommission the canary cluster

xmudrii opened this issue on Feb 19, 2024

We have something called the EKS Prow build canary cluster. Originally, it served as a test cluster for all sorts of changes, but we have been running into issues with it lately:

  • It costs money that we're effectively giving away to AWS, and we don't get that much value from the cluster
  • We're very limited in terms of what we can test there, because it's not actually a Prow build cluster and we can't easily run Prow jobs on it
  • It's a maintenance burden, and we don't have an equivalent cluster on GCP
  • Most importantly, it makes GitOps very hard. We now want to manage more manifests with Flux, but there are places where the cluster name and the AWS account ID have to be hard-coded in manifests, which is not easy to make work nicely with Flux (see the sketch right after this list)
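
For illustration only (this is not from the issue): one way Flux can handle per-cluster values like these is variable substitution via `postBuild.substituteFrom` on a Kustomization, pulling the cluster name and account ID from a per-cluster ConfigMap instead of hard-coding them in each manifest. The resource names and variable names below (`cluster-vars`, `CLUSTER_NAME`, `AWS_ACCOUNT_ID`) are hypothetical.

```yaml
# Hypothetical sketch: a Flux Kustomization that substitutes per-cluster values
# (cluster name, AWS account ID) from a ConfigMap instead of hard-coding them.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: prow-build-manifests        # placeholder name
  namespace: flux-system
spec:
  interval: 10m
  path: ./manifests                 # placeholder path
  prune: true
  sourceRef:
    kind: GitRepository
    name: k8s-io                    # placeholder source
  postBuild:
    substituteFrom:
      - kind: ConfigMap
        name: cluster-vars          # assumed to hold CLUSTER_NAME and AWS_ACCOUNT_ID
```

Manifests would then reference `${CLUSTER_NAME}` and `${AWS_ACCOUNT_ID}` and stay identical across clusters.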

We discussed this on Slack and decided to decommission the canary cluster. There are several tasks to take care of:

### Tasks
- [ ] Destroy the cluster and the infrastructure (`terraform destroy`)
- [ ] Decommission the AWS account
- [ ] Remove canary cluster from Terraform and Makefile
- [ ] Update docs to remove references to the canary cluster

/assign @xmudrii @koksay
/sig k8s-infra
/area infra infra/aws
/milestone v1.30
/priority important-soon

xmudrii avatar Feb 19 '24 17:02 xmudrii

1.30 is out :-)

BenTheElder avatar Apr 18 '24 02:04 BenTheElder

/milestone v1.31

xmudrii avatar Apr 18 '24 10:04 xmudrii

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 17 '24 10:07 k8s-triage-robot

/remove-lifecycle stale

koksay avatar Jul 17 '24 11:07 koksay

Maybe we don't decommission it, and use this cluster to test machine types instead?

BenTheElder avatar Jul 31 '24 21:07 BenTheElder

I thought about that as well, but it takes quite a lot of time to configure that cluster as a Prow build cluster (e.g. deploying permissions, fixing some other things, getting it up to date, connecting that AWS account and the GKE account). I think it would be much easier to create a new tainted node pool and then move some jobs over by adding tolerations and nodeSelectors, then monitor those jobs (a sketch of what that could look like follows below).
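
For illustration only (nothing in the thread spells this out): moving a job onto a tainted node pool would roughly mean adding a nodeSelector and a matching toleration to the job's pod spec in the Prow job config. The job name, image, and the `node-pool: canary` label/taint key below are hypothetical.

```yaml
# Hypothetical sketch: pinning one Prow job to a tainted node pool.
# Job name, image, and the node-pool label/taint are placeholders.
presubmits:
  kubernetes/kubernetes:
    - name: pull-kubernetes-example-canary   # placeholder job
      cluster: eks-prow-build-cluster
      spec:
        nodeSelector:
          node-pool: canary                  # assumed label on the new node pool
        tolerations:
          - key: node-pool                   # assumed taint on the new node pool
            operator: Equal
            value: canary
            effect: NoSchedule
        containers:
          - image: example.com/test-image:v1 # placeholder image
            command: ["./run-tests.sh"]      # placeholder command
```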

About this issue in general: this cluster always finds a way to prove useful, so I'd say we put this issue "on hold" (or even close it). It shouldn't be expensive at all, and if it helps us test and practice different things, I guess it's worth having.

xmudrii avatar Jul 31 '24 21:07 xmudrii

I thought that as well, but it takes quite a lot of time to configure that cluster to be a Prow build cluster (e.g. to deploy permissions, to fix some other stuff, get it up to date, connect that AWS account and the GKE account).

Right, but didn't we already do this here?

I think it would be much easier to use a new tainted node pool, and then just move some jobs by adding tolerations and nodeSelectors, then monitor those jobs.

That's true, though it's going to be more complicated/confusing for users, and harder (but doable) to unit test effectively in the Prow config.

I always regret when we make job configs even more verbose :(

BenTheElder avatar Jul 31 '24 21:07 BenTheElder

Right, but didn't we already do this here?

No, unfortunately. In the very early days we decided that we don't want to run jobs on this cluster, so it was never configured as a build cluster. It's just a "normal" EKS cluster where we can test different configuration changes. Maybe we can change this once the CI migration is done.

That's true, it's going to be more complicated/confusing for users and harder (but doable) to (prow config) unit test effectively though.

If this turns out to work, it'll become a part of the default node pool, so tolerations and nodeSelectors will not be needed. I can look into writing some unit tests for this along the way.

xmudrii avatar Jul 31 '24 21:07 xmudrii

Let's not proceed with this issue at the moment, we can reconsider it in the future. /close

xmudrii avatar Aug 27 '24 13:08 xmudrii

@xmudrii: Closing this issue.

In response to this:

Let's not proceed with this issue at the moment, we can reconsider it in the future. /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Aug 27 '24 13:08 k8s-ci-robot