eks-prow-build-cluster: Decommission the canary cluster
We have something called the EKS Prow build canary cluster. Originally, it was meant as a test cluster for all sorts of changes, but we have been running into issues with it lately:
- It costs money that we're effectively giving away to AWS, and we don't get much value from the cluster
- We're very limited in what we can test there because it's not actually a Prow cluster, so we can't easily run Prow jobs on it
- It's a maintenance burden, and we don't have an equivalent cluster on GCP
- Most importantly, it makes GitOps very hard. We want to manage more manifests with Flux, but there are places where we need to put the cluster name and the AWS account ID in manifests, which is not easy to get working with Flux (see the illustrative sketch below)
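To illustrate the Flux pain point, here is a minimal, purely hypothetical sketch (the ServiceAccount, role ARN, GitRepository, and variable values are all made up, not taken from the real cluster config) of where the AWS account ID tends to end up in manifests, and the kind of per-cluster post-build substitution Flux would need to keep such values out of Git:

```yaml
# Hypothetical manifest that forces the account ID into Git: an IRSA-style
# ServiceAccount annotation whose role ARN embeds the per-account ID.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: example-controller
  namespace: example-system
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/example-controller
---
# A common Flux workaround is post-build variable substitution, but it still
# means maintaining per-cluster values (ACCOUNT_ID, CLUSTER_NAME) somewhere.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: example
  namespace: flux-system
spec:
  interval: 10m
  path: ./clusters/example
  prune: true
  sourceRef:
    kind: GitRepository
    name: k8s-infra
  postBuild:
    substitute:
      ACCOUNT_ID: "111122223333"
      CLUSTER_NAME: "eks-canary"
```

Every extra cluster/account multiplies these per-cluster values, which is the overhead the canary cluster adds for little benefit.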
We discussed this on Slack and decided to decommission the canary cluster. There are several tasks to take care of:
### Tasks
- [ ] Destroy the cluster and the infrastructure (`terraform destroy`)
- [ ] Decommission the AWS account
- [ ] Remove canary cluster from Terraform and Makefile
- [ ] Update docs to remove references to the canary cluster
/assign @xmudrii @koksay
/sig k8s-infra
/area infra infra/aws
/milestone v1.30
/priority important-soon
1.30 is out :-)
/milestone v1.31
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Maybe we don't decommission it and instead use it to test machine types?
I thought that as well, but it takes quite a lot of time to configure that cluster to be a Prow build cluster (e.g. to deploy permissions, to fix some other stuff, get it up to date, connect that AWS account and the GKE account). I think it would be much easier to use a new tainted node pool, and then just move some jobs by adding tolerations and nodeSelectors, then monitor those jobs.
About this issue in general, this cluster always finds a way to prove useful, so I'd say we put this issue "on hold" (or even close it). It shouldn't be expensive at all, and if it helps us test and practice different things, I guess it's worth having.
> I thought that as well, but it takes quite a lot of time to configure that cluster to be a Prow build cluster (e.g. to deploy permissions, to fix some other stuff, get it up to date, connect that AWS account and the GKE account).
Right, but didn't we already do this here?
> I think it would be much easier to use a new tainted node pool, and then just move some jobs by adding tolerations and nodeSelectors, then monitor those jobs.
That's true, though it's going to be more complicated/confusing for users, and harder (but doable) to unit test effectively in the Prow config.
I always regret when we make job configs even more verbose :(
> Right, but didn't we already do this here?
No, unfortunately. We decided in the very early days that we don't want to run jobs on this cluster, so it was never configured as a build cluster. It's just a "normal" EKS cluster where we can test different configuration changes. Maybe we can change this once the CI migration is done.
> That's true, though it's going to be more complicated/confusing for users, and harder (but doable) to unit test effectively in the Prow config.
If this turns out to work, it'll become a part of the default node pool, so tolerations and nodeSelectors will not be needed. I can look into writing some unit tests for this along the way.
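To make the tolerations/nodeSelector idea concrete, here is a rough sketch of what moving a single periodic job onto a hypothetical tainted node pool could look like in the Prow job config. The pool label, taint key, job name, and image/command are made up for illustration and are not the actual configuration:

```yaml
# Sketch only: a periodic pinned to a hypothetical tainted node pool so it can
# exercise the new machine type without affecting other jobs.
periodics:
  - name: example-canary-machine-type-test   # hypothetical job name
    interval: 24h
    cluster: eks-prow-build-cluster
    decorate: true
    spec:
      nodeSelector:
        node-pool: canary-machine-type       # hypothetical label on the new pool
      tolerations:
        - key: dedicated                     # hypothetical taint on the new pool
          operator: Equal
          value: canary-machine-type
          effect: NoSchedule
      containers:
        - image: gcr.io/k8s-staging-test-infra/krte:latest   # illustrative image
          command:
            - runner.sh
          args:
            - ./hack/example-test.sh          # illustrative test entrypoint
```

If the new machine type works out and becomes the default node pool, the `nodeSelector` and `tolerations` stanzas could simply be dropped again, which is the case mentioned above where they are no longer needed.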
Let's not proceed with this issue at the moment; we can reconsider it in the future.
/close
@xmudrii: Closing this issue.
In response to this:
> Let's not proceed with this issue at the moment; we can reconsider it in the future.
> /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.