
Document CI cluster selection, CPU : RAM ratio / machine types, and general recommendations specific to prow.k8s.io

Open BenTheElder opened this issue 10 months ago • 9 comments

What would you like to be added:

We don't have a single place to point to regarding which cluster you should use and why, or how many resources to request (and how to avoid pointlessly scheduling minuscule amounts of memory per CPU core, which ultimately costs us more when workloads prefer more CPU time and the memory sits unused).
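As an illustration of the ratio point: on a highmem node shape with roughly a 1 vCPU : 8 GB ratio, requests that match the node's shape pack cleanly, while CPU-heavy / memory-light requests strand memory. The numbers below are illustrative, not current policy:

```yaml
# Illustrative pod spec fragment, assuming a ~1 vCPU : 8 GB node shape.
# Requesting 4 CPUs with only 2Gi of memory would leave most of the
# node's memory allocated to nobody; requesting roughly in the node's
# ratio keeps bin-packing tight.
spec:
  containers:
    - name: test
      resources:
        requests:
          cpu: "4"
          memory: 32Gi   # ~1:8, matching the node shape
        limits:
          cpu: "4"
          memory: 32Gi
```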

We should do this per-cluster and create a doc somewhere discoverable, perhaps under config/jobs.

We should also consider adding details like:

  • kubekins / CI image recommendations
    • docker in docker
  • Additional pointers to the hacks we have employed in the clusters (like pre-allocating loop devices, tuning sysctls ...).

Why is this needed:

So contributors can understand the Kubernetes specific CI environment and how to effectively schedule to it / write prow.k8s.io specific jobs.

/sig testing k8s-infra @kubernetes/sig-k8s-infra-leads @kubernetes/sig-testing-leads


These are really not discoverable:

https://github.com/kubernetes/k8s.io/blob/86089ae44dd87d86fa1a2a651bb0d6f4ceb06270/infra/aws/terraform/prow-build-cluster/terraform.prod.tfvars#L39C32-L39C44

https://github.com/kubernetes/k8s.io/blob/86089ae44dd87d86fa1a2a651bb0d6f4ceb06270/infra/gcp/terraform/k8s-infra-prow-build/main.tf#L101

Along with "what is the trusted cluster" etc.

We should also deprecate the eks-job-migration doc and associated job report results, and we should consider how to balance scheduling to EKS/GKE more generally now that the budgets are similar and all the workloads are running in community accounts. (And also how to approach Azure with its much smaller budget ...)

BenTheElder avatar Jan 13 '25 23:01 BenTheElder

We should also link out to https://monitoring-eks.prow.k8s.io/?orgId=1 and https://monitoring-gke.prow.k8s.io/?orgId=1 for checking actual resource usage.

BenTheElder avatar Jan 15 '25 19:01 BenTheElder

@kubernetes/sig-k8s-infra-leads how far are we from being able to tell CI users about the CI cluster machine shapes?

BenTheElder avatar Feb 26 '25 19:02 BenTheElder

I know we wanted to reconsider the EKS machine types, and also on GCP it's not clear if highmem actually makes sense with our current workloads, though any changes would have to be done carefully to avoid breaking jobs that already implicitly depend on the machine sizes.

BenTheElder avatar Feb 26 '25 19:02 BenTheElder

@kubernetes/sig-k8s-infra-leads how far are we from being able to tell CI users about the CI cluster machine shapes?

I think once we are done with instance type selection. We can also communicate what we already have and later update the docs once the instances are switched.

ameukam avatar Feb 26 '25 19:02 ameukam

@xmudrii @upodroid ?

This is coming up all the time, we should decide and either commit to documenting the EKS cluster as-is, or move on changing it.

It's really difficult to answer these questions about resources available, which cluster to use, etc. currently.

BenTheElder avatar Mar 26 '25 18:03 BenTheElder

or move on changing it.

I'll push on that right after KubeCon, too busy at the moment. 🙃

xmudrii avatar Mar 26 '25 23:03 xmudrii

I recommend the following:

We introduce the following labels to pick the size of the pod for the community users:

  • prow.k8s.io/machine-size: small - 2 cores and 8 GB of memory
  • prow.k8s.io/machine-size: medium - 4 cores and 16 GB of memory
  • prow.k8s.io/machine-size: large - 7 cores and 32 GB of memory

If the pod requires node isolation, we can also add this label:

prow.k8s.io/dedicated: true|false
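Assuming the labels above, a job owner would opt in via the ProwJob's pod labels. This snippet is a hypothetical illustration of the proposal, not an existing job:

```yaml
# Hypothetical presubmit using the proposed labels (names are illustrative).
presubmits:
  kubernetes/kubernetes:
    - name: pull-kubernetes-example
      labels:
        prow.k8s.io/machine-size: medium  # proposal: 4 cores / 16 GB
        prow.k8s.io/dedicated: "false"
      spec:
        containers:
          - image: gcr.io/k8s-staging-test-infra/kubekins-e2e:latest-master
            command: [runner.sh]
```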

We can use kyverno to mutate pods by:

  1. Inserting the resource limits to match the specs we publish
  2. Inserting the pod affinity rules to ensure the prow pod occupies a single node without other build pods.

https://github.com/kubernetes/k8s.io/blob/main/kubernetes/ibm-ppc64le/prow/kyverno.yaml
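A minimal sketch of what such a Kyverno mutation could look like for one size tier, keyed off the proposed prow.k8s.io/machine-size label (policy name and values are illustrative):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: prow-machine-size  # illustrative name
spec:
  rules:
    - name: set-medium-resources
      match:
        any:
          - resources:
              kinds: [Pod]
              selector:
                matchLabels:
                  prow.k8s.io/machine-size: medium
      mutate:
        patchStrategicMerge:
          spec:
            containers:
              # Kyverno anchor: apply the patch to every container.
              - (name): "*"
                resources:
                  requests: {cpu: "4", memory: 16Gi}
                  limits: {cpu: "4", memory: 16Gi}
```

A similar rule adding pod anti-affinity could cover the prow.k8s.io/dedicated: "true" case.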

Typical CI machine sizes for comparison:

  • https://docs.gitlab.com/ci/runners/hosted_runners/linux/#machine-types-available-for-linux---x86-64
  • https://cloud.google.com/build/pricing
  • https://circleci.com/pricing/price-list/

The instance types now become something internal to k8s infra and we can pick the following instances:

  • GKE

    • c4-highmem-8 for amd64 pods
    • c4a-highmem-8 for arm64 pods
  • EKS

    • r7a.2xlarge for amd64 pods
    • TBD for arm64 pods

upodroid avatar Mar 30 '25 13:03 upodroid

@upodroid seems reasonable to me! thanks

dims avatar Mar 30 '25 13:03 dims

I don't generally recommend depending on mutating admission webhooks ... that part gives me pause

BenTheElder avatar Mar 31 '25 21:03 BenTheElder

Probably best to follow up on this when 1.35 opens up, or else just document the current state.

The longer we don't change the clusters the more we might as well document them as-is.

BenTheElder avatar Jul 23 '25 20:07 BenTheElder

At this time it's best to wait for 1.35, prepare the changes by then, and apply them shortly after the Code Thaw.

xmudrii avatar Jul 23 '25 20:07 xmudrii

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 21 '25 21:10 k8s-triage-robot

/remove-lifecycle stale

xmudrii avatar Oct 21 '25 22:10 xmudrii