
Migrate away from n1-standard-{1-4} for k8s e2e testing

kannon92 opened this issue 6 months ago · 8 comments

N1 machines are slower, and the suggestion is to start migrating the machines to E2. E2s are faster and more available, so this should help mitigate stockout errors.

kannon92 · Jun 04 '25 21:06

cc @BenTheElder @ameukam

/sig node

kannon92 · Jun 04 '25 21:06

/sig testing

SergeyKanzhelev · Jun 04 '25 21:06

@BenTheElder who is the best person to decide on this? Ultimately, for most tests we do not care about the VM shape.

SergeyKanzhelev · Jun 04 '25 21:06

/sig k8s-infra

@upodroid was looking into these trade-offs recently.

IMHO, we should move to newer machine types over time unless it's way more expensive.

We can also move to more expensive machine types when we get a meaningful performance win. For example, the CI cluster is moving to C4 and C4D because they let things like builds go a fair bit faster (20-25% for some workloads, IIRC), but I wouldn't move everything yet: there are some availability issues, and the CI cluster can keep some warm minimum scaled capacity there while e2e VMs cannot.

We also should probably be careful with jobs like scale tests where giving the test different machines might impact detecting performance regressions.

For workloads that aren't very sensitive, e2 seems reasonable: https://cloud.google.com/compute/docs/machine-resource#recommendations_for_machine_types

BenTheElder · Jun 04 '25 22:06

We should fix the node GCE launcher to do the following:

  • pass a region instead of a zone, plus a list of machine types to use, in priority order
  • us-central1 should be the preferred CI region: it's closest to the CI cluster, has the most VM types available, and is one of the largest regions in GCP
  • it should attempt to launch the instance in a random zone of the region with the first machine type on the list; if that fails, attempt to launch the VM in a different zone with the same machine type, and then retry the whole step with a different machine type (see the sketch after this list)
  • after 4 attempts, we should consider the job failed and return the last error
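To make the retry policy concrete, here is a minimal Go sketch of that loop. The names (`tryCreateInstance`, `launchWithFallback`) are hypothetical, and shelling out to `gcloud` stands in for whatever the real launcher uses; treat this as a sketch of the policy, not the implementation.

```go
package main

import (
	"fmt"
	"math/rand"
	"os/exec"
)

const maxAttempts = 4

// tryCreateInstance shells out to gcloud to create a VM (assumed helper).
func tryCreateInstance(name, zone, machineType string) error {
	cmd := exec.Command("gcloud", "compute", "instances", "create", name,
		"--zone", zone, "--machine-type", machineType)
	return cmd.Run()
}

// launchWithFallback walks the preferred machine types in order and, for
// each one, tries the region's zones in random order, giving up after
// maxAttempts total launch attempts.
func launchWithFallback(name string, zones, machineTypes []string) error {
	attempts := 0
	var lastErr error
	for _, mt := range machineTypes {
		// Shuffle zones so we don't always hammer the same zone first.
		shuffled := append([]string(nil), zones...)
		rand.Shuffle(len(shuffled), func(i, j int) {
			shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
		})
		for _, zone := range shuffled {
			if attempts >= maxAttempts {
				return fmt.Errorf("giving up after %d attempts: %w", attempts, lastErr)
			}
			attempts++
			if err := tryCreateInstance(name, zone, mt); err != nil {
				lastErr = err
				continue // next zone, then next machine type
			}
			return nil // launched successfully
		}
	}
	return fmt.Errorf("no machine type/zone combination succeeded: %w", lastErr)
}
```

Shuffling zones per machine type keeps the first machine type preferred while still spreading attempts across zones, which matches the order described in the list above.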

upodroid · Jun 05 '25 19:06

Having a priority list of VM shapes, zones, or even regions sounds good. I don't care whether tests run in Asia or Europe.

One thing: we need a clear and easy way to check in the logs what was finally picked. Even better if it becomes metadata for the triage tool, so flakes can be grouped by machine type or zone.
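A minimal sketch of what recording the final pick could look like, assuming the launcher can drop a small JSON file next to the build artifacts. The file name and keys here are hypothetical, not an existing test-infra convention; the real integration point would be whatever the triage tooling already consumes.

```go
package main

import (
	"encoding/json"
	"os"
)

// PickedVM captures the shape and zone the fallback loop ended up with.
type PickedVM struct {
	MachineType string `json:"machine-type"`
	Zone        string `json:"zone"`
	Attempts    int    `json:"attempts"`
}

// writePickedVM serializes the pick so triage/flake tooling can group by it.
func writePickedVM(path string, picked PickedVM) error {
	data, err := json.MarshalIndent(picked, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o644)
}

func main() {
	// Example: record that the third attempt landed on e2-standard-2.
	_ = writePickedVM("artifacts/vm-metadata.json", PickedVM{
		MachineType: "e2-standard-2",
		Zone:        "us-central1-b",
		Attempts:    3,
	})
}
```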

SergeyKanzhelev · Jun 05 '25 20:06

/cc

ffromani · Jun 25 '25 17:06

@BenTheElder @upodroid any ideas on the next steps? Should this be done across all SIGs, not just SIG Node?

SergeyKanzhelev · Jun 25 '25 17:06

I would like to help/work on this. If I understand correctly, the work would need to be done in the kubetest2-gce deployer, or am I missing something? @upodroid @BenTheElder

elieserr · Jul 06 '25 19:07

You can start by trying to replace all instances of n1-standard-* in this repository. Make the changes, merge them, and then keep an eye on the jobs for the first few runs.
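As a starting point, here is a throwaway Go sketch of that bulk edit. The n1-standard-N to e2-standard-N mapping is an assumption; jobs that need GPUs, Local SSDs, or stable performance (e.g. scale tests) would need individual review, as noted elsewhere in this thread.

```go
package main

import (
	"bytes"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"regexp"
	"strings"
)

// n1 matches n1-standard-N so the size suffix can be preserved.
var n1 = regexp.MustCompile(`n1-standard-(\d+)`)

func main() {
	err := filepath.WalkDir(".", func(path string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil || d.IsDir() || !strings.HasSuffix(path, ".yaml") {
			return walkErr
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		// Rewrite to the same-sized E2 shape (an assumed 1:1 mapping).
		out := n1.ReplaceAll(data, []byte("e2-standard-$1"))
		if bytes.Equal(out, data) {
			return nil
		}
		fmt.Println("rewriting", path)
		return os.WriteFile(path, out, 0o644)
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```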

upodroid · Jul 06 '25 19:07

Will have to watch out for this:

https://cloud.google.com/compute/docs/general-purpose-machines#e2_machine_types

Doesn't support GPUs, Local SSDs, sole-tenant nodes, or nested virtualization.

https://github.com/kubernetes/test-infra/pull/35126

BenTheElder · Jul 11 '25 18:07

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Oct 09 '25 19:10

I think this is complete.

kannon92 · Nov 05 '25 19:11