Migrate away from n1-machine-{1-4} for k8s e2e testing
N1 machines are slower, and the suggestion is to start migrating these jobs to e2 machine types. E2 machines are faster and more widely available, so this should help mitigate stockout errors.
cc @BenTheElder @ameukam
/sig node
/sig testing
@BenTheElder who is the best person to decide on this? Ultimately, for most tests we do not care about the VM shape.
/sig k8s-infra
@upodroid was looking into these trade-offs recently.
IMHO, we should move to newer machine types over time unless it's way more expensive.
We can also move to more expensive machine types when we get a big performance win. E.g. the CI cluster is moving to C4 and C4D because they let things like builds go a fair bit faster (20-25% for some workloads, IIRC), but I wouldn't move everything yet because there are some availability issues; we can keep some warm minimum scaled capacity for the CI cluster, but e2e VMs cannot.
We also should probably be careful with jobs like scale tests where giving the test different machines might impact detecting performance regressions.
For workloads that aren't very sensitive, e2 seems reasonable: https://cloud.google.com/compute/docs/machine-resource#recommendations_for_machine_types
We should fix the node gce launcher to do the following:
- pass a region instead of a zone and also a list of machine types that you want to use
- us-central1 should be the preferred CI region, as it's closest to the CI cluster and has the most VM types available. It's also one of the largest regions in GCP
- it should attempt to launch the instance in a random zone of the region with the first machine type on the list; if that fails, attempt to launch the VM in a different zone with the same machine type, and then retry the whole step again with a different machine type (see the sketch after this list)
- after 4 attempts, we should consider the job failed and return the current error
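A minimal sketch of that fallback loop, assuming the launcher shells out to `gcloud compute instances create`; the helper name, the hard-coded zone list, and the attempt budget below are illustrative assumptions, not the current launcher code:

```go
// Sketch of the zone/machine-type fallback described above. The helper name,
// the hard-coded zone list, and the total-attempt budget are illustrative
// assumptions, not the real kubetest2-gce / node launcher code.
package main

import (
	"fmt"
	"math/rand"
	"os/exec"
)

// createInstance shells out to `gcloud compute instances create`; it assumes
// gcloud is installed and authenticated and that project defaults are set.
func createInstance(name, zone, machineType string) error {
	cmd := exec.Command("gcloud", "compute", "instances", "create", name,
		"--zone="+zone, "--machine-type="+machineType)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("create in %s as %s failed: %v: %s", zone, machineType, err, out)
	}
	return nil
}

// launchWithFallback tries the zones of a region in random order for each
// machine type in priority order, giving up after maxAttempts total attempts.
// It returns the zone and machine type that finally worked so they can be
// logged and attached to the job as metadata.
func launchWithFallback(name string, zones, machineTypes []string, maxAttempts int) (string, string, error) {
	attempts := 0
	var lastErr error
	for _, mt := range machineTypes {
		// Shuffle the zones so instances spread across the region.
		shuffled := append([]string(nil), zones...)
		rand.Shuffle(len(shuffled), func(i, j int) { shuffled[i], shuffled[j] = shuffled[j], shuffled[i] })
		for _, zone := range shuffled {
			if attempts >= maxAttempts {
				return "", "", fmt.Errorf("giving up after %d attempts: %w", attempts, lastErr)
			}
			attempts++
			if err := createInstance(name, zone, mt); err != nil {
				lastErr = err
				continue
			}
			return zone, mt, nil
		}
	}
	return "", "", fmt.Errorf("all machine types exhausted after %d attempts: %w", attempts, lastErr)
}

func main() {
	zones := []string{"us-central1-a", "us-central1-b", "us-central1-c", "us-central1-f"}
	machineTypes := []string{"e2-standard-2", "n1-standard-2"}
	zone, mt, err := launchWithFallback("e2e-node-test", zones, machineTypes, 4)
	if err != nil {
		fmt.Println("launch failed:", err)
		return
	}
	fmt.Printf("launched zone=%s machine-type=%s\n", zone, mt)
}
```

The important property is that every retry changes either the zone or the machine type, and the combination that finally worked is returned so it can be logged.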
Having a priority list of VM shapes, zones, or even regions sounds good. I don't care if tests run in Asia or Europe.
One thing: we need a clear and easy way to check in the logs what was finally picked. Even better if this becomes metadata for the triage tool, so flakes can be grouped by machine type or zone.
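One hypothetical way to do that (the artifact file name and JSON field names here are assumptions, not an existing convention of the triage tooling) is to have the launcher write the picked values into the job's artifacts directory, which prow exposes via the `ARTIFACTS` environment variable:

```go
// Hypothetical sketch: record the zone and machine type that were finally
// picked as a small JSON artifact next to the job's other output, so a triage
// tool could group flakes by these fields. The file name and field names are
// assumptions, not an existing convention.
package main

import (
	"encoding/json"
	"os"
	"path/filepath"
)

type machineMetadata struct {
	Zone        string `json:"zone"`
	MachineType string `json:"machineType"`
}

func writeMachineMetadata(artifactsDir, zone, machineType string) error {
	data, err := json.MarshalIndent(machineMetadata{Zone: zone, MachineType: machineType}, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(artifactsDir, "machine-metadata.json"), data, 0o644)
}

func main() {
	// Prow jobs conventionally upload whatever lands in $ARTIFACTS; fall back
	// to the current directory when running locally.
	dir := os.Getenv("ARTIFACTS")
	if dir == "" {
		dir = "."
	}
	if err := writeMachineMetadata(dir, "us-central1-b", "e2-standard-2"); err != nil {
		panic(err)
	}
}
```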
/cc
@BenTheElder @upodroid any ideas on the next steps? Should it be done across all sigs, not just sig node?
I would like to help/work on this. If I understand it right, the work would need to be done in the kubetest2-gce deployer, or am I missing something? @upodroid @BenTheElder
You can start by trying to replace all instances of n1-standard-* in this repository. Make the changes, merge them, and then keep an eye on the jobs for the first few runs.
Will have to watch out for this:
https://cloud.google.com/compute/docs/general-purpose-machines#e2_machine_types
Doesn't support GPUs, Local SSDs, sole-tenant nodes, or nested virtualization.
https://github.com/kubernetes/test-infra/pull/35126
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
I think this is complete.