
eks-prow-build-cluster: Reconsider instance type selection

Open tzneal opened this issue 2 years ago • 18 comments

What should be cleaned up or changed:

Some changes were made to the EKS cluster in an attempt to resolve test flakes, but they also increased the per-node cost. We should consider reverting these changes to reduce cost:

a) Changing to an instance type without instance storage.

b) Changing back to an AMD CPU type.

c) Changing to a roughly 8 CPU / 64 GB type to more closely match the existing GCP cluster nodes.

The cluster currently uses an r5d.4xlarge (16 CPU / 128 GB) with an on-demand cost of $1.152 per hour.

An r5a.4xlarge (16 CPU / 128 GB) has an on-demand cost of $0.904 per hour.

An r5a.2xlarge (8 CPU / 64 GB) has an on-demand cost of $0.45 per hour.
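For scale, assuming on-demand pricing around the clock (roughly 730 hours per month), moving a node from r5d.4xlarge to r5a.2xlarge works out to:

$$(1.152 - 0.45)\,\text{\$/hr} \times 730\,\text{hr/month} \approx \text{\$512 saved per node per month}$$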

Provide any links for context:

tzneal avatar Mar 30 '23 13:03 tzneal

/sig k8s-infra

tzneal avatar Mar 30 '23 14:03 tzneal

I'm going to transfer this issue to k/k8s.io as other issues related to this cluster are already there. /transfer-issue k8s.io

xmudrii avatar Apr 02 '23 20:04 xmudrii

/assign @xmudrii @pkprzekwas

xmudrii avatar Apr 02 '23 20:04 xmudrii

One thing to consider: because Kubernetes doesn't have I/O or IOPS isolation, sizing really large nodes changes the CPU : I/O ratio (and that ratio won't be 1:1 between GCP and AWS anyhow). So while really large nodes allow high-core-count jobs OR bin-packing more jobs per node, the latter can cause issues by over-packing for I/O throughput.

This is less of an issue today than when we ran Bazel builds widely, but it can still cause performance problems. The existing size is semi-arbitrary and may be somewhat GCP-specific, but right now tests that are likely to be I/O-heavy sometimes reserve that I/O by reserving ~all of the CPU at our current node sizes.

BenTheElder avatar Apr 02 '23 20:04 BenTheElder
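To illustrate the ratio argument with made-up numbers (the per-job CPU request and disk throughput below are assumptions, not measurements): the more jobs the scheduler packs onto a node by CPU alone, the smaller each job's share of that node's single local disk.

```go
// Illustrative sketch only: hypothetical CPU request and disk throughput,
// showing how CPU-only bin-packing divides one node-local disk across jobs.
package main

import "fmt"

func main() {
	const jobCPU = 7.0     // assumed CPU request of an I/O-heavy job
	const diskMBps = 500.0 // assumed throughput of the node's single local SSD

	for _, nodeCPU := range []float64{8, 16, 32} {
		jobs := int(nodeCPU / jobCPU) // jobs packed per node by CPU request alone
		if jobs < 1 {
			jobs = 1
		}
		fmt.Printf("%2.0f-CPU node: %d job(s) -> ~%.0f MB/s of local disk each\n",
			nodeCPU, jobs, diskMBps/float64(jobs))
	}
}
```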

xref #4686

xmudrii avatar Apr 02 '23 20:04 xmudrii

To add to what @BenTheElder said: we already had issues with GOMAXPROCS for unit tests. We have "migrated" 5 jobs so far and one was affected (potentially one more). To avoid such issues, we might want instances close to what we have on GCP. We can't have a 1:1 mapping, but we can try using similar instances based on what AWS offers.

Not having to deal with things such as GOMAXPROCS will make the migration smoother and we'll avoid spending a lot of time debugging such issues.

xmudrii avatar Apr 02 '23 20:04 xmudrii

@xmudrii fyi https://github.com/kubernetes/kubernetes/pull/117016

dims avatar Apr 02 '23 20:04 dims

@dims Thanks for driving this forward. But just to note: this fixes it only for k/k; other subprojects might be affected and would need to apply a similar patch.

xmudrii avatar Apr 02 '23 20:04 xmudrii
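A minimal sketch of one common way for a subproject to make GOMAXPROCS follow the container's CPU limit rather than the node's core count. The use of go.uber.org/automaxprocs here is an assumption for illustration, not necessarily what kubernetes/kubernetes#117016 does:

```go
// Sketch: on a 16-core node the Go runtime defaults GOMAXPROCS to 16 even if
// the pod is limited to, say, 4 CPUs. go.uber.org/automaxprocs resets it from
// the cgroup CPU quota at startup.
package main

import (
	"fmt"
	"runtime"

	_ "go.uber.org/automaxprocs" // adjusts GOMAXPROCS to the container CPU quota on Linux
)

func main() {
	// Inside a pod limited to 4 CPUs on an r5d.4xlarge this prints 4, not 16.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0), "NumCPU:", runtime.NumCPU())
}
```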

Go is expected to solve GOMAXPROCS upstream (a proposal to detect this in the stdlib has been accepted), and GOMAXPROCS can also be set in CI in the meantime. As-is, jobs already have this wrong, and we should resolve that independently of selecting node size.

BenTheElder avatar Apr 02 '23 20:04 BenTheElder

As-is, jobs already have this wrong, and we should resolve that independently of selecting node size.

+1 for setting this on existing jobs. I have a secret hope that it might generally reduce flakiness a bit.

tzneal avatar Apr 03 '23 02:04 tzneal
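Setting GOMAXPROCS through the environment is enough in the meantime, since the Go runtime honors the GOMAXPROCS environment variable natively. A hypothetical canary a job could run to confirm the variable actually took effect:

```go
// Hypothetical check for a CI job: print the effective GOMAXPROCS and fail
// if the GOMAXPROCS environment variable was not set for this job.
package main

import (
	"fmt"
	"os"
	"runtime"
)

func main() {
	env := os.Getenv("GOMAXPROCS")
	fmt.Printf("GOMAXPROCS env=%q effective=%d NumCPU=%d\n",
		env, runtime.GOMAXPROCS(0), runtime.NumCPU())
	if env == "" {
		fmt.Println("GOMAXPROCS is not pinned; tests will see the full node core count")
		os.Exit(1)
	}
}
```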

Maybe try some bare metal node like an m5.2xlarge or m6g.2xlarge?

TerryHowe avatar Apr 03 '23 14:04 TerryHowe

@TerryHowe We need to use memory-optimized instances because our jobs tend to use a lot of memory.

xmudrii avatar Apr 03 '23 15:04 xmudrii

Update: we decided to go with a three-step phased approach:

  • Switch from r5d.4xlarge to r6id.2xlarge (this instance size should be very close to what we have on GCP)
  • Switch from r6id.2xlarge to r6i.2xlarge (i.e. switch from SSDs to EBS)
  • Switch from r6i.2xlarge to r6a.2xlarge (i.e. switch to AMD CPUs)

Note: the order of phases might get changed.

Each phase should last at least 24 hours to ensure that tests are stable. I just started the first phase and I think we should leave it on until Wednesday morning CEST.

xmudrii avatar Apr 03 '23 15:04 xmudrii

Update: we tried r6id.2xlarge but it seems that 8 vCPUs are not enough:

  Type     Reason             Age   From                Message
  ----     ------             ----  ----                -------
  Warning  FailedScheduling   44s   default-scheduler   0/20 nodes are available: 20 Insufficient cpu. preemption: 0/20 nodes are available: 20 No preemption victims found for incoming pod.
  Normal   NotTriggerScaleUp  38s   cluster-autoscaler  pod didn't trigger scale-up: 1 Insufficient cpu

I'm trying r5ad.4xlarge instead.

xmudrii avatar Apr 03 '23 16:04 xmudrii

/retitle eks-prow-build-cluster: Reconsider instance type selection

xmudrii avatar Apr 25 '23 15:04 xmudrii

@xmudrii are we still doing this? Do we want to use an instance type with fewer resources?

ameukam avatar Nov 15 '23 08:11 ameukam

@ameukam I would still like to look into this, but we'd most likely need to adopt Karpenter to be able to do this (#5168).

/lifecycle frozen

xmudrii avatar Nov 15 '23 09:11 xmudrii

Blocked by #5168

/unassign @xmudrii @pkprzekwas

xmudrii avatar Feb 12 '24 16:02 xmudrii