Optimize build cluster performance
This issue aims to track all things related to optimizing our build cluster performance.
We have done a lot of work to reduce test flakes, but we still see them relatively often. In a large number of cases, these appear to occur when things that should always succeed fail for reasons outside of poorly written tests or buggy Istio code. For example, simple HTTP requests timing out after many seconds.
We have had two similar issues in the past:
- https://github.com/istio/test-infra/issues/1988 was caused by not properly cleaning up resources, leading to a ton of resources accumulating in the cluster over time. This was fixed by ensuring we clean up (through many different mechanisms).
- https://github.com/istio/istio/issues/32985: jobs suddenly started hanging a lot, with `echo` taking over 60s in some cases. This was triggered by a node upgrade in GKE. We switched from Ubuntu to COS to mitigate it; the root cause is still unknown.
Current state:
- Tests often fail for reasons that are likely explained by node performance (i.e., a trivial command is throttled heavily for N seconds, and the test is not robust against this). While we expect our tests to be robust against this to some degree, it appears N is sometimes extremely large. For example, we have a lot of tests that send 5 requests and expect all 5 to succeed, with many retries and a 30s timeout (sketched below the graphs). These fail relatively often.
- We have a metric that captures the time it takes to run `echo`. On a healthy machine, this should, of course, take near 0ms. We often see this spike, correlated with increased CPU usage.

The top graph shows this metric grouped by node type, the bottom shows all nodes. You can see spikes up to 2.5s. Note: the node type graph is likely misleading; we have a small fixed number of n2/t2d nodes but a large dynamic number of e2 nodes, which means there are more samples for e2 and it has more cache misses.
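To make this concrete, here is a minimal sketch of the kind of probe behind that metric: time a trivial `echo` and report how long it took. The reporting side is omitted and the names are illustrative, not the actual collector:

```go
// Sketch of the "time to run echo" probe: run a trivial command and measure
// how long the exec takes. On a healthy node this is near 0ms; under heavy
// CPU contention or throttling we have seen it spike to multiple seconds.
package main

import (
	"fmt"
	"os/exec"
	"time"
)

func main() {
	start := time.Now()
	if err := exec.Command("echo", "hello").Run(); err != nil {
		fmt.Println("echo failed:", err)
		return
	}
	// In the real setup this duration is exported as a metric rather than printed.
	fmt.Printf("echo took %v\n", time.Since(start))
}
```

And the "5 requests, all 5 must succeed" pattern from the first bullet looks roughly like the following (simplified; the URL, request count, and retry interval are placeholders, not the real test helper):

```go
// Simplified sketch of a test that sends 5 requests, requires all 5 to
// succeed, and retries the whole check until a 30s budget is exhausted.
// If the node stalls longer than the retry budget allows, this flakes.
package main

import (
	"fmt"
	"net/http"
	"time"
)

// fiveSuccesses fails if any of the 5 requests errors or returns non-200.
func fiveSuccesses(url string) error {
	for i := 0; i < 5; i++ {
		resp, err := http.Get(url)
		if err != nil {
			return err
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return fmt.Errorf("request %d: status %d", i, resp.StatusCode)
		}
	}
	return nil
}

func main() {
	deadline := time.Now().Add(30 * time.Second) // overall retry budget
	for {
		err := fiveSuccesses("http://echo.test.svc.cluster.local:8080") // placeholder target
		if err == nil {
			fmt.Println("all 5 requests succeeded")
			return
		}
		if time.Now().After(deadline) {
			fmt.Println("failed after 30s of retries:", err)
			return
		}
		time.Sleep(time.Second)
	}
}
```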
Things to try:
- [x] Setting CPU limits: https://github.com/istio/test-infra/commit/9dadd370e10ebec3d85878f0b2f890c455838dd2. No tangible improvements in any metric
- [ ] Guaranteed QoS test pods (a superset of CPU limits); see the sketch after this list
- [ ] kubelet static CPU policy (a superset of Guaranteed QoS)
- [ ] Running other node types (n2, t2d). Currently trialing this. No conclusive data.
- [ ] Using local SSDs. Currently we run 512/256GB pd-ssd. There is evidence we are IO-bound in some portion of tests: graphs show our bandwidth is often at the cap, and we do see up to 8MB/s of write throttling. However, there is no evidence that removing the bottleneck would change test results; most of our tests are not IO-bound. kind's etcd runs in tmpfs and should be unimpacted. Local SSDs are actually cheaper and far faster; however, they require n2 nodes.
- [x] Increasing CPU requests on some jobs. https://github.com/istio/test-infra/commit/d28ae63a6d1e026502ef2a12423380e4c40d2525 and https://github.com/istio/test-infra/commit/3a0765c2cd16f36b29d813fd99308d66eb42726d put the most expensive ones at 15 CPUs, ensuring dedicated nodes. Since this change, unit test runtime has dropped substantially, but there is not yet strong evidence that it impacts the flakiness of other tests.
- [ ] Build once, test in many places. Currently we build all Docker images N times, and some test binaries N times. This is fairly expensive even with a cache. It would be ideal to build once (possibly on some giant nodes) and then just run the tests locally. This is likely a massive effort.
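For the Guaranteed QoS and static CPU policy items above, a minimal sketch of what the test pod's resources would need to look like, assuming we can express this through the job's pod spec (the CPU/memory values are illustrative, not a proposal):

```go
// Sketch: resources that would put a test pod in the Guaranteed QoS class.
// Values are illustrative, not the actual job configuration.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func guaranteedResources(cpu, memory string) corev1.ResourceRequirements {
	r := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse(cpu),
		corev1.ResourceMemory: resource.MustParse(memory),
	}
	// Guaranteed QoS requires requests == limits for every resource on every
	// container. With the kubelet static CPU policy, an integer CPU value
	// additionally pins the container to exclusive cores.
	return corev1.ResourceRequirements{Requests: r, Limits: r}
}

func main() {
	fmt.Printf("%+v\n", guaranteedResources("15", "32Gi"))
}
```

The reason the items are ordered as supersets: CPU limits alone just cap usage, Guaranteed QoS additionally requires requests == limits on every container, and the static CPU policy builds on Guaranteed QoS by giving integer-CPU containers exclusive cores.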
Starting to look at co-located jobs during flakes
Flake: https://prow.istio.io/view/gs/istio-prow/logs/integ-k8s-120_istio_postsubmit/1498955294602956800 - a mysterious 141 error. OOM? At 2022-03-02T09:40:43.788454Z.
Co-located with integ-pilot (started at the same time) and integ-assertion (deep into its run).
Total memory used by all 3 is only 12GB, which is not very concerning.
https://pantheon.corp.google.com/monitoring/metrics-explorer?pageState=%7B%22xyChart%22:%7B%22dataSets%22:%5B%7B%22timeSeriesFilter%22:%7B%22filter%22:%22metric.type%3D%5C%22kubernetes.io%2Fcontainer%2Fcpu%2Fcore_usage_time%5C%22%20resource.type%3D%5C%22k8s_container%5C%22%20metadata.system_labels.%5C%22node_name%5C%22%3D%5C%22gke-prow-istio-test-pool-cos-1104557f-xmm6%5C%22%20resource.label.%5C%22cluster_name%5C%22%3D%5C%22prow%5C%22%22,%22minAlignmentPeriod%22:%2260s%22,%22aggregations%22:%5B%7B%22perSeriesAligner%22:%22ALIGN_RATE%22,%22crossSeriesReducer%22:%22REDUCE_SUM%22,%22alignmentPeriod%22:%2260s%22,%22groupByFields%22:%5B%22metadata.user_labels.%5C%22prow.k8s.io%2Fid%5C%22%22,%22metadata.user_labels.%5C%22prow.k8s.io%2Fjob%5C%22%22%5D%7D,%7B%22crossSeriesReducer%22:%22REDUCE_NONE%22,%22alignmentPeriod%22:%2260s%22,%22groupByFields%22:%5B%5D%7D%5D%7D,%22targetAxis%22:%22Y1%22,%22plotType%22:%22LINE%22%7D%5D,%22options%22:%7B%22mode%22:%22COLOR%22%7D,%22constantLines%22:%5B%5D,%22timeshiftDuration%22:%220s%22,%22y1Axis%22:%7B%22label%22:%22y1Axis%22,%22scale%22:%22LINEAR%22%7D%7D,%22isAutoRefresh%22:true,%22timeSelection%22:%7B%22timeRange%22:%221w%22%7D,%22xZoomDomain%22:%7B%22start%22:%222022-03-02T05:26:15.292Z%22,%22end%22:%222022-03-02T12:34:36.558Z%22%7D%7D&project=istio-prow-build
https://prow.istio.io/view/gs/istio-prow/logs/integ-security-multicluster_istio_postsubmit/1498866650949095424
TestReachability/global-plaintext/b_in_primary/tcp_to_headless:tcp_positive failure at 2022-03-02T04:10:57.231356Z
Co-located with integ-k8s-119, which started at the same time. It was using near-zero CPU at the time of the test failure - it was literally doing nothing (a bug of its own).
https://pantheon.corp.google.com/monitoring/metrics-explorer?pageState=%7B%22xyChart%22:%7B%22dataSets%22:%5B%7B%22timeSeriesFilter%22:%7B%22filter%22:%22metric.type%3D%5C%22kubernetes.io%2Fcontainer%2Fcpu%2Fcore_usage_time%5C%22%20resource.type%3D%5C%22k8s_container%5C%22%20metadata.system_labels.%5C%22node_name%5C%22%3D%5C%22gke-prow-istio-test-pool-cos-1104557f-qnxm%5C%22%20resource.label.%5C%22cluster_name%5C%22%3D%5C%22prow%5C%22%22,%22minAlignmentPeriod%22:%2260s%22,%22aggregations%22:%5B%7B%22perSeriesAligner%22:%22ALIGN_RATE%22,%22crossSeriesReducer%22:%22REDUCE_SUM%22,%22alignmentPeriod%22:%2260s%22,%22groupByFields%22:%5B%22metadata.user_labels.%5C%22prow.k8s.io%2Fid%5C%22%22,%22metadata.user_labels.%5C%22prow.k8s.io%2Fjob%5C%22%22%5D%7D,%7B%22crossSeriesReducer%22:%22REDUCE_NONE%22,%22alignmentPeriod%22:%2260s%22,%22groupByFields%22:%5B%5D%7D%5D%7D,%22targetAxis%22:%22Y1%22,%22plotType%22:%22LINE%22%7D%5D,%22options%22:%7B%22mode%22:%22COLOR%22%7D,%22constantLines%22:%5B%5D,%22timeshiftDuration%22:%220s%22,%22y1Axis%22:%7B%22label%22:%22y1Axis%22,%22scale%22:%22LINEAR%22%7D%7D,%22isAutoRefresh%22:true,%22timeSelection%22:%7B%22timeRange%22:%221d%22%7D,%22xZoomDomain%22:%7B%22start%22:%222022-03-02T03:42:26.942Z%22,%22end%22:%222022-03-02T05:25:18.371Z%22%7D%7D&project=istio-prow-build
https://prow.istio.io/view/gs/istio-prow/logs/integ-security-multicluster_istio_postsubmit/1498793275824279552 double failure!
TestReachability/beta-mtls-permissive/b_in_primary/tcp_to_b:tcp_positive at 2022-03-01T23:10:08.013392Z
TestMtlsStrictK8sCA/global-mtls-on-no-dr/b_in_remote/tcp_to_a:tcp_positive at 2022-03-01T23:18:19.563414Z
Co-scheduled with a distroless job that started later. The distroless job's CPU peaks from 23:00 but it is done by 23:05 - way before the failures.
Also co-scheduled with a helm test. That one runs from 23:10 to 23:16, so it really shouldn't overlap with either of the failures - it is close though.
So in the 3 cases I looked at, co-scheduling doesn't seem to be related.