                        Simplify load test
With the migration to CL2 being roughly done, I would like to simplify and evolve the load test a bit to:
- create a Service in a bundle with each RC (see the sketch below)
  - the complication is that we currently create a Service for only half of the RCs, but fortunately in new enough releases we can solve that by using the mechanism introduced in https://github.com/kubernetes/kubernetes/pull/71355
- switch to using Deployment instead of RC
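
To make the bundling idea concrete, here is a rough sketch of what a ClusterLoader2 phase could look like with the Service created in the same object bundle as its workload. The step name, tuning set, counts, and template paths are illustrative placeholders, not the actual config from this repo:

```yaml
steps:
- name: Creating objects
  phases:
  - namespaceRange:
      min: 1
      max: 100              # illustrative; the real config derives this from cluster size
    replicasPerNamespace: 10
    tuningSet: Uniform5qps  # illustrative tuning set name
    objectBundle:
    # The Service is created (and later deleted) together with its workload,
    # instead of being managed in a separate step for only half of the RCs.
    - basename: small-service
      objectTemplatePath: service.yaml
    - basename: small-deployment
      objectTemplatePath: deployment.yaml
```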
/assign @krzysied
@kubernetes/sig-scalability-feature-requests
The problem I see here is that running this simplified load test against older k8s versions will not work. I think we need to split the tests into legacy/ and head/ directories. This change will also be required when we switch to the new api responsiveness metric. Wdyt?
- [x] Creating legacy and head directories with copies of test configs #450
- [x] Changing scalability jobs config
- [x] Removing all test configs
 
I'm fine with legacy/ and head/ for other reasons (e.g. responsiveness metric).
But I don't understand why you need it for the purpose of this issue?
IIUC we want to use kubernetes/kubernetes#71355, which will not be available for older k8s versions, right?
Ahh - that's what you meant. Yeah, though given that we only run older releases on 100-node tests, even if that had no effect there, it should still be fine...
Actually, I think the PR that was submitted wasn't exactly what we want. We should create head/ and legacy/ subdirectories within the individual tests, as needed.
Or rather, I would say we should go one step further:
- create a legacy/ directory (within the tests that we need to modify, so probably both of them)
- migrate old branches to use configs from that directory
- leave everything else as is (see the layout sketch below)

Once this is done, we may start modifying the current configs.
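For illustration, the resulting layout could look roughly like this (the directory and file names here are hypothetical and don't necessarily match what was actually committed):

```
clusterloader2/testing/
├── load/
│   ├── config.yaml        # "head" config, free to evolve
│   └── legacy/
│       └── config.yaml    # frozen copy used by old release branches
└── density/
    ├── config.yaml
    └── legacy/
        └── config.yaml
```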
Point (2) from the original description is mostly done in #488.
However, as discussed offline with @krzysied, while this fixes the kube-proxy issues, it still leads to much higher DNS resource usage, which leads to OOMs. So in order to make this change, we would need to somehow adjust the DNS resource limits based on the size of the cluster.
So the remaining stuff to do is:
- adjust DNS resource limits
- bundle service together with RC
- switch from RC to Deployment
 
With https://github.com/kubernetes/kubernetes/pull/77918, adjusting DNS resource limits will be possible (see the sketch below).
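
For illustration, this is the kind of per-cluster-size override we'd want for the DNS deployment; the numbers and the way they would be parameterized are placeholders, not what the PR above actually implements and not tuned recommendations:

```yaml
# Hypothetical CoreDNS container resources for a large (e.g. 5k-node) cluster.
resources:
  requests:
    cpu: 100m
    memory: 70Mi
  limits:
    memory: 300Mi   # scaled up from the default, roughly with cluster size
```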
@krzysied - once we're out of the current regressions, this is an important thing to look at.
@mm4tt - you already did part of it (switching to Deployments).
Would you mind taking over the other parts too (adjusting DNS limits and bundling the Service together with the Deployment)? [That would also pave the way for other bundling later, like Secrets or ConfigMaps.]
/assign @mm4tt
Sure thing, will take it over from here.
I've run the experimental load config on a 5k-node cluster over the weekend. Some observations below:

1. CoreDNS memory usage went up (expected, as we have more services).
   [Baseline vs. Simplified Load graphs]

2. Network traffic went significantly down during the delete-deployment phase (expected, as we send far fewer Endpoints object updates, because services are now deleted together with deployments); it also made the api-server CPU usage visibly less spiky.
   [Baseline vs. Simplified Load graphs]

3. The API Call Latency SLI for POST/DELETE Services went significantly up (and it's considered a blocker for enabling the simplified load in its current form). The increase in latency seems to be coming from etcd, which looks like it's getting overloaded (possibly related to https://github.com/kubernetes/kubernetes/issues/80209); in the experimental run the latency of these operations is much higher than in the baseline, where services are created before / deleted after the deployments (and the load on etcd is much lower).
   The important thing is that Service objects are unique in k8s: creating/deleting one requires 3 etcd writes (the Service itself plus two RangeAllocators). An increase in etcd latency is therefore reflected very strongly in the api call latency for Services.
   Overall it breaks the Api-Call-Latency SLO for DELETE (both the prometheus-based and the old measurement) and for POST (the old measurement seems fine, but prometheus is above 1s).
   [Baseline vs. Simplified Load graphs]
Given (3), this is currently blocked. We can get back to it once etcd performance gets improved (e.g. once https://github.com/kubernetes/kubernetes/issues/80209 gets fixed).
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
/lifecycle frozen