
Scalability load test extended to exercise Deployments, DaemonSets, StatefulSets, Jobs, PersistentVolumes, Secrets, ConfigMaps, NetworkPolicies

Open mm4tt opened this issue 6 years ago • 8 comments

Implemented

  • [x] Deployments
  • [x] DaemonSets
  • [x] StatefulSets
  • [x] Jobs
  • [x] Persistent Volumes
  • [x] Secrets
  • [x] ConfigMaps
  • [ ] NetworkPolicies - WIP https://github.com/kubernetes/perf-tests/pull/719, https://github.com/kubernetes/test-infra/pull/13709

Enabled in CI/CD

  • [x] Deployments
  • [ ] DaemonSets
  • [x] StatefulSets
  • [x] Jobs
  • [ ] Persistent Volumes
  • [x] Secrets
  • [x] ConfigMaps
  • [ ] NetworkPolicies

mm4tt avatar Jul 29 '19 17:07 mm4tt

/assign

mm4tt avatar Jul 29 '19 17:07 mm4tt

I ran a 5K-node test yesterday, using the extended load scenario with Secrets, ConfigMaps, StatefulSets, and PVs enabled. The test passed, but the new Prometheus-based api-call latency measurement failed for a few tuples:

W0731 02:58:04.461] I0731 02:58:04.459614   10763 api_responsiveness_prometheus.go:90] APIResponsiveness: WARNING Top latency metric: {Resource:persistentvolumes Subresource: Verb:POST Scope:cluster Latency:perc50: 266.666666ms, perc90: 1.375s, perc99: 2.443099999s Count:553}; threshold: 1s
W0731 02:58:04.461] I0731 02:58:04.459626   10763 api_responsiveness_prometheus.go:90] APIResponsiveness: WARNING Top latency metric: {Resource:leases Subresource: Verb:GET Scope:namespace Latency:perc50: 32.018464ms, perc90: 143.243359ms, perc99: 1.663881526s Count:17274382}; threshold: 1s
W0731 02:58:04.461] I0731 02:58:04.459643   10763 api_responsiveness_prometheus.go:90] APIResponsiveness: WARNING Top latency metric: {Resource:services Subresource: Verb:DELETE Scope:namespace Latency:perc50: 1.215999999s, perc90: 1.472s, perc99: 1.4972s Count:8251}; threshold: 1s
W0731 02:58:04.461] I0731 02:58:04.459655   10763 api_responsiveness_prometheus.go:90] APIResponsiveness: WARNING Top latency metric: {Resource:namespaces Subresource: Verb:GET Scope:cluster Latency:perc50: 26.875234ms, perc90: 48.375422ms, perc99: 1.14s Count:5228}; threshold: 1s
W0731 02:58:04.462] I0731 02:58:04.459668   10763 api_responsiveness_prometheus.go:90] APIResponsiveness: WARNING Top latency metric: {Resource:configmaps Subresource: Verb:POST Scope:namespace Latency:perc50: 28.571428ms, perc90: 88.1ms, perc99: 1.02472s Count:18627}; threshold: 1s
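For context, the Prometheus-based measurement derives percentiles like these from the apiserver's request-duration histogram. A rough sketch of the kind of query involved (the exact metric name, labels, and window are my assumption, not copied from api_responsiveness_prometheus.go):

```promql
# 99th-percentile API call latency per (resource, subresource, verb, scope),
# derived from the apiserver's cumulative histogram buckets over a 5m window.
histogram_quantile(
  0.99,
  sum(rate(apiserver_request_duration_seconds_bucket[5m]))
    by (resource, subresource, verb, scope, le)
)
```

Because each sample comes from a short rate() window, a single short-lived latency spike can dominate the quantile for that window, which is why one spike (e.g. around logrotate, as below) can fail the whole run.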

I checked the Prometheus graphs for that run, and in all cases it looks like the Prometheus-based api-call measurement was broken by single spikes (happening around log rotation):

**POST PersistentVolumes** (latency graph)

**DELETE Services** (latency graph)

This is a known problem with the Prometheus-based api-call latency measurement (it's actually the reason it's currently disabled). @oxddr and @krzysied are working on this, and hopefully we'll have a solution soon.

So to summarize, the extended load looks very promising. Once we have a solution to the spike problem in the Prometheus api-call-latency measurement, we should be good (or really close) to enable it in CI/CD.

mm4tt avatar Jul 31 '19 10:07 mm4tt

After discussing with the team, we agreed that we should be good to enable Secrets and ConfigMaps in the CI/CD tests.

On the other hand, it might be tricky, as we currently have a separate, experimental config for the extended load. I think it might be easiest to first implement everything there, then move it out of the experimental directory and make it the default load config, and finally enable the new objects gradually via overrides.
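For illustration, clusterloader2 test configs read parameters via DefaultParam, and an override file passed with --testoverrides can flip them per job. A hypothetical override file for this gradual rollout (the variable and file names here are illustrative, not taken from the actual config):

```yaml
# overrides/enable-extended-load.yaml (hypothetical name)
# Passed via clusterloader2's --testoverrides flag; each variable would be
# read in the load config with a pattern like:
#   {{$ENABLE_CONFIGMAPS := DefaultParam .ENABLE_CONFIGMAPS false}}
ENABLE_CONFIGMAPS: true
ENABLE_SECRETS: true
ENABLE_PVS: false
ENABLE_NETWORK_POLICIES: false
```

Keeping every new object behind a default-false parameter like this would let the extended config become the default load config without changing CI behavior until each override is turned on.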

mm4tt avatar Jul 31 '19 11:07 mm4tt

> I checked the Prometheus graphs for that run, and in all cases it looks like the Prometheus-based api-call measurement was broken by single spikes (happening around log rotation):
>
> (...)
>
> This is a known problem with the Prometheus-based api-call latency measurement (it's actually the reason it's currently disabled). janluk and Krzysztof Siedlecki are working on this, and hopefully we'll have a solution soon.

For the record, the fact that the Prometheus-based API call latency SLO was violated may actually be valid. But I understand that SLO violations caused by logrotate are orthogonal to the changes you made.

The Prometheus-based measurement is closer to the SLO definition, and is thus stricter and more prone to violations caused by spikes.

oxddr avatar Jul 31 '19 12:07 oxddr

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot avatar Dec 25 '19 08:12 fejta-bot

/remove-lifecycle stale

We hope to get back to this in Q1 2020. In general, this is done except for NetworkPolicies, which are also implemented, but we need to resolve some issues in Calico before enabling them.

mm4tt avatar Dec 27 '19 09:12 mm4tt

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot avatar Mar 26 '20 10:03 fejta-bot

/remove-lifecycle stale
/lifecycle frozen

oxddr avatar Mar 26 '20 11:03 oxddr