perf-tests
Scalability load test extended to exercise Deployments, DaemonSets, StatefulSets, Jobs, PersistentVolumes, Secrets, ConfigMaps, NetworkPolicies
Implemented
- [x] Deployments
- [x] DaemonSets
- [x] StatefulSets
- [x] Jobs
- [x] Persistent Volumes
- [x] Secrets
- [x] ConfigMaps
- [ ] NetworkPolicies - WIP https://github.com/kubernetes/perf-tests/pull/719, https://github.com/kubernetes/test-infra/pull/13709
Enabled in CI/CD
- [x] Deployments
- [ ] DaemonSets
- [x] StatefulSets
- [x] Jobs
- [ ] Persistent Volumes
- [x] Secrets
- [x] ConfigMaps
- [ ] NetworkPolicies
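For context on what "exercising" a new object type means here: ClusterLoader2 drives object churn from declarative YAML templates in the load config, so the snippet below is only a hand-written client-go sketch of the equivalent create/delete churn for one of the newly added types (ConfigMaps). The `loadsketch` package name, the `churnConfigMaps` helper, and the assumption of an already-constructed clientset are illustrative only; this is not the actual test code.

```go
package loadsketch

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// churnConfigMaps creates and then deletes count ConfigMaps in namespace,
// mimicking the create/delete phases the extended load scenario adds for
// ConfigMaps. The clientset is assumed to be built elsewhere (e.g. from a
// kubeconfig) and the namespace is assumed to already exist.
func churnConfigMaps(ctx context.Context, client kubernetes.Interface, namespace string, count int) error {
	for i := 0; i < count; i++ {
		cm := &corev1.ConfigMap{
			ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("load-test-cm-%d", i)},
			Data:       map[string]string{"key": "value"},
		}
		if _, err := client.CoreV1().ConfigMaps(namespace).Create(ctx, cm, metav1.CreateOptions{}); err != nil {
			return err
		}
	}
	for i := 0; i < count; i++ {
		name := fmt.Sprintf("load-test-cm-%d", i)
		if err := client.CoreV1().ConfigMaps(namespace).Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```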
/assign
I ran a 5K-node test yesterday using the extended load scenario with Secrets, ConfigMaps, StatefulSets, and PVs enabled. The test passed, but the new Prometheus-based api-call latency measurement failed for a few tuples:
```
W0731 02:58:04.461] I0731 02:58:04.459614 10763 api_responsiveness_prometheus.go:90] APIResponsiveness: WARNING Top latency metric: {Resource:persistentvolumes Subresource: Verb:POST Scope:cluster Latency:perc50: 266.666666ms, perc90: 1.375s, perc99: 2.443099999s Count:553}; threshold: 1s
W0731 02:58:04.461] I0731 02:58:04.459626 10763 api_responsiveness_prometheus.go:90] APIResponsiveness: WARNING Top latency metric: {Resource:leases Subresource: Verb:GET Scope:namespace Latency:perc50: 32.018464ms, perc90: 143.243359ms, perc99: 1.663881526s Count:17274382}; threshold: 1s
W0731 02:58:04.461] I0731 02:58:04.459643 10763 api_responsiveness_prometheus.go:90] APIResponsiveness: WARNING Top latency metric: {Resource:services Subresource: Verb:DELETE Scope:namespace Latency:perc50: 1.215999999s, perc90: 1.472s, perc99: 1.4972s Count:8251}; threshold: 1s
W0731 02:58:04.461] I0731 02:58:04.459655 10763 api_responsiveness_prometheus.go:90] APIResponsiveness: WARNING Top latency metric: {Resource:namespaces Subresource: Verb:GET Scope:cluster Latency:perc50: 26.875234ms, perc90: 48.375422ms, perc99: 1.14s Count:5228}; threshold: 1s
W0731 02:58:04.462] I0731 02:58:04.459668 10763 api_responsiveness_prometheus.go:90] APIResponsiveness: WARNING Top latency metric: {Resource:configmaps Subresource: Verb:POST Scope:namespace Latency:perc50: 28.571428ms, perc90: 88.1ms, perc99: 1.02472s Count:18627}; threshold: 1s
```
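To make the failure mode concrete: the measurement groups API calls into (resource, subresource, verb, scope) tuples, estimates per-tuple latency percentiles from Prometheus, and warns for any tuple whose 99th percentile exceeds its threshold (1s in the log above). The Go sketch below is a much-simplified version of that comparison, not the actual api_responsiveness_prometheus.go logic; the struct, the flat threshold, and the `loadsketch` package name are assumptions for illustration.

```go
package loadsketch

import "time"

// apiCallLatency is a simplified stand-in for one (resource, subresource, verb,
// scope) tuple reported by the Prometheus-based measurement.
type apiCallLatency struct {
	Resource, Subresource, Verb, Scope string
	Perc50, Perc90, Perc99             time.Duration
	Count                              int
}

// violations returns the tuples whose 99th-percentile latency exceeds the given
// threshold. The real measurement applies per-verb/per-scope SLO thresholds; a
// single flat threshold (1s in the run above) keeps the sketch short.
func violations(metrics []apiCallLatency, threshold time.Duration) []apiCallLatency {
	var bad []apiCallLatency
	for _, m := range metrics {
		if m.Perc99 > threshold {
			bad = append(bad, m)
		}
	}
	return bad
}
```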
I checked the Prometheus graphs for that run, and it looks like the Prometheus-based api-call latency measurement was broken by single spikes (happening around log rotation) in all cases:
**POST PersistentVolumes**

**DELETE Services**

This is a known problem with the Prometheus-based api-call latency measurement (it's actually the reason it's currently disabled). @oxddr and @krzysied are working on this and hopefully we'll have a solution soon.
So to summarize, the extended load looks very promising. Once we have a solution to the spike problem in the Prometheus-based api-call latency measurement, we should be good (or very close) to enabling it in CI/CD.
After discussing with the team, we agreed that we should be good to enable Secrets and ConfigMaps in the CI/CD tests.
On the other hand it might be tricky, as we currently have a separate experimental config for the extended load. I think it might be easier to first implement everything there, then move it out of the experimental directory and make it the default load config, and then gradually enable the new objects via overrides.
> I checked the Prometheus graphs for that run, and it looks like the Prometheus-based api-call latency measurement was broken by single spikes (happening around log rotation) in all cases:
> (...)
> This is a known problem with the Prometheus-based api-call latency measurement (it's actually the reason it's currently disabled). janluk and Krzysztof Siedlecki are working on this and hopefully we'll have a solution soon.
For the record, the Prometheus-based API call latency SLO violation can actually be valid. But I understand that SLO violations caused by logrotate are orthogonal to the changes you made.
The Prometheus-based measurement is closer to the SLO definition and is therefore stricter and more prone to violations caused by spikes.
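To illustrate why a single short spike is enough to trip it: even if the median stays low, a burst of slow requests that makes up a bit more than 1% of a tuple's traffic pushes its 99th percentile up to the spike latency. Below is a toy Go example with made-up numbers (not data from the run above); Prometheus interpolates within histogram buckets rather than using a nearest-rank estimate, but the spike sensitivity is the same.

```go
package loadsketch

import (
	"sort"
	"time"
)

// percentile returns the p-th percentile (0 < p <= 100) of latencies using a
// simple nearest-rank estimate.
func percentile(latencies []time.Duration, p float64) time.Duration {
	sorted := append([]time.Duration(nil), latencies...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(float64(len(sorted))*p/100+0.5) - 1
	if rank < 0 {
		rank = 0
	}
	if rank >= len(sorted) {
		rank = len(sorted) - 1
	}
	return sorted[rank]
}

// spikeExample: 10,000 requests at 30ms plus 150 requests at 2s during a brief
// spike (e.g. around logrotate). perc50 stays at 30ms, but the spike is ~1.5%
// of the traffic, so perc99 jumps to 2s and trips a 1s threshold.
func spikeExample() (perc50, perc99 time.Duration) {
	latencies := make([]time.Duration, 0, 10150)
	for i := 0; i < 10000; i++ {
		latencies = append(latencies, 30*time.Millisecond)
	}
	for i := 0; i < 150; i++ {
		latencies = append(latencies, 2*time.Second)
	}
	return percentile(latencies, 50), percentile(latencies, 99)
}
```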
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
We hope to get back to this in Q1 2020. In general, this is done except for NetworkPolicies. NetworkPolicies are also implemented, but we need to resolve some issues in Calico before enabling them.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
/lifecycle frozen