
Re-implement scheduling-throughput measurement

Open mm4tt opened this issue 5 years ago • 44 comments

Currently, we have very crude logic for computing scheduling-throughput: we list all pods every 5s and see how many new pods were scheduled in that time. It doesn't work well in small tests where the number of pods is low, as we only get 1-2 representative 5s windows.

We should rewrite it to make it more reliable and accurate, e.g. use a Prometheus query to monitor the rate of pod bindings.
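For illustration, here is a minimal sketch (in Go, using the Prometheus API client) of what such a query-based measurement could look like. The metric name `scheduler_schedule_attempts_total`, the Prometheus address, and the 1m rate window are assumptions for this sketch, not necessarily what the final measurement will use:

```go
// Sketch: ask Prometheus for the current rate of successful scheduling
// attempts (pods scheduled per second) instead of listing pods ourselves.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumed Prometheus address; in a real test this would point at the
	// Prometheus instance deployed alongside the cluster.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	// Assumed metric: rate of scheduling attempts that ended with result
	// "scheduled", summed across scheduler instances, over a 1-minute window.
	query := `sum(rate(scheduler_schedule_attempts_total{result="scheduled"}[1m]))`

	result, warnings, err := promAPI.Query(context.Background(), query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Printf("current scheduling throughput: %v pods/s\n", result)
}
```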

mm4tt avatar Feb 06 '20 10:02 mm4tt

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar May 06 '20 11:05 fejta-bot

/remove-lifecycle stale

wojtek-t avatar May 06 '20 11:05 wojtek-t

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Aug 04 '20 11:08 fejta-bot

/remove-lifecycle stale

wojtek-t avatar Aug 04 '20 11:08 wojtek-t

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Nov 02 '20 11:11 fejta-bot

/remove-lifecycle stale

/assign @marseel

Since this is already happening - @marseel can you please update the issue with info on what work remains to be done?

wojtek-t avatar Nov 02 '20 17:11 wojtek-t

@wojtek-t: GitHub didn't allow me to assign the following users: marseel.

Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to this:

/remove-lifecycle stale

/assign @marseel

Since this is already happening - @marseel can you please update the issue with info on what work remains to be done?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Nov 02 '20 17:11 k8s-ci-robot

/assign

marseel avatar Nov 12 '20 16:11 marseel

Currently, the new Prometheus metric is already implemented and data is collected for all tests in the master branch. It's worth mentioning that the previous SchedulingThroughput allowed us to gather the metric at a 1s interval, while based on Prometheus we are only able to gather it at a 5s interval. The only thing remaining is turning on alerting for specific tests and removing the old SchedulingThroughput. I've postponed this step because it might have conflicted with merging the density test with the load test.
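As a toy illustration (with made-up numbers, not real test data) of why that interval matters for the reported peak: a burst of bindings that fits inside a single second is averaged away by a 5s window.

```go
// Toy illustration (hypothetical numbers): how the sampling interval changes
// the reported peak scheduling throughput.
package main

import "fmt"

func main() {
	// Hypothetical pod bindings per second over a 10s test:
	// one 1s burst of 200 pods, then a steady 10 pods/s tail.
	bindings := []int{0, 0, 200, 0, 0, 0, 10, 10, 10, 10}

	// Peak at 1s resolution (what a list-based measurement sampling every
	// second could see).
	max1s := 0
	for _, b := range bindings {
		if b > max1s {
			max1s = b
		}
	}

	// Peak at 5s resolution (e.g. a Prometheus scrape every 5s): the burst
	// is averaged over the whole window.
	max5s := 0.0
	for i := 0; i+5 <= len(bindings); i += 5 {
		sum := 0
		for _, b := range bindings[i : i+5] {
			sum += b
		}
		if avg := float64(sum) / 5.0; avg > max5s {
			max5s = avg
		}
	}

	fmt.Printf("peak at 1s resolution: %d pods/s\n", max1s)   // 200
	fmt.Printf("peak at 5s resolution: %.0f pods/s\n", max5s) // 40
}
```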

marseel avatar Nov 12 '20 16:11 marseel

only able to gather it at a 5s interval.

Unless we start scraping metrics more (or less) often. We may potentially consider scraping more often, too.

I've postponed this step because it might have conflicted with merging the density test with the load test.

Which is the right decision :) Though we're almost done with this merging, so we will be able to get back to it.

wojtek-t avatar Nov 12 '20 18:11 wojtek-t

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot avatar Feb 10 '21 19:02 fejta-bot

/remove-lifecycle stale

wojtek-t avatar Feb 10 '21 19:02 wojtek-t

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot avatar May 11 '21 19:05 fejta-bot

/remove-lifecycle stale

wojtek-t avatar May 12 '21 07:05 wojtek-t

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 10 '21 08:08 k8s-triage-robot

/remove-lifecycle stale

wojtek-t avatar Aug 10 '21 08:08 wojtek-t

Hi @marseel, do you know if the old SchedulingThroughput was ever removed or is it still active? I still see the source code in the main branch.

The only thing remaining is turning on alerting for specific tests and removing the old SchedulingThroughput. I've postponed this step because it might have conflicted with merging the density test with the load test.

dmatch01 avatar Oct 08 '21 14:10 dmatch01

The old SchedulingThroughput was never removed, unfortunately. Currently, we are scraping metrics every 5 seconds for Prometheus in the new measurement, so the old SchedulingThroughput has better precision. I'm pretty sure it's possible to replace it; we just need to be really careful. Unfortunately, I haven't had much time to work on it.

marseel avatar Oct 08 '21 14:10 marseel

Thanks @marseel for the quick response. Just to make sure I understand, the scheduling_throughput.go code is still being used to produce SchedulingThroughput metrics, is that correct? Regarding Prometheus, could you provide more details on how, if at all, Prometheus is being used to produce the current SchedulingThroughput metrics? Thank you

The old SchedulingThroughput was never removed, unfortunately. Currently, we are scraping metrics every 5 seconds for Prometheus in the new measurement, so the old SchedulingThroughput has better precision. I'm pretty sure it's possible to replace it; we just need to be really careful. Unfortunately, I haven't had much time to work on it.

dmatch01 avatar Oct 08 '21 14:10 dmatch01

Currently, our load test measures scheduling throughput using both methods. For example, for test run https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1446158765446402048/ you can find both measurements' results (a rough conceptual sketch of the old approach is shown after the links below):

  • "Old" scheduling_throughput.go - https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1446158765446402048/artifacts/SchedulingThroughput_load_2021-10-07T18:24:28Z.json
  • "New" SchedulingThroughput based on Prometheus metrics: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1446158765446402048/artifacts/SchedulingThroughputPrometheus_load_2021-10-07T20:16:40Z.json

marseel avatar Oct 08 '21 14:10 marseel

I see. It is interesting to see that the max differs between the two reports. I have a pretty good understanding of the "Old" SchedulingThroughput after walking through the code. I'll look at the "New" one from Prometheus to understand the details. If you would like me to share my findings, please let me know which forum you prefer, e.g. here in this issue or on Slack (DM or SIG channel). Thanks!

dmatch01 avatar Oct 08 '21 15:10 dmatch01

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 06 '22 16:01 k8s-triage-robot

/remove-lifecycle stale

wojtek-t avatar Jan 10 '22 08:01 wojtek-t

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 10 '22 08:04 k8s-triage-robot

/remove-lifecycle stale

marseel avatar Apr 11 '22 08:04 marseel

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 10 '22 08:07 k8s-triage-robot

/remove-lifecycle stale

wojtek-t avatar Jul 11 '22 05:07 wojtek-t

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 09 '22 06:10 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Nov 08 '22 07:11 k8s-triage-robot

/remove-lifecycle rotten

wojtek-t avatar Nov 21 '22 09:11 wojtek-t