
Re-implement scheduling-throughput measurement

Open mm4tt opened this issue 5 years ago • 44 comments

Currently, we have very crude logic for computing scheduling-throughput: we list all pods every 5s and see how many new pods were scheduled in that time. It doesn't work well in small tests where the number of pods is low, as we only get 1-2 representative 5s windows.

We should rewrite it to make it more reliable and accurate, e.g. use a Prometheus query to monitor the rate of pod bindings.
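For illustration, here is a minimal sketch (in Go, using the Prometheus API client) of what such a query-based measurement could look like. The metric name `scheduler_schedule_attempts_total`, the Prometheus address, and the 1m rate window are assumptions for this sketch, not necessarily what the final measurement will use:

```go
// Sketch: ask Prometheus for the current rate of successful scheduling
// attempts (pods scheduled per second) instead of listing pods ourselves.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumed Prometheus address; in a real test this would point at the
	// Prometheus instance deployed alongside the cluster.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	// Assumed metric: rate of scheduling attempts that ended with result
	// "scheduled", summed across scheduler instances, over a 1-minute window.
	query := `sum(rate(scheduler_schedule_attempts_total{result="scheduled"}[1m]))`

	result, warnings, err := promAPI.Query(context.Background(), query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Printf("current scheduling throughput: %v pods/s\n", result)
}
```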

mm4tt avatar Feb 06 '20 10:02 mm4tt

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar May 06 '20 11:05 fejta-bot

/remove-lifecycle stale

wojtek-t avatar May 06 '20 11:05 wojtek-t

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Aug 04 '20 11:08 fejta-bot

/remove-lifecycle stale

wojtek-t avatar Aug 04 '20 11:08 wojtek-t

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Nov 02 '20 11:11 fejta-bot

/remove-lifecycle stale

/assign @marseel

Since this is already happening - @marseel can you please update the issue with info on what work remains to be done?

wojtek-t avatar Nov 02 '20 17:11 wojtek-t

@wojtek-t: GitHub didn't allow me to assign the following users: marseel.

Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to this:

/remove-lifecycle stale

/assign @marseel

Since this is already happening - @marseel can you please update the issue with info on what work remains to be done?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Nov 02 '20 17:11 k8s-ci-robot

/assign

marseel avatar Nov 12 '20 16:11 marseel

Currently, the new Prometheus metric is already implemented and data is collected for all tests in the master branch. It's worth mentioning that the previous SchedulingThroughput allowed us to gather the metric at a 1s interval, while based on Prometheus we are only able to gather it at a 5s interval. The only thing remaining is turning on alerting for specific tests and removing the old SchedulingThroughput. I've postponed this step because it might have conflicted with merging the density test with the load test.
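As a toy illustration (with made-up numbers, not real test data) of why that interval matters for the reported peak: a burst of bindings that fits inside a single second is averaged away by a 5s window.

```go
// Toy illustration (hypothetical numbers): how the sampling interval changes
// the reported peak scheduling throughput.
package main

import "fmt"

func main() {
	// Hypothetical pod bindings per second over a 10s test:
	// one 1s burst of 200 pods, then a steady 10 pods/s tail.
	bindings := []int{0, 0, 200, 0, 0, 0, 10, 10, 10, 10}

	// Peak at 1s resolution (what a list-based measurement sampling every
	// second could see).
	max1s := 0
	for _, b := range bindings {
		if b > max1s {
			max1s = b
		}
	}

	// Peak at 5s resolution (e.g. a Prometheus scrape every 5s): the burst
	// is averaged over the whole window.
	max5s := 0.0
	for i := 0; i+5 <= len(bindings); i += 5 {
		sum := 0
		for _, b := range bindings[i : i+5] {
			sum += b
		}
		if avg := float64(sum) / 5.0; avg > max5s {
			max5s = avg
		}
	}

	fmt.Printf("peak at 1s resolution: %d pods/s\n", max1s)   // 200
	fmt.Printf("peak at 5s resolution: %.0f pods/s\n", max5s) // 40
}
```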

marseel avatar Nov 12 '20 16:11 marseel

only able to gather it at a 5s interval.

Unless we start scraping metrics more (or less) often. We may potentially consider scraping more often, too.

I've postponed this step because it might have conflicted with merging the density test with the load test.

Which is the right decision :) Though we're almost done with this merging, so we will be able to get back to it.

wojtek-t avatar Nov 12 '20 18:11 wojtek-t

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot avatar Feb 10 '21 19:02 fejta-bot

/remove-lifecycle stale

wojtek-t avatar Feb 10 '21 19:02 wojtek-t

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot avatar May 11 '21 19:05 fejta-bot

/remove-lifecycle stale

wojtek-t avatar May 12 '21 07:05 wojtek-t

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 10 '21 08:08 k8s-triage-robot

/remove-lifecycle stale

wojtek-t avatar Aug 10 '21 08:08 wojtek-t

Hi @marseel, do you know if the old SchedulingThroughput was ever removed or is it still active? I still see the source code in the main branch.

The only thing remaining is turning on alerting for specific tests and removing the old SchedulingThroughput. I've postponed this step because it might have conflicted with merging the density test with the load test.

dmatch01 avatar Oct 08 '21 14:10 dmatch01

The old SchedulingThroughput was never removed, unfortunately. Currently, we are scraping metrics every 5 seconds for Prometheus in the new measurement, so the old SchedulingThroughput has better precision. I'm pretty sure it's possible to replace it; we just need to be really careful. Unfortunately, I haven't had much time to work on it.

marseel avatar Oct 08 '21 14:10 marseel

Thanks @marseel for the quick response. Just to make sure I understand, the scheduling_throughput.go code is still being used to produce SchedulingThroughput metrics, is that correct? Regarding Prometheus, could you provide more details on how, if at all, Prometheus is being used to produce the current SchedulingThroughput metrics? Thank you

The old SchedulingThroughput was never removed, unfortunately. Currently, we are scraping metrics every 5 seconds for Prometheus in the new measurement, so the old SchedulingThroughput has better precision. I'm pretty sure it's possible to replace it; we just need to be really careful. Unfortunately, I haven't had much time to work on it.

dmatch01 avatar Oct 08 '21 14:10 dmatch01

Currently, our load test measures scheduling throughput using both methods. For example, for test run https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1446158765446402048/ you can find both measurements' results (a rough conceptual sketch of the old approach is shown after the links below):

  • "Old" scheduling_throughput.go - https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1446158765446402048/artifacts/SchedulingThroughput_load_2021-10-07T18:24:28Z.json
  • "New" SchedulingThroughput based on Prometheus metrics: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1446158765446402048/artifacts/SchedulingThroughputPrometheus_load_2021-10-07T20:16:40Z.json

marseel avatar Oct 08 '21 14:10 marseel

I see. It is interesting to see that the max differs between the two reports. I have a pretty good understanding of the "Old" SchedulingThroughput after walking through the code. I'll look at the "New" one from Prometheus to understand the details. If you would like me to share my findings, please let me know which forum you prefer, e.g. here in this issue or on Slack (DM or SIG channel). Thanks!

dmatch01 avatar Oct 08 '21 15:10 dmatch01

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 06 '22 16:01 k8s-triage-robot

/remove-lifecycle stale

wojtek-t avatar Jan 10 '22 08:01 wojtek-t

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 10 '22 08:04 k8s-triage-robot

/remove-lifecycle stale

marseel avatar Apr 11 '22 08:04 marseel

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 10 '22 08:07 k8s-triage-robot

/remove-lifecycle stale

wojtek-t avatar Jul 11 '22 05:07 wojtek-t

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 09 '22 06:10 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Nov 08 '22 07:11 k8s-triage-robot

/remove-lifecycle rotten

wojtek-t avatar Nov 21 '22 09:11 wojtek-t