
configure pluginMetricsSamplePercent or CycleState.recordPluginMetrics from outside of the scheduler

sanposhiho opened this issue 2 years ago • 9 comments

What would you like to be added?

We have pluginMetricsSamplePercent, which indicates the percentage of plugin metrics to be sampled. But it's a const, and we cannot change the value from outside. https://github.com/kubernetes/kubernetes/blob/95e30f66c300c76ce21c0ca0e8bc4bf4a45e028f/pkg/scheduler/scheduler.go#L65

It would be nice if there was a way to set this externally. There are several possible ways to achieve it:

  • add pluginMetricsSamplePercent to KubeSchedulerConfiguration.
    • we would also need to implement a new Option to set pluginMetricsSamplePercent.
  • create a new Option to pass a function that sets CycleState.recordPluginMetrics in each scheduling cycle (see the sketch after this list).
    • the default should be rand.Intn(100) < pluginMetricsSamplePercent, so as not to break the current behavior.
  • add a function that sets CycleState.recordPluginMetrics in each scheduling cycle to the Scheduler struct, like NextPod or Error.
  • just change pluginMetricsSamplePercent to a var and expose it so that its value can be changed from other packages.
  • (... or do you have another good way?)
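
For the Option-based approach, here is a minimal sketch of what it could look like, following the scheduler's existing functional-option pattern (e.g. WithPercentageOfNodesToScore). WithPluginMetricsSamplePercent and this trimmed-down schedulerOptions are illustrative only, not upstream code:

```go
package scheduler

import "math/rand"

// schedulerOptions mirrors a small subset of the scheduler's private options
// struct; the real struct has many more fields.
type schedulerOptions struct {
	pluginMetricsSamplePercent int
}

// Option configures a Scheduler, matching the existing functional-option style.
type Option func(*schedulerOptions)

// WithPluginMetricsSamplePercent is a hypothetical option that would let a
// caller override the sampling rate, which today is a const fixed at 10.
func WithPluginMetricsSamplePercent(percent int) Option {
	return func(o *schedulerOptions) {
		o.pluginMetricsSamplePercent = percent
	}
}

// shouldRecordPluginMetrics keeps the current per-cycle decision,
// rand.Intn(100) < pluginMetricsSamplePercent, so behavior is unchanged
// when the option is not used and the default of 10 applies.
func (o *schedulerOptions) shouldRecordPluginMetrics() bool {
	return rand.Intn(100) < o.pluginMetricsSamplePercent
}
```

A consumer like kube-scheduler-simulator could then pass WithPluginMetricsSamplePercent(100) when constructing the scheduler.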

/kind feature
/sig scheduling

Why is this needed?

In sigs/kube-scheduler-simulator, we want to set pluginMetricsSamplePercent to 100 to see the metrics of every scheduling cycle. https://github.com/kubernetes-sigs/kube-scheduler-simulator/issues/60

Also, users may want to increase this value for detailed/accurate performance measurement, or set it to 0 when not using the metrics at all.

sanposhiho avatar Mar 23 '22 01:03 sanposhiho

@sanposhiho: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 23 '22 01:03 k8s-ci-robot

Per the discussion on the sig-scheduling Slack thread, the motivation for this feature is to correlate and identify the root cause of spikes in scheduling latency. The potential outcome of this feature is to allow users to deduce a suitable combination of configuration, i.e. percentageOfNodes + a list of plugins.

For example, consider two figures, plugin_duration and framework_schedule_duration, where the framework_schedule_duration figure depicts the scheduling latency (1m bucket) of a single attempt and captures a spike. However, the plugin_duration figure does not capture the corresponding spike. This is probably due to the sampling rate being 10%: any individual attempt has only a 1-in-10 chance of having its plugin metrics recorded.

matthewygf avatar Mar 23 '22 22:03 matthewygf

/cc @Huang-Wei

sanposhiho avatar Mar 30 '22 11:03 sanposhiho

ping ~ are we okay to work on this?

matthewygf avatar May 06 '22 14:05 matthewygf

I think we need to discuss how to achieve this, since we have multiple implementation options.

@Huang-Wei Could you please take a look at this? Which way do you think is better?

sanposhiho avatar May 06 '22 14:05 sanposhiho

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 04 '22 15:08 k8s-triage-robot

This requirement of "correlating scheduling latency spikes with metrics" is valid to me. It's just that I'm wondering whether pluginMetricsSamplePercent is the most efficient way to understand the correlation. Can other existing metrics help? Or enabling profiling?

/remove-lifecycle stale

Huang-Wei avatar Aug 04 '22 17:08 Huang-Wei

> This requirement of "correlating scheduling latency spikes with metrics" is valid to me. It's just that I'm wondering whether pluginMetricsSamplePercent is the most efficient way to understand the correlation. Can other existing metrics help? Or enabling profiling?
>
> /remove-lifecycle stale

At the moment, apart from "PluginExecutionDuration", the metrics "SchedulingAlgorithmLatency" and "SchedulingLatency" report back the total latency of a scheduling decision. However, neither gives a clear picture of plugin-related latency. They do, however, indicate whether it was the binding step or the algorithm that caused the higher latency.

"FrameworkExtensionPointDuration" does tell the extension point total latency, but it may not be granular enough to tell whether a particular plugin causes the latency spike.

I am not too familiar with the enable-profiling flag; it seems that enabling it can help with pprof, but I am not sure whether users would want to enable it in a prod setting.

I will investigate "enable profiling".

Overall, I still advocate that allowing users to configure pluginMetricsSamplePercent is the most efficient way. This is especially true in prod settings, where users have their own custom plugins that call external software to make scheduling decisions. I would love to hear whether there is another way of giving users a bit more insight into plugin latency :)
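
As an interim workaround for the custom-plugin case, nothing stops a plugin from recording its own latency unconditionally, independent of pluginMetricsSamplePercent. A minimal sketch using a plain Prometheus histogram follows; myPluginDuration and timedFilter are hypothetical names, not part of the scheduler framework:

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// myPluginDuration is a plugin-owned histogram, observed on every call rather
// than on the scheduler's ~10% sample of scheduling cycles.
var myPluginDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "my_plugin_filter_duration_seconds",
	Help:    "Latency of the custom plugin's Filter logic.",
	Buckets: prometheus.ExponentialBuckets(0.0001, 2, 12),
})

// timedFilter wraps the plugin's filter body and always records its latency.
func timedFilter(body func() error) error {
	start := time.Now()
	err := body()
	myPluginDuration.Observe(time.Since(start).Seconds())
	return err
}

func main() {
	prometheus.MustRegister(myPluginDuration)
	_ = timedFilter(func() error {
		// ... call out to the external software and make a decision ...
		return nil
	})
}
```

This only covers one's own plugin, though; unlike a configurable pluginMetricsSamplePercent, it does nothing for the built-in plugins.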

matthewygf avatar Aug 07 '22 19:08 matthewygf

@matthewygf I will bring it up at the next SIG meeting. You're welcome to join; it's this Thursday at 10:00 AM PST.

Huang-Wei avatar Aug 10 '22 21:08 Huang-Wei

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Nov 08 '22 21:11 k8s-triage-robot

/remove-lifecycle stale

sanposhiho avatar Nov 08 '22 22:11 sanposhiho

@matthewygf do you want to work on it? If so, please /assign it to yourself. I don't mean to force you to work on this; I can take it over if you'd rather leave it. I just think it's a good first issue for getting involved in Kubernetes :)

sanposhiho avatar Nov 09 '22 01:11 sanposhiho

@sanposhiho Thanks for pinging me! Sure, I will get to work on it. I have been busy and had forgotten about this; nevertheless, I think it's still an important feature for me and the community.

/assign

matthewygf avatar Nov 11 '22 02:11 matthewygf

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 09 '23 03:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Mar 11 '23 04:03 k8s-triage-robot

/remove-lifecycle rotten

matthewygf avatar Mar 18 '23 14:03 matthewygf

@sanposhiho @Huang-Wei Sorry for the delay. I have started working on this issue and created a branch locally. I have taken the direction of adding pluginMetricsSamplePercent to KubeSchedulerConfiguration; however, there are a couple of things I am unsure of.

  1. If I add the field to KubeSchedulerConfiguration, do I need to worry about supporting the option in the conversions to v1beta3 and v1beta2? If so, am I introducing API changes?
  2. Do you think we should add pluginMetricsSamplePercent per scheduler profile, or only as a global setting? (see the sketch after this list)
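
To make question 2 concrete, here is a hypothetical shape for the global variant, modeled on how other top-level knobs such as PercentageOfNodesToScore are declared on the internal config type; the field does not exist upstream:

```go
package config

// KubeSchedulerConfiguration (excerpt) with the proposed field as a global
// setting; all existing fields are elided.
type KubeSchedulerConfiguration struct {
	// ... existing fields ...

	// PluginMetricsSamplePercent is the percentage of scheduling cycles for
	// which per-plugin metrics are recorded. Must be in the range [0, 100].
	// Defaulting it to 10 would preserve today's behavior.
	PluginMetricsSamplePercent int32
}
```

A per-profile field would instead live on KubeSchedulerProfile, next to the plugin configuration it describes, at the cost of slightly more plumbing into each framework instance.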

Thank you very much for the help in advance!

matthewygf avatar Mar 18 '23 14:03 matthewygf

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 16 '23 15:06 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jul 16 '23 15:07 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Jan 19 '24 23:01 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jan 19 '24 23:01 k8s-ci-robot