Configure pluginMetricsSamplePercent or CycleState.recordPluginMetrics from outside the scheduler
What would you like to be added?
We have `pluginMetricsSamplePercent`, which indicates the percentage of plugin metrics to be sampled, but it's a `const` and we cannot change the value from outside:
https://github.com/kubernetes/kubernetes/blob/95e30f66c300c76ce21c0ca0e8bc4bf4a45e028f/pkg/scheduler/scheduler.go#L65
It would be nice if there were a way for this to be configured externally. There are several possible ways to achieve this:
- Add `pluginMetricsSamplePercent` to `KubeSchedulerConfiguration`.
  - We would also need to implement a new `Option` to set `pluginMetricsSamplePercent`.
- Create a new `Option` that passes a function to set `CycleState.recordPluginMetrics` in each scheduling cycle (see the sketch after this list).
  - The default should stay `rand.Intn(100) < pluginMetricsSamplePercent` so as not to break the current behavior.
- Add a function that sets `CycleState.recordPluginMetrics` in each scheduling cycle to a `Scheduler` field, like `NextPod` or `Error`.
- Just change `pluginMetricsSamplePercent` to a `var` and expose it so the value can be changed from another package.
- (... or do you have another good way?)
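To make the first two options concrete, here is a minimal, self-contained sketch of the functional-options direction, loosely modelled on the `Option` pattern pkg/scheduler already uses. `WithPluginMetricsSamplePercent`, `WithRecordPluginMetricsFn`, and the `schedulerOptions` struct below are hypothetical stand-ins, not the real scheduler code:

```go
package main

import (
	"fmt"
	"math/rand"
)

// Stand-in for the scheduler's unexported options struct; not the real type.
type schedulerOptions struct {
	pluginMetricsSamplePercent int
	// recordPluginMetricsFn decides, once per scheduling cycle, what
	// CycleState.recordPluginMetrics would be set to.
	recordPluginMetricsFn func() bool
}

// Option mirrors the functional-options pattern used in pkg/scheduler.
type Option func(*schedulerOptions)

// WithPluginMetricsSamplePercent (hypothetical) overrides the sampling percentage.
func WithPluginMetricsSamplePercent(percent int) Option {
	return func(o *schedulerOptions) {
		o.pluginMetricsSamplePercent = percent
	}
}

// WithRecordPluginMetricsFn (hypothetical) replaces the per-cycle decision entirely.
func WithRecordPluginMetricsFn(fn func() bool) Option {
	return func(o *schedulerOptions) {
		o.recordPluginMetricsFn = fn
	}
}

func newOptions(opts ...Option) *schedulerOptions {
	o := &schedulerOptions{pluginMetricsSamplePercent: 10} // current 10% default
	// Default decision keeps today's behavior: sample roughly 10% of cycles.
	o.recordPluginMetricsFn = func() bool {
		return rand.Intn(100) < o.pluginMetricsSamplePercent
	}
	for _, opt := range opts {
		opt(o)
	}
	return o
}

func main() {
	// e.g. a simulator could force plugin metrics for every cycle:
	o := newOptions(WithRecordPluginMetricsFn(func() bool { return true }))
	fmt.Println(o.recordPluginMetricsFn()) // always true
}
```

With a function-valued option like this, a caller such as the simulator could record metrics for every cycle simply by passing `func() bool { return true }`, without changing the default sampling for anyone else.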
/kind feature
/sig scheduling
Why is this needed?
In sigs/kube-scheduler-simulator, we want to set `pluginMetricsSamplePercent` to 100 to see the metrics of all scheduling attempts:
https://github.com/kubernetes-sigs/kube-scheduler-simulator/issues/60
Also, users may want to increase this value for more detailed/accurate performance measurement, or may want to set it to 0 when not using metrics at all.
@sanposhiho: This issue is currently awaiting triage.
If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label and provide further guidance.
The `triage/accepted` label can be added by org members by writing `/triage accepted` in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Per the discussion on the sig-scheduling Slack thread, the motivation for this feature is to correlate and identify the root cause of spikes in scheduling latency. The potential outcome of this feature is that it allows users to deduce a suitable combination of configuration, i.e. percentageOfNodes + list of plugins.
For example, consider the two figures below, where the framework schedule duration figure depicts the scheduling latency (1m bucket) of a single attempt and captures a spike. However, the plugin duration figure does not capture the relevant spike. This is probably due to the sampling rate being 10%.
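To illustrate why a 10% sample can miss such a spike, here is a rough, self-contained model of the per-cycle sampling gate, based on the `rand.Intn(100) < pluginMetricsSamplePercent` check linked above and the `CycleState.recordPluginMetrics` flag. This is an illustration of the idea, not the scheduler's actual code:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// The const discussed in this issue; 10 means roughly 10% of cycles are sampled.
const pluginMetricsSamplePercent = 10

// cycleState models the per-cycle flag that gates plugin metrics recording.
type cycleState struct {
	recordPluginMetrics bool
}

// runPlugin only observes the plugin's duration when the cycle was sampled,
// so a slow plugin in an unsampled cycle leaves no trace in the histogram.
func runPlugin(state *cycleState, name string, plugin func()) {
	if !state.recordPluginMetrics {
		plugin()
		return
	}
	start := time.Now()
	plugin()
	fmt.Printf("observed %s duration: %v\n", name, time.Since(start))
}

func main() {
	for cycle := 0; cycle < 5; cycle++ {
		// The sampling decision is made once per scheduling cycle.
		state := &cycleState{recordPluginMetrics: rand.Intn(100) < pluginMetricsSamplePercent}
		runPlugin(state, "Filter/NodeAffinity", func() { time.Sleep(time.Millisecond) })
	}
}
```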
/cc @Huang-Wei
ping ~ are we okay to work on this?
I think we need to discuss the way to achieve this since we have multiple options for implementation.
@Huang-Wei Could you please take a look at this? Which way do you think is better?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
This requirement of "correlating scheduling latency spikes with metrics" is valid to me. It's just I'm wondering if `pluginMetricsSamplePercent` is the most efficient way to understand the correlation? Can other existing metrics help? Or, enable profiling?
/remove-lifecycle stale
> This requirement of "correlating scheduling latency spikes with metrics" is valid to me. It's just I'm wondering if `pluginMetricsSamplePercent` is the most efficient way to understand the correlation? Can other existing metrics help? Or, enable profiling?
> /remove-lifecycle stale
At the moment, apart from PluginExecutionDuration, the metrics SchedulingAlgorithmLatency and schedulingLatency report back the total latency of a scheduling decision. However, neither gives a clear picture of plugin-related latency. They do, however, indicate whether it was the binding step or the algorithm that caused the higher latency.
FrameworkExtensionPointDuration does tell the total latency of an extension point, but it may not be granular enough to tell whether a particular plugin caused the latency spike.
I am not too familiar with the enable-profiling flag; it seems that enabling it can help with pprof, but I am not sure users would want to enable it in a prod setting. Will investigate "enable profiling".
Overall, I still advocate that allowing users to configure `pluginMetricsSamplePercent` is the most efficient way. This is especially the case in prod settings, where users have their own custom plugins that call external software to make scheduling decisions. Would love to hear more if there is another way of allowing users to have a bit more insight into plugin latency :)
@matthewygf I will bring it to the next sig meeting. You're welcome to join, it's this Thursday 10:00 AM PST.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
@matthewygf do you want to work on it? If so, please `/assign` it to yourself.
I don't mean to force you to work on this; I can follow up on it if you want to leave it. I just think it's a good first issue for you to get involved in Kubernetes :)
@sanposhiho Thanks for pinging me! Sure, I will get to work on it. I have been busy and forgot about this; nevertheless, I think it's still an important feature for me and the community.
/assign
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
@sanposhiho @Huang-Wei Sorry for the delay. I have started working on this issue and created a branch locally. I have taken the direction of adding `pluginMetricsSamplePercent` to `KubeSchedulerConfiguration`; however, there are a couple of things I am unsure of:
- If I add the field to KubeSchedulerConfiguration, do I need to worry about supporting the option in the conversions to v1beta3 and v1beta2? If so, am I introducing API changes?
- Do you think we should add `pluginMetricsSamplePercent` per scheduler profile, or only as a global setting?

Thank you very much for the help in advance!
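As a rough illustration of the first question: one common shape for adding such a field to a versioned config API is an optional pointer plus a defaulting step, so that "unset" can be distinguished from an explicit 0. The sketch below is purely hypothetical; the struct and function names are stand-ins, not the real KubeSchedulerConfiguration types or conversion code:

```go
package main

import "fmt"

// Hypothetical default matching today's 10% sampling behavior.
const defaultPluginMetricsSamplePercent = int32(10)

type schedulerConfiguration struct {
	// PluginMetricsSamplePercent is a pointer so that an unset field can be
	// distinguished from an explicit 0 (i.e. "never record plugin metrics").
	PluginMetricsSamplePercent *int32
}

// setDefaults fills in the sampling percentage when the user left it unset,
// preserving the current behavior for existing configurations.
func setDefaults(cfg *schedulerConfiguration) {
	if cfg.PluginMetricsSamplePercent == nil {
		v := defaultPluginMetricsSamplePercent
		cfg.PluginMetricsSamplePercent = &v
	}
}

func main() {
	cfg := &schedulerConfiguration{}
	setDefaults(cfg)
	fmt.Println(*cfg.PluginMetricsSamplePercent) // 10
}
```

Whether the field should live globally or per profile is exactly the open question above; a pointer-with-default shape like this would work in either place.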
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.