scheduler-plugins
coscheduling queue sort plugin starves pods
Currently the coscheduling plugin uses InitialAttemptTimestamp to compare pods of the same priority. If there are enough pods with an early InitialAttemptTimestamp that cannot be scheduled, then pods with a later InitialAttemptTimestamp get starved: the scheduler will never attempt to schedule them, because it re-queues the "early" pods before the "later" pods are attempted. The default scheduler uses the time at which a pod was inserted into the queue, so this situation cannot occur there.
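To illustrate the difference, here is a minimal, self-contained sketch; the queuedPod type and both comparison functions below are simplified stand-ins for the real framework types and the plugin's actual Less implementation (which also compares priority), not copies of them:

```go
package main

import (
	"fmt"
	"time"
)

// queuedPod is a simplified stand-in for the scheduler's queued pod info,
// keeping only the two timestamps relevant to this issue.
type queuedPod struct {
	name string
	// initialAttemptTimestamp is fixed at the pod's first scheduling attempt
	// and never refreshed (the value the coscheduling sort keys on).
	initialAttemptTimestamp time.Time
	// queueTimestamp is refreshed every time the pod is (re-)added to the
	// active queue (the value the default sort keys on).
	queueTimestamp time.Time
}

// coschedulingLess mirrors the behavior described above: a pod with an
// earlier InitialAttemptTimestamp always sorts ahead, even after it has
// failed and been re-queued many times.
func coschedulingLess(a, b queuedPod) bool {
	return a.initialAttemptTimestamp.Before(b.initialAttemptTimestamp)
}

// defaultLess mirrors the default behavior: a re-queued pod gets a fresh
// queue timestamp, so newly created pods are not permanently outranked.
func defaultLess(a, b queuedPod) bool {
	return a.queueTimestamp.Before(b.queueTimestamp)
}

func main() {
	t0 := time.Now()
	// An old unschedulable pod that was just re-queued, and a newer pod
	// that entered the queue a minute after the old pod's first attempt.
	old := queuedPod{name: "old-unschedulable", initialAttemptTimestamp: t0, queueTimestamp: t0.Add(2 * time.Minute)}
	fresh := queuedPod{name: "new-pod", initialAttemptTimestamp: t0.Add(1 * time.Minute), queueTimestamp: t0.Add(1 * time.Minute)}

	fmt.Println("coscheduling sort puts old pod first:", coschedulingLess(old, fresh)) // true, forever
	fmt.Println("default sort puts new pod first:     ", defaultLess(fresh, old))      // true, once the old pod is re-queued
}
```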
This sounds like a reasonable optimization. @denkensk @cwdsuzhou thoughts?
@mateuszlitwin @Huang-Wei We talked about this at the beginning: https://github.com/kubernetes/enhancements/pull/1463#discussion_r376465798. If we use the LastFailureTimestamp as the normal scheduler does, it will lead to undefined behavior in the heap.
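For illustration, a small self-contained sketch of why a sort key that changes while an item is already queued can break heap ordering; the item and podHeap types here are hypothetical stand-ins, not the scheduler's actual activeQ implementation:

```go
package main

import (
	"container/heap"
	"fmt"
)

// item models a queued pod whose sort key is mutable
// (roughly what keying on LastFailureTimestamp would amount to).
type item struct {
	name string
	key  int
}

type podHeap []*item

func (h podHeap) Len() int            { return len(h) }
func (h podHeap) Less(i, j int) bool  { return h[i].key < h[j].key }
func (h podHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *podHeap) Push(x interface{}) { *h = append(*h, x.(*item)) }
func (h *podHeap) Pop() interface{} {
	old := *h
	n := len(old)
	it := old[n-1]
	*h = old[:n-1]
	return it
}

func main() {
	h := &podHeap{}
	heap.Init(h)
	a := &item{name: "pod-a", key: 1}
	b := &item{name: "pod-b", key: 2}
	c := &item{name: "pod-c", key: 3}
	heap.Push(h, a)
	heap.Push(h, b)
	heap.Push(h, c)

	// Mutating the key of an element that is already in the heap (e.g. a
	// "last failure" time being bumped) without re-establishing the heap
	// invariant leaves the ordering inconsistent with the comparator.
	a.key = 10

	for h.Len() > 0 {
		fmt.Println(heap.Pop(h).(*item).name)
	}
	// With this sequence, pod-a still pops first even though its key is
	// now the largest: the pop order no longer matches Less.
}
```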
Ah true, I failed to notice that point.
This is because the scheduler re-queues "early" pods before "later" pods are attempted.
@mateuszlitwin A failed PodGroup with an earlier timestamp will go through an internal backoff period, so the later PodGroup should actually be able to get scheduled, shouldn't it? If not, are you able to compose a simple test case to simulate this starvation?
@mateuszlitwin @Huang-Wei We talked about this at the beginning: kubernetes/enhancements#1463 (comment). If we use the LastFailureTimestamp as the normal scheduler does, it will lead to undefined behavior in the heap.
+1 for this
Might be hard to design a simple test.
The issue occurred multiple times in a production environment where we had hundreds of pods pending and thousands of nodes to check. I observed that newer, recently created pods were not attempted by the scheduler (based on the lack of scheduling events and relevant logs), while older pods were attempted on a regular basis (but could not be scheduled because of their scheduling constraints), at least once every sync. The issue went away when I disabled the coscheduling queue sort.
Maybe a test like this would reproduce the issue:
- create, say, 500 pods that are unschedulable
- then create a single pod that could be scheduled (its timestamp in the queue will be greater than that of the other 500 pods)
- generate some fake events in the cluster to move pods from backoff/unschedulable queues back to the active queue
I am not familiar with all the details of how queuing works in the scheduler, but AFAIK certain events can put all pending pods back into the active queue, which could lead to the starvation I described, where old unschedulable pods always go to the front of the active queue and starve pods that have been in the queue for a long time. Isn't the periodic flush/sync such an event, for example?
The coscheduling plugin's queue sort is not compatible with the default sort. That is especially problematic because all scheduler profiles need to use the same sorting plugin; that is, all profiles (e.g. the default profile) are in fact forced to use the coscheduling sort if coscheduling is enabled.
Maybe with more customization for the queue plugin we could improve it?
while older pods were attempted on a regular basis (but could not be scheduled because of their scheduling constraints), at least once every sync.
OK, it sounds like a head-of-line blocking problem. Have you tried increasing the backoff and flush settings to mitigate the symptom? (I know it's just a mitigation :))
The coscheduling plugin's queue sort is not compatible with the default sort. That is especially problematic because all scheduler profiles need to use the same sorting plugin; that is, all profiles (e.g. the default profile) are in fact forced to use the coscheduling sort if coscheduling is enabled.
I totally understand the pain point here.
The queue sort design of coscheduling is that we want a group of Pods to be treated as a unit to achieve higher efficiency, which is essential in a highly utilized cluster. The vanilla default scheduler just schedules pod by pod, so every time a Pod gets re-queued it doesn't need to consider its "sibling" pods and can renew its enqueue time as a new item; coscheduling cannot do that, which is the awkward part.
Maybe with more customization for the queue plugin we could improve it?
We have had some discussions upstream as well as in this repo. I'm not sure I have the bandwidth to drive this in the near future. It would be much appreciated if anyone is interested in driving the design & implementation.
We have had some discussions upstream as well as in this repo. I'm not sure I have the bandwidth to drive this in the near future. It would be much appreciated if anyone is interested in driving the design & implementation.
Actually, we have a similar feature request about exposing more funcs in frameWorkHandler to ensure that pods belonging to the same group are sorted together in the ActiveQueue.
@Huang-Wei do you have some links to the previous discussions?
@mateuszlitwin Upstream is attempting (very likely I will drive this in 1.21) to provide some efficient queueing mechanics so that developers can control pod enqueuing behavior in a fine-grained manner.
Here are some references:
- https://docs.google.com/document/d/1Dw1qPi4eryllSv0F419sKVbGXiPvSv6N_mJd6AVSg74/edit#
- https://github.com/kubernetes/kubernetes/pull/92206#issuecomment-662609698
- Avoid moving pods out of unschedulable status unconditionally
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/kind bug /priority critical-urgent
I think it's still outstanding. I came across this when testing the v0.19.8 image. Here are the steps to reproduce:
- Prepare a PodGroup with minMember=3
- Create a deployment with replicas=2
- Wait for the two pods of the deployment to be pending
- Scale up the deployment to 3 replicas
- It's not uncommon that the 3 pods get into a starving state and remain unscheduled over time.
Thanks @Huang-Wei, I will test and try to reproduce the problem.
/assign @denkensk
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
In response to this:
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
@cwdsuzhou: Reopened this issue.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This issue can be resolved now since the bump to 1.22.
I will open a PR to do this.
Yes, the upstream scheduler framework surfaced an option to move pods back to the activeQ: https://github.com/kubernetes/kubernetes/pull/103383. The coscheduling plugin can leverage that to move pods proactively.
/milestone v1.22 /assign
@Huang-Wei @cwdsuzhou Thank you for working on this.
Can you elaborate on how the new feature can be used to address this issue? If I understand correctly, the new feature allows us to put Pods into the queue more often. However, I think the root of the problem is how pods are sorted in the active queue, not when/whether pods are inserted into the queue. For example, if we could guarantee that PodGroup pods were sorted by the time of the PodGroup's last schedule attempt, then Pods (of the same priority) would not be starved.
@mateuszlitwin you're right, the feature is more focused on the inefficiency (or starvation) issue where different portions of a PodGroup may not converge in the activeQ. So it's for one single PodGroup, while your issue is about multiple PodGroups. But the feature can, to some extent, help, as it can efficiently schedule the early-queued PodGroup and hence not block the later-queued PodGroups.
if we could guarantee that PodGroup pods were sorted by the time of the PodGroup's last schedule attempt,
This is the most ideal way; it may be complicated in terms of implementation, but I think it's doable. We need to track when the pods belonging to a PodGroup have all finished their first attempt; then we can requeue the pods with a refreshed timestamp so that later-queued PodGroups get their scheduling chance. It's like using a structure to "virtually" queue a PodGroup.
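To make that concrete, here is a rough, self-contained sketch of the idea; the groupAttemptTracker type, its methods, and the group keys are hypothetical illustrations under the assumptions above, not an actual implementation:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// groupAttemptTracker is a hypothetical helper (not part of the plugin today).
// It records, per PodGroup, when the group's last full scheduling attempt
// finished; the refreshed value would be applied as the pods are re-queued,
// so the queue sort keys on it instead of the fixed InitialAttemptTimestamp.
type groupAttemptTracker struct {
	mu          sync.Mutex
	lastAttempt map[string]time.Time // key: "<namespace>/<podgroup-name>"
}

func newGroupAttemptTracker() *groupAttemptTracker {
	return &groupAttemptTracker{lastAttempt: map[string]time.Time{}}
}

// markAttempted would be called once every pod of the group has gone through
// its scheduling attempt.
func (t *groupAttemptTracker) markAttempted(groupKey string, at time.Time) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.lastAttempt[groupKey] = at
}

// sortKey returns the timestamp used for queue ordering: the group's last
// attempt time if it has one, otherwise the group's initial timestamp.
func (t *groupAttemptTracker) sortKey(groupKey string, initial time.Time) time.Time {
	t.mu.Lock()
	defer t.mu.Unlock()
	if ts, ok := t.lastAttempt[groupKey]; ok {
		return ts
	}
	return initial
}

func main() {
	tracker := newGroupAttemptTracker()
	t0 := time.Now()

	// Group A was created first but has already failed one full attempt, so
	// its sort key is refreshed; group B (created later, never attempted)
	// now sorts ahead of it and is no longer starved.
	tracker.markAttempted("default/group-a", t0.Add(2*time.Minute))
	keyA := tracker.sortKey("default/group-a", t0)
	keyB := tracker.sortKey("default/group-b", t0.Add(1*time.Minute))
	fmt.Println("group B sorts before group A:", keyB.Before(keyA)) // true
}
```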
Any other thoughts are very welcome.
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten