scheduler-plugins
coscheduling queue sort plugin starves pods
Currently the coscheduling plugin uses InitialAttemptTimestamp to compare pods of the same priority. If there are enough pods with an early InitialAttemptTimestamp that cannot be scheduled, then pods with a later InitialAttemptTimestamp get starved: the scheduler will never attempt to schedule them, because it re-queues the "early" pods before the "later" pods are attempted. The default scheduler uses the time at which a pod was inserted into the queue, so this situation cannot occur there.
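To illustrate the difference, here is a minimal, self-contained sketch; the queuedPod type and both comparison functions below are simplified stand-ins for the real framework types and the plugin's actual Less implementation (which also compares priority), not copies of them:

```go
package main

import (
	"fmt"
	"time"
)

// queuedPod is a simplified stand-in for the scheduler's queued pod info,
// keeping only the two timestamps relevant to this issue.
type queuedPod struct {
	name string
	// initialAttemptTimestamp is fixed at the pod's first scheduling attempt
	// and never refreshed (the value the coscheduling sort keys on).
	initialAttemptTimestamp time.Time
	// queueTimestamp is refreshed every time the pod is (re-)added to the
	// active queue (the value the default sort keys on).
	queueTimestamp time.Time
}

// coschedulingLess mirrors the behavior described above: a pod with an
// earlier InitialAttemptTimestamp always sorts ahead, even after it has
// failed and been re-queued many times.
func coschedulingLess(a, b queuedPod) bool {
	return a.initialAttemptTimestamp.Before(b.initialAttemptTimestamp)
}

// defaultLess mirrors the default behavior: a re-queued pod gets a fresh
// queue timestamp, so newly created pods are not permanently outranked.
func defaultLess(a, b queuedPod) bool {
	return a.queueTimestamp.Before(b.queueTimestamp)
}

func main() {
	t0 := time.Now()
	// An old unschedulable pod that was just re-queued, and a newer pod
	// that entered the queue a minute after the old pod's first attempt.
	old := queuedPod{name: "old-unschedulable", initialAttemptTimestamp: t0, queueTimestamp: t0.Add(2 * time.Minute)}
	fresh := queuedPod{name: "new-pod", initialAttemptTimestamp: t0.Add(1 * time.Minute), queueTimestamp: t0.Add(1 * time.Minute)}

	fmt.Println("coscheduling sort puts old pod first:", coschedulingLess(old, fresh)) // true, forever
	fmt.Println("default sort puts new pod first:     ", defaultLess(fresh, old))      // true, once the old pod is re-queued
}
```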
This sounds like a reasonable optimization. @denkensk @cwdsuzhou thoughts?
@mateuszlitwin @Huang-Wei We talked about this at the beginning: https://github.com/kubernetes/enhancements/pull/1463#discussion_r376465798. If we use the LastFailureTimestamp as the normal scheduler does, it will lead to undefined behavior in the heap.
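For illustration, a small self-contained sketch of why a sort key that changes while an item is already queued can break heap ordering; the item and podHeap types here are hypothetical stand-ins, not the scheduler's actual activeQ implementation:

```go
package main

import (
	"container/heap"
	"fmt"
)

// item models a queued pod whose sort key is mutable
// (roughly what keying on LastFailureTimestamp would amount to).
type item struct {
	name string
	key  int
}

type podHeap []*item

func (h podHeap) Len() int            { return len(h) }
func (h podHeap) Less(i, j int) bool  { return h[i].key < h[j].key }
func (h podHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *podHeap) Push(x interface{}) { *h = append(*h, x.(*item)) }
func (h *podHeap) Pop() interface{} {
	old := *h
	n := len(old)
	it := old[n-1]
	*h = old[:n-1]
	return it
}

func main() {
	h := &podHeap{}
	heap.Init(h)
	a := &item{name: "pod-a", key: 1}
	b := &item{name: "pod-b", key: 2}
	c := &item{name: "pod-c", key: 3}
	heap.Push(h, a)
	heap.Push(h, b)
	heap.Push(h, c)

	// Mutating the key of an element that is already in the heap (e.g. a
	// "last failure" time being bumped) without re-establishing the heap
	// invariant leaves the ordering inconsistent with the comparator.
	a.key = 10

	for h.Len() > 0 {
		fmt.Println(heap.Pop(h).(*item).name)
	}
	// With this sequence, pod-a still pops first even though its key is
	// now the largest: the pop order no longer matches Less.
}
```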
Ah true, I failed to notice that point.
This is because the scheduler re-queues "early" pods before "later" pods are attempted.
@mateuszlitwin A failed PodGroup with an earlier timestamp will go through an internal backoff period, so the later PodGroup should actually be able to get scheduled, shouldn't it? If not, are you able to compose a simple test case to simulate this starvation?
@mateuszlitwin @Huang-Wei We talked about this at the beginning: kubernetes/enhancements#1463 (comment). If we use the LastFailureTimestamp as the normal scheduler does, it will lead to undefined behavior in the heap.
+1 for this
Might be hard to design a simple test.
The issue occurred multiple times in a production environment where we had hundreds of pods pending and thousands of nodes to check. I observed that newer, recently created pods were not attempted by the scheduler (based on the lack of scheduling events and relevant logs), while older pods were attempted on a regular basis (but could not be scheduled because of their scheduling constraints), at least once every sync. The issue went away when I disabled the coscheduling queue sort.
Maybe a test like this would reproduce the issue:
- create, say, 500 pods that are unschedulable
- then create a single pod that could be scheduled (its timestamp in the queue will be greater than that of the other 500 pods)
- generate some fake events in the cluster to move pods from backoff/unschedulable queues back to the active queue
I am not familiar with all the details of how queuing works in the scheduler, but AFAIK certain events can put all pending pods back into the active queue, which could lead to the starvation I described, where old unschedulable pods always go to the front of the active queue and starve pods that have been in the queue for a long time. Isn't the periodic flush/sync such an event, for example?
The coscheduling plugin's queue sort is not compatible with the default sort. That is especially problematic because all scheduler profiles need to use the same sorting plugin; that is, all profiles (e.g. the default profile) are in fact forced to use the coscheduling sort if coscheduling is enabled.
Maybe with more customization for the queue plugin we could improve it?
while older pods were attempted on a regular basis (but could not be scheduled because of their scheduling constraints), at least once every sync.
OK, it sounds like a head-of-line blocking problem. Have you tried increasing the backoff and flush settings to mitigate the symptom? (I know it's just a mitigation :))
The coscheduling plugin's queue sort is not compatible with the default sort. That is especially problematic because all scheduler profiles need to use the same sorting plugin; that is, all profiles (e.g. the default profile) are in fact forced to use the coscheduling sort if coscheduling is enabled.
I totally understand the pain point here.
The queue sort design of coscheduling is that we want a group of Pods to be treated as a unit to achieve higher efficiency, which is essential in a highly utilized cluster. The vanilla default scheduler just schedules pod by pod, so every time a Pod gets re-queued it doesn't need to consider its "sibling" pods and can renew its enqueue time as a new item; coscheduling cannot do that, which is the awkward part.
Maybe with more customization for the queue plugin we could improve it?
We have had some discussions upstream as well as in this repo. I'm not sure I have the bandwidth to drive this in the near future. It would be much appreciated if anyone is interested in driving the design & implementation.
We have had some discussions upstream as well as in this repo. I'm not sure I have the bandwidth to drive this in the near future. It would be much appreciated if anyone is interested in driving the design & implementation.
Actually, we have a similar feature request about exposing more funcs in frameWorkHandler to ensure that pods belonging to the same group are sorted together in the ActiveQueue.
@Huang-Wei do you have some links to the previous discussions?
@mateuszlitwin Upstream is attempting (very likely I will drive this in 1.21) to provide some efficient queueing mechanics so that developers can control pod enqueuing behavior in a fine-grained manner.
Here are some references:
- https://docs.google.com/document/d/1Dw1qPi4eryllSv0F419sKVbGXiPvSv6N_mJd6AVSg74/edit#
- https://github.com/kubernetes/kubernetes/pull/92206#issuecomment-662609698
- Avoid moving pods out of unschedulable status unconditionally
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/kind bug /priority critical-urgent
I think it's still outstanding. I came across this when testing the v0.19.8 image. Here are the steps to reproduce:
- Prepare a PodGroup with minMember=3
- Create a deployment with replicas=2
- Wait for the two pods of the deployment to be pending
- Scale up the deployment to 3 replicas
- It's not uncommon that the 3 pods get into a starving state and remain unscheduled over time.
Thanks @Huang-Wei, I will test and try to reproduce the problem.
/assign @denkensk
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
In response to this:
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
@cwdsuzhou: Reopened this issue.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This issue can be resolved now since the bump to 1.22.
I will open a PR to do this.
Yes, the upstream scheduler framework surfaced an option to move pods back to the activeQ: https://github.com/kubernetes/kubernetes/pull/103383. The coscheduling plugin can leverage that to move pods proactively.
/milestone v1.22 /assign
@Huang-Wei @cwdsuzhou Thank you for working on this.
Can you elaborate on how the new feature can be used to address this issue? If I understand correctly, the new feature allows us to put Pods into the queue more often. However, I think the root of the problem is how pods are sorted in the active queue, not when/whether pods are inserted into the queue. For example, if we could guarantee that PodGroup pods were sorted by the time of the PodGroup's last schedule attempt, then Pods (of the same priority) would not be starved.
@mateuszlitwin you're right, the feature is more focused on the inefficiency (or starvation) issue where different portions of a PodGroup may not converge in the activeQ. So it's for one single PodGroup, while your issue is about multiple PodGroups. But the feature can, to some extent, help, as it can efficiently schedule the early-queued PodGroup and hence not block the later-queued PodGroups.
if we could guarantee that PodGroup pods were sorted by the time of the PodGroup's last schedule attempt,
This is the most ideal way; it may be complicated in terms of implementation, but I think it's doable. We need to track when the pods belonging to a PodGroup have all finished their first attempt; then we can requeue the pods with a refreshed timestamp so that later-queued PodGroups get their scheduling chance. It's like using a structure to "virtually" queue a PodGroup.
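To make that concrete, here is a rough, self-contained sketch of the idea; the groupAttemptTracker type, its methods, and the group keys are hypothetical illustrations under the assumptions above, not an actual implementation:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// groupAttemptTracker is a hypothetical helper (not part of the plugin today).
// It records, per PodGroup, when the group's last full scheduling attempt
// finished; the refreshed value would be applied as the pods are re-queued,
// so the queue sort keys on it instead of the fixed InitialAttemptTimestamp.
type groupAttemptTracker struct {
	mu          sync.Mutex
	lastAttempt map[string]time.Time // key: "<namespace>/<podgroup-name>"
}

func newGroupAttemptTracker() *groupAttemptTracker {
	return &groupAttemptTracker{lastAttempt: map[string]time.Time{}}
}

// markAttempted would be called once every pod of the group has gone through
// its scheduling attempt.
func (t *groupAttemptTracker) markAttempted(groupKey string, at time.Time) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.lastAttempt[groupKey] = at
}

// sortKey returns the timestamp used for queue ordering: the group's last
// attempt time if it has one, otherwise the group's initial timestamp.
func (t *groupAttemptTracker) sortKey(groupKey string, initial time.Time) time.Time {
	t.mu.Lock()
	defer t.mu.Unlock()
	if ts, ok := t.lastAttempt[groupKey]; ok {
		return ts
	}
	return initial
}

func main() {
	tracker := newGroupAttemptTracker()
	t0 := time.Now()

	// Group A was created first but has already failed one full attempt, so
	// its sort key is refreshed; group B (created later, never attempted)
	// now sorts ahead of it and is no longer starved.
	tracker.markAttempted("default/group-a", t0.Add(2*time.Minute))
	keyA := tracker.sortKey("default/group-a", t0)
	keyB := tracker.sortKey("default/group-b", t0.Add(1*time.Minute))
	fmt.Println("group B sorts before group A:", keyB.Before(keyA)) // true
}
```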
Any other thoughts are very welcome.
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten