fix(scheduler_one): call Done() as soon as possible
What type of PR is this?
/kind feature
What this PR does / why we need it:
Call Done() as soon as possible -- specifically, right after WaitOnPermit.
All cluster events that occur during scheduling are accumulated in inFlightEvents; we need them to decide which queue to push failed Pods to. But after https://github.com/kubernetes/kubernetes/pull/119105, we always push Pods to activeQ/backoffQ when they fail at extension points other than PreFilter, Filter, Reserve, and WaitOnPermit. In other words, once a Pod gets through WaitOnPermit, we no longer need to keep cluster events for it. By calling Done() as soon as possible, we reduce the memory consumed by inFlightEvents.
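The mechanism can be sketched roughly like this. Note this is a simplified model, not the real kube-scheduler API: the `queue` type, its methods, and the event pruning below are illustrative stand-ins for the actual scheduling-queue implementation, just to show why calling Done() earlier frees events sooner.

```go
package main

import (
	"container/list"
	"fmt"
)

// clusterEvent stands in for a cluster event recorded while Pods are in flight.
type clusterEvent string

// queue is a toy model of the scheduling queue: inFlightEvents interleaves
// pod markers (string) with cluster events (clusterEvent). An event only
// needs to be kept while some Pod that was popped before it is still in flight.
type queue struct {
	inFlightPods   map[string]*list.Element // pod name -> its marker in inFlightEvents
	inFlightEvents *list.List
}

func newQueue() *queue {
	return &queue{inFlightPods: map[string]*list.Element{}, inFlightEvents: list.New()}
}

// pop records that a Pod started scheduling by appending its marker.
func (q *queue) pop(pod string) {
	q.inFlightPods[pod] = q.inFlightEvents.PushBack(pod)
}

// observe records a cluster event, but only while at least one Pod is in flight.
func (q *queue) observe(ev clusterEvent) {
	if len(q.inFlightPods) > 0 {
		q.inFlightEvents.PushBack(ev)
	}
}

// done removes the Pod's marker and prunes events at the front of the list
// that no remaining in-flight Pod can need anymore. Calling this right after
// WaitOnPermit (instead of at the very end of the scheduling cycle) lets
// these events be freed earlier.
func (q *queue) done(pod string) {
	elem, ok := q.inFlightPods[pod]
	if !ok {
		return
	}
	delete(q.inFlightPods, pod)
	q.inFlightEvents.Remove(elem)
	for e := q.inFlightEvents.Front(); e != nil; {
		if _, isEvent := e.Value.(clusterEvent); !isEvent {
			break // another Pod's marker; later events may still be needed
		}
		next := e.Next()
		q.inFlightEvents.Remove(e)
		e = next
	}
}

func main() {
	q := newQueue()
	q.pop("pod-a")
	q.observe("node-added")
	q.observe("pod-deleted")
	fmt.Println("list length before Done:", q.inFlightEvents.Len())
	q.done("pod-a") // e.g. right after WaitOnPermit succeeds
	fmt.Println("list length after Done:", q.inFlightEvents.Len())
}
```

The sooner done() runs, the shorter the window during which new cluster events pile up for that Pod, which is exactly the memory saving this PR targets.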
Which issue(s) this PR fixes:
Part of #120622
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Free events cached in the scheduling queue as soon as possible so that the scheduler consumes less memory.
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
This issue is currently awaiting triage.
If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cc @alculquicondor
Will check the test failure tomorrow.
cc @tosi3k
/retest
Can you run some memory profiling based on this, perhaps using scheduling_perf?
@ahg-g can take over the review from me if this requires extra changes on Friday or next week.
@sanposhiho can you rebase this? It would be great to merge something along these lines before the code freeze.
We have an internal suite that we can use to validate whether it helps and report back.
OK, will do tomorrow morning JST.
/remove-lifecycle rotten
I suppose it's a mistake.
Removing the needs-rebase label by hand; the automation did not clear it!
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the PR is closed
You can:
- Reopen this PR with /reopen
- Mark this PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closed this PR.
/remove-lifecycle rotten /reopen
This is still relevant.
@sanposhiho: Reopened this PR.
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the PR is closed
You can:
- Mark this PR as fresh with /remove-lifecycle rotten
- Close this PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
@sanposhiho please rebase!
Rebased. Now that https://github.com/kubernetes/kubernetes/issues/118226 is solved, we're ready to merge this with no concerns at all.
@kubernetes/sig-scheduling-approvers can anyone take a look for the approve?
cc @macsko
Can you possibly take a look? Sorry for mentioning you on my PRs everywhere 😅
/lgtm
LGTM label has been added.
cc @alculquicondor
for /approve
/retest
/release-note-edit
When SchedulerQueueingHints is enabled, clear events cached in the scheduling queue as soon as possible so that the scheduler consumes less memory.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: alculquicondor, sanposhiho
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~pkg/scheduler/OWNERS~~ [alculquicondor,sanposhiho]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.