fix(scheduler_one): call Done() as soon as possible
What type of PR is this?
/kind feature
What this PR does / why we need it:
Call Done() as soon as possible -- specifically, right after WaitOnPermit.
All cluster events that occur during scheduling are accumulated in inFlightEvents; we need them to decide which queue to push failed Pods to. But after https://github.com/kubernetes/kubernetes/pull/119105, we always push Pods to activeQ/backoffQ when they fail at extension points other than PreFilter, Filter, Reserve, and WaitOnPermit. In other words, once a Pod gets through WaitOnPermit, we no longer need to keep cluster events for it. By calling Done() as soon as possible, we reduce the memory consumed by inFlightEvents.
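The mechanism can be sketched roughly like this. Note this is a simplified model, not the real kube-scheduler API: the `queue` type, its methods, and the event pruning below are illustrative stand-ins for the actual scheduling-queue implementation, just to show why calling Done() earlier frees events sooner.

```go
package main

import (
	"container/list"
	"fmt"
)

// clusterEvent stands in for a cluster event recorded while Pods are in flight.
type clusterEvent string

// queue is a toy model of the scheduling queue: inFlightEvents interleaves
// pod markers (string) with cluster events (clusterEvent). An event only
// needs to be kept while some Pod that was popped before it is still in flight.
type queue struct {
	inFlightPods   map[string]*list.Element // pod name -> its marker in inFlightEvents
	inFlightEvents *list.List
}

func newQueue() *queue {
	return &queue{inFlightPods: map[string]*list.Element{}, inFlightEvents: list.New()}
}

// pop records that a Pod started scheduling by appending its marker.
func (q *queue) pop(pod string) {
	q.inFlightPods[pod] = q.inFlightEvents.PushBack(pod)
}

// observe records a cluster event, but only while at least one Pod is in flight.
func (q *queue) observe(ev clusterEvent) {
	if len(q.inFlightPods) > 0 {
		q.inFlightEvents.PushBack(ev)
	}
}

// done removes the Pod's marker and prunes events at the front of the list
// that no remaining in-flight Pod can need anymore. Calling this right after
// WaitOnPermit (instead of at the very end of the scheduling cycle) lets
// these events be freed earlier.
func (q *queue) done(pod string) {
	elem, ok := q.inFlightPods[pod]
	if !ok {
		return
	}
	delete(q.inFlightPods, pod)
	q.inFlightEvents.Remove(elem)
	for e := q.inFlightEvents.Front(); e != nil; {
		if _, isEvent := e.Value.(clusterEvent); !isEvent {
			break // another Pod's marker; later events may still be needed
		}
		next := e.Next()
		q.inFlightEvents.Remove(e)
		e = next
	}
}

func main() {
	q := newQueue()
	q.pop("pod-a")
	q.observe("node-added")
	q.observe("pod-deleted")
	fmt.Println("list length before Done:", q.inFlightEvents.Len())
	q.done("pod-a") // e.g. right after WaitOnPermit succeeds
	fmt.Println("list length after Done:", q.inFlightEvents.Len())
}
```

The sooner done() runs, the shorter the window during which new cluster events pile up for that Pod, which is exactly the memory saving this PR targets.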
Which issue(s) this PR fixes:
Part of #120622
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Free events cached in the scheduling queue as soon as possible so that the scheduler consumes less memory.
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
This issue is currently awaiting triage.
If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cc @alculquicondor
Will check the test failure tomorrow.
cc @tosi3k
/retest
Can you run some memory profiling based on this, perhaps using scheduling_perf?
@ahg-g can take over the review from me if this requires extra changes on Friday or next week.
@sanposhiho can you rebase this? It would be great to merge something along these lines before the code freeze.
We have an internal suite that we can use to validate whether it helps and report back.
OK, will do tomorrow morning JST.
/remove-lifecycle rotten
I suppose it's a mistake.
Removing the needs-rebase label by hand; the automation did not clear it!
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the PR is closed
You can:
- Reopen this PR with /reopen
- Mark this PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closed this PR.
/remove-lifecycle rotten /reopen
This is still relevant.
@sanposhiho: Reopened this PR.
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the PR is closed
You can:
- Mark this PR as fresh with /remove-lifecycle rotten
- Close this PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
@sanposhiho please rebase!
Rebased. Now that https://github.com/kubernetes/kubernetes/issues/118226 is solved, we're ready to merge this with no concerns at all.
@kubernetes/sig-scheduling-approvers can anyone take a look for the approve?
cc @macsko
Can you possibly take a look? Sorry for mentioning you on my PRs everywhere 😅
/lgtm
LGTM label has been added.
cc @alculquicondor
for /approve
/retest
/release-note-edit
When SchedulerQueueingHints is enabled, clear events cached in the scheduling queue as soon as possible so that the scheduler consumes less memory.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: alculquicondor, sanposhiho
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~pkg/scheduler/OWNERS~~ [alculquicondor,sanposhiho]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.