kubernetes icon indicating copy to clipboard operation
kubernetes copied to clipboard

fix(scheduler_one): call Done() as soon as possible

Open sanposhiho opened this issue 2 years ago • 18 comments

What type of PR is this?

/kind feature

What this PR does / why we need it:

call Done() as soon as possible -- specifically after WaitOnPermit.


All cluster events during scheduling are piled up in inFlightEvents. Basically, we need them to evaluate which queue to push failed Pods to. But, after https://github.com/kubernetes/kubernetes/pull/119105, we always push Pods to activeQ/backoffQ when those Pods are failed at extension points other than PreFilter, Filter, Reserve, and WaitOnPermit. In other words, if Pods go through WaitOnPermit, we don't need to have cluster events for those Pods. By calling Done() as soon as possible, we can reduce the memory consumed in inFlightEvents.

Which issue(s) this PR fixes:

Part of #120622

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Free events cached in the scheduling queue as soon as possible so that the scheduler consumes less memory.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


sanposhiho avatar Sep 12 '23 01:09 sanposhiho

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Sep 12 '23 01:09 k8s-ci-robot

/cc @alculquicondor

sanposhiho avatar Sep 12 '23 01:09 sanposhiho

Will check the test failure tomorrow.

sanposhiho avatar Sep 12 '23 10:09 sanposhiho

cc @tosi3k

alculquicondor avatar Sep 13 '23 20:09 alculquicondor

/retest

sanposhiho avatar Sep 20 '23 02:09 sanposhiho

Can you run some memory profiling based on this, perhaps using scheduling_perf?

alculquicondor avatar Sep 20 '23 17:09 alculquicondor

@ahg-g can take over the review from me if this requires extra changes on Friday or next week.

alculquicondor avatar Sep 20 '23 17:09 alculquicondor

@sanposhiho can you rebase this? I would be great to merge something along these line before the code freeze.

We have an internal suite that we can use to validate whether it helps and report back.

alculquicondor avatar Oct 23 '23 14:10 alculquicondor

OK, will do tomorrow morning JST.

sanposhiho avatar Oct 23 '23 15:10 sanposhiho

/remove-lifecycle rotten

I suppose it's mistake

sanposhiho avatar Oct 25 '23 12:10 sanposhiho

removing needs-rebase label by hand, automation did not clear it!

dims avatar Oct 25 '23 15:10 dims

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot avatar Feb 04 '24 08:02 k8s-triage-robot

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Feb 04 '24 08:02 k8s-ci-robot

/remove-lifecycle rotten /reopen

This is still relevant.

sanposhiho avatar Feb 04 '24 08:02 sanposhiho

@sanposhiho: Reopened this PR.

In response to this:

/remove-lifecycle rotten /reopen

This is still relevant.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Feb 04 '24 08:02 k8s-ci-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar May 18 '24 12:05 k8s-triage-robot

/remove-lifecycle rotten

sanposhiho avatar May 20 '24 05:05 sanposhiho

@sanposhiho please rebase!

dims avatar Jun 25 '24 13:06 dims

Rebased, now https://github.com/kubernetes/kubernetes/issues/118226 got solved and we're ready to merge it with no concern at all.

@kubernetes/sig-scheduling-approvers can anyone take a look for the approve?

sanposhiho avatar Jun 28 '24 00:06 sanposhiho

cc @macsko

Can you possibly take a look? Sorry for having mentioned you my PRs everywhere 😅

sanposhiho avatar Jul 19 '24 10:07 sanposhiho

/lgtm

macsko avatar Jul 19 '24 13:07 macsko

LGTM label has been added.

Git tree hash: fe06d451a702bc14d48334b2e717e505d6397591

k8s-ci-robot avatar Jul 19 '24 13:07 k8s-ci-robot

cc @alculquicondor for /approve

sanposhiho avatar Jul 19 '24 13:07 sanposhiho

/retest

sanposhiho avatar Jul 19 '24 14:07 sanposhiho

/release-note-edit

When SchedulerQueueingHints is enabled, clear events cached in the scheduling queue as soon as possible so that the scheduler consumes less memory.

alculquicondor avatar Aug 19 '24 18:08 alculquicondor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, sanposhiho

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Aug 19 '24 18:08 k8s-ci-robot

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Aug 20 '24 03:08 k8s-ci-robot