Proposal of introducing a rebalance mechanism to actively trigger rescheduling of resource

Open chaosi-zju opened this issue 11 months ago • 5 comments

What type of PR is this?

/kind design /kind documentation

What this PR does / why we need it:

Proposal to introduce a rebalance mechanism that actively triggers rescheduling of resources.

Assume the user has propagated workloads to member clusters. In some scenarios the current replica distribution is no longer the one the user expects, for example:

  • Replicas were migrated due to cluster failover, and the failed cluster has since recovered.
  • Replicas were migrated due to application-level failover, and each cluster now has sufficient resources to run them.
  • With the Aggregated scheduling strategy, replicas were initially spread across multiple clusters because of resource constraints, but a single cluster can now accommodate all of them.

The user therefore wants a way to trigger rescheduling so that the replica distribution can be rebalanced.
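
To make the third scenario concrete, here is a minimal toy sketch in Go. It is not Karmada's scheduler and does not use any Karmada API: the `Cluster` type and the `aggregate` function are hypothetical names introduced only for illustration, and "capacity" is reduced to a single replica count per cluster. It assumes an Aggregated-style placement means "use as few clusters as possible, fullest-capacity first".

```go
package main

import (
	"fmt"
	"sort"
)

// Cluster is a hypothetical, simplified view of a member cluster:
// a name and the number of replicas it can currently accommodate.
type Cluster struct {
	Name      string
	Available int
}

// aggregate packs replicas into as few clusters as possible by filling
// the clusters with the most available capacity first. It returns the
// per-cluster assignment, and false if total capacity is insufficient.
// This is an illustrative assumption, not Karmada's actual algorithm.
func aggregate(clusters []Cluster, replicas int) (map[string]int, bool) {
	// Copy before sorting so the caller's slice is left untouched.
	sorted := append([]Cluster(nil), clusters...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Available > sorted[j].Available
	})

	result := map[string]int{}
	remaining := replicas
	for _, c := range sorted {
		if remaining == 0 {
			break
		}
		n := c.Available
		if n > remaining {
			n = remaining
		}
		if n > 0 {
			result[c.Name] = n
			remaining -= n
		}
	}
	return result, remaining == 0
}

func main() {
	// At first scheduling time no single cluster can hold 5 replicas.
	before := []Cluster{{"member1", 3}, {"member2", 2}}
	initial, _ := aggregate(before, 5)
	fmt.Println("initial:", initial)

	// Later member1 frees up resources; rerunning the same decision
	// would now place everything in one cluster.
	after := []Cluster{{"member1", 6}, {"member2", 2}}
	rebalanced, _ := aggregate(after, 5)
	fmt.Println("rebalanced:", rebalanced)
}
```

In this sketch the first call splits the five replicas across both clusters, while the second call places all of them on `member1`. The gap between those two results is what a rebalance would close: nothing in the workload changed, so without an explicit trigger the earlier placement simply stays in effect.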

Which issue(s) this PR fixes:

Fixes part of #4840

Special notes for your reviewer:

Does this PR introduce a user-facing change?:


chaosi-zju avatar Mar 12 '24 05:03 chaosi-zju

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 53.33%. Comparing base (5bc8c54) to head (0e1922c). Report is 113 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4698      +/-   ##
==========================================
+ Coverage   53.12%   53.33%   +0.20%     
==========================================
  Files         251      252       +1     
  Lines       20417    20482      +65     
==========================================
+ Hits        10847    10924      +77     
+ Misses       8856     8836      -20     
- Partials      714      722       +8     
Flag Coverage Δ
unittests 53.33% <ø> (+0.20%) :arrow_up:

codecov-commenter avatar Mar 12 '24 06:03 codecov-commenter

This PR mixes fault self-healing and rescheduling. I think fault self-healing includes rescheduling: when a node crashes, for example, the workload that owns the pods on that node regenerates them. That is done by multiple controllers working together, including the scheduler. If the goal is self-healing, coordination among multiple components has to be considered. If the goal is only rescheduling, then only the eviction targets and the conditions for stopping eviction need to be considered. Could we consider the design concept of the community's Descheduler project?

wu0407 avatar Mar 12 '24 11:03 wu0407

I put a lot of work into thoroughly improving this proposal. Everyone is welcome to go through it again; looking forward to your suggestions~

chaosi-zju avatar May 09 '24 12:05 chaosi-zju

This PR mixes fault self-healing and rescheduling.

@wu0407 Hello, I have updated this proposal. Actually, this proposal is entirely about rescheduling; cluster failover is only one of its user stories. For more information, please see the latest proposal. Thank you for your comments~

chaosi-zju avatar May 09 '24 12:05 chaosi-zju

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RainbowMango

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

karmada-bot avatar May 24 '24 01:05 karmada-bot