karmada
Proposal for accurate estimator improvement for CRD scheduling
What type of PR is this?
/kind design
What this PR does / why we need it: Described in document.
Which issue(s) this PR fixes: Fixes #
Special notes for your reviewer: Proposal doc for CRD scheduling improvements. Posting proposal following discussion in community meeting.
Does this PR introduce a user-facing change?:
NONE
:warning: Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 28.29%. Comparing base (c8acebc) to head (c026200). Report is 34 commits behind head on master.
:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@ Coverage Diff @@
## master #5085 +/- ##
==========================================
+ Coverage 28.21% 28.29% +0.08%
==========================================
Files 632 632
Lines 43568 43635 +67
==========================================
+ Hits 12291 12345 +54
- Misses 30381 30388 +7
- Partials 896 902 +6
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 28.29% <ø> (+0.08% :arrow_up:) | |
@RainbowMango - I've added this proposal to the discussion section of the community meeting tomorrow. By chance, would it be possible to move the meeting 30 minutes earlier? I've got a conflict at the moment.
I'm ok with it since this is the only topic for this meeting. I'll send a notice to the mailing group and slack channel, and gather feedback.
@RainbowMango Thanks so much for reviewing this with me during the community meeting!
Just to add more context here: I've heard there is also work being done to support CRDs with multiple pod templates (like FlinkDeployment or TensorFlow jobs, for instance). For a FlinkDeployment, we cannot have replicas of the same job scheduled on different clusters - meaning we either schedule all pods on one cluster, or do not schedule at all. Once we schedule the CRD to a member cluster, all pod scheduling is handled by the Flink operator.
Thinking about this more, I think it makes more sense to approach this using one of your suggestions: make Components the top-level API definition, with replicas defined within each individual component. If we need all replicas scheduled on one cluster, we can set spreadConstraints on the related PropagationPolicy.
/assign
Hi @RainbowMango @whitewindmills, I've updated the proposal doc with a more precise explanation of the implementation details. Please let me know if you have any comments or questions - and apologies in advance if this is a little dense; perhaps I can typeset it in LaTeX and attach an image of the algorithm-specific sections. :)
Quick note - there still needs to be some work done on how multiple components would work from a customized resource modeling perspective. I'll try to add a section to that this weekend.
So this design doesn't consider replica division, and it's also not a completely accurate calculation since the fragmentation issue is unavoidable - am I right?
Yes, this approach would not consider divided replicas, based on the use cases we've compiled (#5115). At the moment, most CRDs get scheduled to a single cluster rather than spread across multiple clusters.
In terms of the precision of the calculation, you're right that it's not completely accurate. It's an estimate of how many CRDs could be fully packed onto the destination cluster. However, we do guarantee that at least one CRD can be scheduled on a member cluster, so there should never be a scenario where we schedule a CRD to a member cluster that lacks sufficient resources to hold it.
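To make the discussion above concrete, here is a rough sketch of this kind of packing estimate. The types, field names, and numbers are purely illustrative (not the actual Karmada estimator code), and dividing cluster-wide allocatable capacity like this ignores per-node fragmentation, which is exactly why the result is an estimate rather than an exact count:

```go
package main

import "fmt"

// Component describes one pod template of a CRD along with its replica count.
// This is an illustrative type, not the actual Karmada API.
type Component struct {
	Name     string
	Replicas int64
	CPUMilli int64 // per-replica CPU request, in millicores
	MemBytes int64 // per-replica memory request, in bytes
}

// estimateFit estimates how many complete CRD instances (all components, all
// replicas) fit into the given allocatable capacity. It sums the per-instance
// demand, divides the cluster-wide allocatable by it, and takes the minimum
// across resource dimensions. Because node boundaries are ignored, the true
// packable number may be lower due to fragmentation.
func estimateFit(components []Component, allocCPUMilli, allocMemBytes int64) int64 {
	var cpu, mem int64
	for _, c := range components {
		cpu += c.Replicas * c.CPUMilli
		mem += c.Replicas * c.MemBytes
	}
	if cpu == 0 || mem == 0 {
		return 0
	}
	byCPU := allocCPUMilli / cpu
	byMem := allocMemBytes / mem
	if byMem < byCPU {
		return byMem
	}
	return byCPU
}

func main() {
	// A FlinkDeployment-like workload: 1 JobManager + 2 TaskManagers.
	flink := []Component{
		{Name: "jobmanager", Replicas: 1, CPUMilli: 1000, MemBytes: 2 << 30},
		{Name: "taskmanager", Replicas: 2, CPUMilli: 2000, MemBytes: 4 << 30},
	}
	// A cluster with 16 cores and 64Gi allocatable: CPU allows 3 instances,
	// memory allows 6, so the estimate is 3.
	fmt.Println(estimateFit(flink, 16000, 64<<30)) // prints 3
}
```

A per-node variant of the same check (does at least one full instance fit on some node?) is what would back the "at least one CRD can be scheduled" guarantee.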
Hi @mszacillo I'm going to take two weeks off and might be slow to respond during this time.
I believe this feature is extremely important, and I'll focus on it once I get back. Given that this feature involves both controllers and schedulers, it's not easy to come up with an ideal solution in a short time.
By the way, I guess that with the help of the default resource interpreter for FlinkDeployment (thanks for your contribution, by the way), this is probably not a blocker for you, am I right? Do you think the Application Failover feature has a higher priority than this?
Hi @RainbowMango, thanks for the heads up and enjoy your time off!
this probably not a blocker for you, am I right? Do you think the Application Failover Feature has a higher priority than this?
Yes, this is currently not a blocker for us. We can get by with the existing maxReplica estimation while we determine a solution for multiple podTemplate support, and we are instead focusing on the failover feature enhancements. For our MVP using Karmada we need two things:
- Karmada should attach a label/annotation to the workload if it fails over to another cluster: #4969. This is currently implemented, and we're simply doing some testing and internal code review before opening a PR upstream. This change will also address an existing bug detailed in #4215, by guaranteeing that Karmada will not reschedule workloads to the cluster they failed over from.
- Address the limitation in the FederatedResourceQuota discussed in #5179, for which I am currently implementing a workaround.
After we've completed the above tickets, our priorities will be publishing the implementation for the later steps of the failover history proposal, and then working on multiple pod template support.
Hi, did you make any progress or is there any available prototype to try?
Not yet. I like this proposal very much, and glad to help move this forward.
Will you support new replicaSchedulingType like Gang ?
Maybe not, since Gang scheduling is already supported, I guess. Please let us know your use case if you are interested in this proposal.
I found the proposal in #5218. I think this is what I need. We need to ensure the minimum replicas of job can be run in one cluster.
Out of curiosity, which CRD are you working with? Until this pod template proposal is implemented, you could customize the way your replicas are interpreted so that the resource requirement takes the max(cpu, memory) across all your templates. Replica estimation would then be able to determine whether you could schedule all your replicas on a single cluster.
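The workaround suggested above can be sketched as follows. The types and numbers here are hypothetical (a real setup would express this through Karmada's resource interpreter customization); the point is only the aggregation rule - total replicas across templates, with a per-replica requirement equal to the element-wise maximum over all templates:

```go
package main

import "fmt"

// template holds the per-replica requests of one pod template; illustrative only.
type template struct {
	name     string
	replicas int64
	cpuMilli int64 // per-replica CPU request, in millicores
	memBytes int64 // per-replica memory request, in bytes
}

// interpretReplicas mimics the suggested customized interpretation: report the
// total replica count across templates, with a per-replica requirement equal
// to the element-wise maximum over all templates. The estimator then checks,
// conservatively, whether the cluster can hold every replica sized as the
// largest template.
func interpretReplicas(ts []template) (replicas, cpuMilli, memBytes int64) {
	for _, t := range ts {
		replicas += t.replicas
		if t.cpuMilli > cpuMilli {
			cpuMilli = t.cpuMilli
		}
		if t.memBytes > memBytes {
			memBytes = t.memBytes
		}
	}
	return
}

func main() {
	r, c, m := interpretReplicas([]template{
		{"jobmanager", 1, 500, 1 << 30},
		{"taskmanager", 2, 2000, 4 << 30},
	})
	// 3 replicas, each counted as 2000m CPU / 4Gi memory.
	fmt.Println(r, c, m)
}
```

This over-reserves for the smaller templates, but it errs on the safe side: if the cluster can fit all replicas at the maximum size, it can certainly fit the real mix.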
Hi, we also have some use cases involving multiple pod templates in Volcano Job (see https://github.com/volcano-sh/volcano), and we have designed a complete API for multiple pod templates here: https://github.com/karmada-io/karmada/blob/79e374dd3840254e40d8abb33479fc81175a356b/docs/proposals/multi-template-partitionting-and-scheduler-extensibility/README.md#update-resourcebinding-api-definition, which we think is more robust. Could you take a look?
Here is a volcano job CRD example: https://github.com/volcano-sh/volcano/blob/master/example/MindSpore-example/mindspore_gpu/mindspore-gpu.yaml
Hi @RainbowMango @Monokaix @mszacillo ,
We also have a use-case involving RayJob and RayCluster that requires support for multiple replica templates. Could you please let me know whether there is a roadmap or timeline for advancing this feature? I would be delighted to help move it forward.
I'm glad to see more and more use cases that require this feature. You can see in the backlog that this feature is planned for release-1.15. I believe it can be accelerated with your help. I'll try to add some of my draft ideas to this proposal, and then maybe you can iterate on it further.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: RainbowMango
- ~~OWNERS~~ [RainbowMango]