karmada
Proposal for accurate estimator improvement for CRD scheduling
What type of PR is this?
/kind design
What this PR does / why we need it: Described in document.
Which issue(s) this PR fixes: Fixes #
Special notes for your reviewer: Proposal doc for CRD scheduling improvements. Posting proposal following discussion in community meeting.
Does this PR introduce a user-facing change?:
NONE
:warning: Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 28.29%. Comparing base (c8acebc) to head (c026200). Report is 34 commits behind head on master.
:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@ Coverage Diff @@
## master #5085 +/- ##
==========================================
+ Coverage 28.21% 28.29% +0.08%
==========================================
Files 632 632
Lines 43568 43635 +67
==========================================
+ Hits 12291 12345 +54
- Misses 30381 30388 +7
- Partials 896 902 +6
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 28.29% <ø> (+0.08% :arrow_up:) | |
@RainbowMango - I've added this proposal to the discussion section of the community meeting tomorrow. By chance, would it be possible to move the meeting 30 minutes earlier? I've got a conflict at the moment.
I'm ok with it since this is the only topic for this meeting. I'll send a notice to the mailing group and slack channel, and gather feedback.
@RainbowMango Thanks so much for reviewing this with me during the community meeting!
Just to add more context here: I've heard there is also work being done to support CRDs with multiple pod templates (like FlinkDeployment or TensorFlow jobs, for instance). For a FlinkDeployment, we cannot have replicas of the same job scheduled on different clusters - meaning we either schedule all pods on one cluster, or do not schedule at all. Once we schedule the CRD to a member cluster, all pod scheduling is handled by the Flink operator.
Thinking about this more, I think it makes more sense to approach this using one of your suggestions: make Components the top-level API definition, with replicas defined within each individual component. If we need all replicas scheduled on one cluster, we can set spreadConstraints on the related PropagationPolicy.
/assign
Hi @RainbowMango @whitewindmills, I've updated the proposal doc with a more precise explanation of the implementation details. Please let me know if you have any comments or questions - and apologies in advance if this is a little dense; perhaps I can typeset it in LaTeX and attach an image of the algorithm-specific sections. :)
Quick note - there still needs to be some work done on how multiple components would work from a customized resource modeling perspective. I'll try to add a section to that this weekend.
So this design doesn't consider replica division, and it's also not a completely accurate calculation since the fragmentation issue is unavoidable - am I right?
Yes, this approach would not consider divided replicas, based on the use cases we've compiled (#5115). At the moment, most CRDs get scheduled to a single cluster rather than spread across multiple clusters.
In terms of the precision of the calculation, you're right that it's not completely accurate. It's an estimate of how many CRDs could be fully packed onto the destination cluster. However, we do guarantee that at least one CRD can be scheduled on a member cluster, so there should never be a scenario where we schedule a CRD to a member cluster that lacks sufficient resources to hold it.
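To make the discussion above concrete, here is a rough sketch of this kind of packing estimate. The types, field names, and numbers are purely illustrative (not the actual Karmada estimator code), and dividing cluster-wide allocatable capacity like this ignores per-node fragmentation, which is exactly why the result is an estimate rather than an exact count:

```go
package main

import "fmt"

// Component describes one pod template of a CRD along with its replica count.
// This is an illustrative type, not the actual Karmada API.
type Component struct {
	Name     string
	Replicas int64
	CPUMilli int64 // per-replica CPU request, in millicores
	MemBytes int64 // per-replica memory request, in bytes
}

// estimateFit estimates how many complete CRD instances (all components, all
// replicas) fit into the given allocatable capacity. It sums the per-instance
// demand, divides the cluster-wide allocatable by it, and takes the minimum
// across resource dimensions. Because node boundaries are ignored, the true
// packable number may be lower due to fragmentation.
func estimateFit(components []Component, allocCPUMilli, allocMemBytes int64) int64 {
	var cpu, mem int64
	for _, c := range components {
		cpu += c.Replicas * c.CPUMilli
		mem += c.Replicas * c.MemBytes
	}
	if cpu == 0 || mem == 0 {
		return 0
	}
	byCPU := allocCPUMilli / cpu
	byMem := allocMemBytes / mem
	if byMem < byCPU {
		return byMem
	}
	return byCPU
}

func main() {
	// A FlinkDeployment-like workload: 1 JobManager + 2 TaskManagers.
	flink := []Component{
		{Name: "jobmanager", Replicas: 1, CPUMilli: 1000, MemBytes: 2 << 30},
		{Name: "taskmanager", Replicas: 2, CPUMilli: 2000, MemBytes: 4 << 30},
	}
	// A cluster with 16 cores and 64Gi allocatable: CPU allows 3 instances,
	// memory allows 6, so the estimate is 3.
	fmt.Println(estimateFit(flink, 16000, 64<<30)) // prints 3
}
```

A per-node variant of the same check (does at least one full instance fit on some node?) is what would back the "at least one CRD can be scheduled" guarantee.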
Hi @mszacillo I'm going to take two weeks off and might be slow to respond during this time.
I believe this feature is extremely important, and I'll focus on it once I get back. Given that this feature involves both controllers and schedulers, it's not easy to come up with an ideal solution in a short time.
By the way, I guess that with the help of the default resource interpreter for FlinkDeployment (thanks for your contribution, by the way), this is probably not a blocker for you, am I right? Do you think the Application Failover feature has a higher priority than this?
Hi @RainbowMango, thanks for the heads up and enjoy your time off!
this probably not a blocker for you, am I right? Do you think the Application Failover Feature has a higher priority than this?
Yes, this is currently not a blocker for us. We can get by with the existing maxReplica estimation while we determine a solution for multiple podTemplate support, and we are instead focusing on the failover feature enhancements. For our MVP using Karmada we need two things:
- Karmada should attach a label/annotation to the workload if it fails over to another cluster: #4969. This is currently implemented, and we're simply doing some testing and internal code review before opening a PR upstream. This change will also address an existing bug detailed in #4215, by guaranteeing that Karmada will not reschedule workloads to the cluster they failed over from.
- Address the limitation in the FederatedResourceQuota discussed in #5179, for which I am currently implementing a workaround.
After we've completed the above tickets, our priorities will be publishing the implementation for the later steps of the failover history proposal, and then working on multiple pod template support.
Hi, did you make any progress or is there any available prototype to try?
Not yet. I like this proposal very much, and glad to help move this forward.
Will you support new replicaSchedulingType like Gang ?
Maybe not, since Gang scheduling is already supported, I guess. Please let us know your use case if you are interested in this proposal.
I found the proposal in #5218. I think this is what I need. We need to ensure the minimum replicas of job can be run in one cluster.
Out of curiosity, which CRD are you working with? Until this pod template proposal is implemented, you could customize the way your replicas are interpreted so that the resource requirement takes the max(cpu, memory) across all your templates. Replica estimation would then be able to determine whether you could schedule all your replicas on a single cluster.
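The workaround suggested above can be sketched as follows. The types and numbers here are hypothetical (a real setup would express this through Karmada's resource interpreter customization); the point is only the aggregation rule - total replicas across templates, with a per-replica requirement equal to the element-wise maximum over all templates:

```go
package main

import "fmt"

// template holds the per-replica requests of one pod template; illustrative only.
type template struct {
	name     string
	replicas int64
	cpuMilli int64 // per-replica CPU request, in millicores
	memBytes int64 // per-replica memory request, in bytes
}

// interpretReplicas mimics the suggested customized interpretation: report the
// total replica count across templates, with a per-replica requirement equal
// to the element-wise maximum over all templates. The estimator then checks,
// conservatively, whether the cluster can hold every replica sized as the
// largest template.
func interpretReplicas(ts []template) (replicas, cpuMilli, memBytes int64) {
	for _, t := range ts {
		replicas += t.replicas
		if t.cpuMilli > cpuMilli {
			cpuMilli = t.cpuMilli
		}
		if t.memBytes > memBytes {
			memBytes = t.memBytes
		}
	}
	return
}

func main() {
	r, c, m := interpretReplicas([]template{
		{"jobmanager", 1, 500, 1 << 30},
		{"taskmanager", 2, 2000, 4 << 30},
	})
	// 3 replicas, each counted as 2000m CPU / 4Gi memory.
	fmt.Println(r, c, m)
}
```

This over-reserves for the smaller templates, but it errs on the safe side: if the cluster can fit all replicas at the maximum size, it can certainly fit the real mix.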
Hi, we also have some use cases involving multiple pod templates in Volcano Job (see https://github.com/volcano-sh/volcano), and we have designed a complete API for multiple pod templates here: https://github.com/karmada-io/karmada/blob/79e374dd3840254e40d8abb33479fc81175a356b/docs/proposals/multi-template-partitionting-and-scheduler-extensibility/README.md#update-resourcebinding-api-definition, which we think is more robust. Could you take a look?
Here is a volcano job CRD example: https://github.com/volcano-sh/volcano/blob/master/example/MindSpore-example/mindspore_gpu/mindspore-gpu.yaml
Hi @RainbowMango @Monokaix @mszacillo ,
We also have a use-case involving RayJob and RayCluster that requires support for multiple replica templates. Could you please let me know whether there is a roadmap or timeline for advancing this feature? I would be delighted to help move it forward.
I'm glad to see more and more use cases that require this feature. You can see in the backlog that this feature is planned for release-1.15. I believe it can be accelerated with your help. I'll try to add some of my draft ideas to this proposal, and then maybe you can iterate on it further.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: RainbowMango
- ~~OWNERS~~ [RainbowMango]