[Proposal] Support for cluster-level resource propagation pause and resume capabilities
What type of PR is this?
/kind design /kind documentation
What this PR does / why we need it:
Which issue(s) this PR fixes: Fixes #1567, #4421, #4688
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
NONE
/cc @CharlesQQ @a7i @chaunceyjiang @whitewindmills Hi guys, can you help take a look?
@XiShanYongYe-Chang: GitHub didn't allow me to request PR reviews from the following users: CharlesQQ.
Note that only karmada-io members and repo collaborators can review this PR, and authors cannot review their own PRs.
In response to this:
/cc @CharlesQQ @a7i @chaunceyjiang @whitewindmills Hi guys, can you help take a look?
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
:warning: Please install the to ensure uploads and comments are reliably processed by Codecov.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 28.25%. Comparing base (
a87ec2a) to head (24e56c7). Report is 1163 commits behind head on master.
:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@ Coverage Diff @@
## master #5118 +/- ##
==========================================
+ Coverage 28.23% 28.25% +0.01%
==========================================
Files 632 632
Lines 43723 43739 +16
==========================================
+ Hits 12345 12358 +13
- Misses 30473 30479 +6
+ Partials 905 902 -3
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 28.25% <ø> (+0.01%) |
:arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
@XiShanYongYe-Chang it's also releated issue
/assign
Hi @CharlesQQ @chaunceyjiang @whitewindmills @RainbowMango , it's ready to review again.
Hi! Thanks for your proposal!
In Solution One, the
SchedulePauseofResourceBindingis propagated throughPropagationPolicy. If I create aPropagationPolicywith aPauseStrategyset tonilby default, is it feasible to only update theSchedulePauseofResourceBindingto true via a webhook when thekarmada detectorcreates theResourceBinding? If this is feasible, does "propagation" mean that the associatedResourceBinding/Workis only updated when thePropagationPolicyupdates thePauseStrategyfield?I also have a suggestion: can this issue be considered as a user story? In brief, it involves pausing the
ResourceBindingat the time of creation through a webhook, and then using a custom controller to determine when certainResourceBindingscan be scheduled.#4919
+1
What if set rb's schedulePause to true directly by webhook? this case will not get propagation involved, seems it's an independent case and is not like the work pause which is inherited from pp.
is it feasible to only update the SchedulePause of ResourceBinding to true via a webhook when the karmada detector creates the ResourceBinding?
For solution 1, the source of user control is in the PropagationPolicy. If it is modified through webhook, it will be inconsistent with the declaration of PropagationPolicy.
I think solution 2 might be better for your scenario.
In brief, it involves pausing the ResourceBinding at the time of creation through a webhook, and then using a custom controller to determine when certain ResourceBindings can be scheduled.
I understand that this is a specific implementation, not a user case. In fact, the two actions you point out can be done by controlling the pause of RB. Karmada is not aware of when you pause and resume.
That's why I pointed out that solution 2 is more suitable for your use case.
When the user sets pause at the pp or cr level, other components such as wenhook may also set pause/resume at the rb or work level. There will be a conflict here. We should explain this limitation: )
Yes, if you need to set it up via a webhook, the user needs to be aware of the restrictions.
In fact, what I am concerned about is whether the new work object has been modified by the latest overridepolicy after resumption of propagation. This must be ensured.
In fact, what I am concerned about is whether the new work object has been modified by the latest overridepolicy after resumption of propagation. This must be ensured.
Good question! What you say should be the definitive expected behavior, because when you change the op, work is updated, and when work is resumed, it is updated with the latest spec.
I would prefer to introduce this functionality in PropagationPolicy as it is more convenient for users and straightforward than introducing another API.
Also proposing the API:
// PropagationSpec represents the desired behavior of PropagationPolicy.
type PropagationSpec struct {
// Suspension declares the policy for suspending different aspects of propagation.
// nil means no suspension. no default values.
// +optional
Suspension *Suspension `json:"suspension,omitempty"`
}
// Suspension defines the policy for suspending different aspects of propagation.
type Suspension struct {
// Scheduling controls whether scheduling should be suspended.
// +optional
// Note: Postpone this until there is a solid use case.
// Scheduling *bool `json:"scheduling,omitempty"`
// Dispatching controls whether dispatching should be suspended.
// nil means not suspend, no default value, only accepts 'true'.
// Note: true means stop propagating to all clusters. Can not co-exist
// with DispatchingOnClusters which is used to suspend particular clusters.
// +optional
Dispatching *bool `json:"dispatching,omitempty"`
// DispatchingOnClusters declares a list of clusters to which the dispatching
// should be suspended.
// Note: Can not co-exist with Dispatching which is used to suspend all.
// +optional
DispatchingOnClusters *SuspendClusters `json:"dispatchingOnClusters,omitempty"`
}
// SuspendClusters represents a group of clusters that should be suspended from propagating.
// Note: No plan to introduce the label selector or field selector to select clusters yet, as it
// would make the system unpredictable.
type SuspendClusters struct {
// ClusterNames is the list of clusters to be selected.
// +optional
ClusterNames []string `json:"clusterNames,omitempty"`
}
Note: ResourceBinding and Work also need similar field(s).
Another question here, after scheduling or distribution paused, will there be any problems with resource occupation calculation? For example, will resources be pre-occupied after the work is paused? If resources are occupied and multiple workare resumed at the same time, will resource calculation errors be caused?
Another question here, after scheduling or distribution paused, will there be any problems with resource occupation calculation? For example, will resources be pre-occupied after the work is paused? If resources are occupied and multiple workare resumed at the same time, will resource calculation errors be caused?
For scheduling paused, resources will not be pre-occupied. For distribution pause, I think it will.
/retest
Hi guys, at the moment, I'm inclined towards Plan One, as it is more straightforward and convenient for users, without the need for learning new APIs. If we introduce a new API to describe the pause strategy for resources, it might seem a bit limited. Later on, if we provide a Rollout API, we can then focus on thinking about adding new APIs.
How do you think? @CharlesQQ @chaunceyjiang @whitewindmills @RainbowMango @a7i @Monokaix @Vacant2333
I vote for solution one and sent the draft API on https://github.com/karmada-io/karmada/pull/5118#issuecomment-2226864919.
That sounds good!
Once approved/finalized, would you be open to contributions or did you want to lead this change?
Once approved/finalized, would you be open to contributions or did you want to lead this change?
Of course, I will create a new issue to track the tasks that need to be completed, and I look forward to your participation in some of these tasks.
Hi Amir, After we make the decision here, we can get started from #4838(the one you sent 3 months ago, sorry again for the delay). As @XiShanYongYe-Chang said above, he will help to open an umbrella issue to track all the tasks there, including code, testing, etc, feel free to take any tasks you are interested in, even all of them, just leave a comment on that umbrella issue would be fine. Thank you in advance!
I vote for solution one and sent the draft API on https://github.com/karmada-io/karmada/pull/5118#issuecomment-2226864919.
+1
/retest
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: RainbowMango
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [RainbowMango]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment