karmada icon indicating copy to clipboard operation
karmada copied to clipboard

[Proposal] Support for cluster-level resource propagation pause and resume capabilities

Open XiShanYongYe-Chang opened this issue 1 year ago • 6 comments

What type of PR is this?

/kind design /kind documentation

What this PR does / why we need it:

Which issue(s) this PR fixes: Fixes #1567, #4421, #4688

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

XiShanYongYe-Chang avatar Jul 02 '24 03:07 XiShanYongYe-Chang

/cc @CharlesQQ @a7i @chaunceyjiang @whitewindmills Hi guys, can you help take a look?

XiShanYongYe-Chang avatar Jul 02 '24 03:07 XiShanYongYe-Chang

@XiShanYongYe-Chang: GitHub didn't allow me to request PR reviews from the following users: CharlesQQ.

Note that only karmada-io members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @CharlesQQ @a7i @chaunceyjiang @whitewindmills Hi guys, can you help take a look?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

karmada-bot avatar Jul 02 '24 03:07 karmada-bot

:warning: Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 28.25%. Comparing base (a87ec2a) to head (24e56c7). Report is 1163 commits behind head on master.

:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5118      +/-   ##
==========================================
+ Coverage   28.23%   28.25%   +0.01%     
==========================================
  Files         632      632              
  Lines       43723    43739      +16     
==========================================
+ Hits        12345    12358      +13     
- Misses      30473    30479       +6     
+ Partials      905      902       -3     
Flag Coverage Δ
unittests 28.25% <ø> (+0.01%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov-commenter avatar Jul 02 '24 03:07 codecov-commenter

@XiShanYongYe-Chang it's also releated issue

whitewindmills avatar Jul 02 '24 03:07 whitewindmills

/assign

whitewindmills avatar Jul 02 '24 03:07 whitewindmills

Hi @CharlesQQ @chaunceyjiang @whitewindmills @RainbowMango , it's ready to review again.

XiShanYongYe-Chang avatar Jul 05 '24 08:07 XiShanYongYe-Chang

Hi! Thanks for your proposal!

In Solution One, the SchedulePause of ResourceBinding is propagated through PropagationPolicy. If I create a PropagationPolicy with a PauseStrategy set to nil by default, is it feasible to only update the SchedulePause of ResourceBinding to true via a webhook when the karmada detector creates the ResourceBinding? If this is feasible, does "propagation" mean that the associated ResourceBinding/Work is only updated when the PropagationPolicy updates the PauseStrategy field?

I also have a suggestion: can this issue be considered as a user story? In brief, it involves pausing the ResourceBinding at the time of creation through a webhook, and then using a custom controller to determine when certain ResourceBindings can be scheduled.

#4919

+1 What if set rb's schedulePause to true directly by webhook? this case will not get propagation involved, seems it's an independent case and is not like the work pause which is inherited from pp.

Monokaix avatar Jul 08 '24 07:07 Monokaix

is it feasible to only update the SchedulePause of ResourceBinding to true via a webhook when the karmada detector creates the ResourceBinding?

For solution 1, the source of user control is in the PropagationPolicy. If it is modified through webhook, it will be inconsistent with the declaration of PropagationPolicy.

I think solution 2 might be better for your scenario.

XiShanYongYe-Chang avatar Jul 08 '24 07:07 XiShanYongYe-Chang

In brief, it involves pausing the ResourceBinding at the time of creation through a webhook, and then using a custom controller to determine when certain ResourceBindings can be scheduled.

I understand that this is a specific implementation, not a user case. In fact, the two actions you point out can be done by controlling the pause of RB. Karmada is not aware of when you pause and resume.

That's why I pointed out that solution 2 is more suitable for your use case.

XiShanYongYe-Chang avatar Jul 08 '24 07:07 XiShanYongYe-Chang

When the user sets pause at the pp or cr level, other components such as wenhook may also set pause/resume at the rb or work level. There will be a conflict here. We should explain this limitation: )

Monokaix avatar Jul 08 '24 08:07 Monokaix

Yes, if you need to set it up via a webhook, the user needs to be aware of the restrictions.

XiShanYongYe-Chang avatar Jul 08 '24 08:07 XiShanYongYe-Chang

In fact, what I am concerned about is whether the new work object has been modified by the latest overridepolicy after resumption of propagation. This must be ensured.

CharlesQQ avatar Jul 10 '24 15:07 CharlesQQ

In fact, what I am concerned about is whether the new work object has been modified by the latest overridepolicy after resumption of propagation. This must be ensured.

Good question! What you say should be the definitive expected behavior, because when you change the op, work is updated, and when work is resumed, it is updated with the latest spec.

XiShanYongYe-Chang avatar Jul 11 '24 01:07 XiShanYongYe-Chang

I would prefer to introduce this functionality in PropagationPolicy as it is more convenient for users and straightforward than introducing another API.

Also proposing the API:

// PropagationSpec represents the desired behavior of PropagationPolicy.
type PropagationSpec struct {
	// Suspension declares the policy for suspending different aspects of propagation.
	// nil means no suspension. no default values.
	// +optional
	Suspension *Suspension `json:"suspension,omitempty"`
}

// Suspension defines the policy for suspending different aspects of propagation.
type Suspension struct {
	// Scheduling controls whether scheduling should be suspended.
	// +optional
	// Note: Postpone this until there is a solid use case.
	// Scheduling *bool `json:"scheduling,omitempty"`

	// Dispatching controls whether dispatching should be suspended.
	// nil means not suspend, no default value, only accepts 'true'.
	// Note: true means stop propagating to all clusters. Can not co-exist
	// with DispatchingOnClusters which is used to suspend particular clusters.
	// +optional
	Dispatching *bool `json:"dispatching,omitempty"`

	// DispatchingOnClusters declares a list of clusters to which the dispatching
	// should be suspended.
	// Note: Can not co-exist with Dispatching which is used to suspend all.
	// +optional
	DispatchingOnClusters *SuspendClusters `json:"dispatchingOnClusters,omitempty"`
}

// SuspendClusters represents a group of clusters that should be suspended from propagating.
// Note: No plan to introduce the label selector or field selector to select clusters yet, as it
// would make the system unpredictable.
type SuspendClusters struct {
	// ClusterNames is the list of clusters to be selected.
	// +optional
	ClusterNames []string `json:"clusterNames,omitempty"`
}

Note: ResourceBinding and Work also need similar field(s).

RainbowMango avatar Jul 13 '24 11:07 RainbowMango

Another question here, after scheduling or distribution paused, will there be any problems with resource occupation calculation? For example, will resources be pre-occupied after the work is paused? If resources are occupied and multiple workare resumed at the same time, will resource calculation errors be caused?

Monokaix avatar Jul 15 '24 02:07 Monokaix

Another question here, after scheduling or distribution paused, will there be any problems with resource occupation calculation? For example, will resources be pre-occupied after the work is paused? If resources are occupied and multiple workare resumed at the same time, will resource calculation errors be caused?

For scheduling paused, resources will not be pre-occupied. For distribution pause, I think it will.

XiShanYongYe-Chang avatar Jul 15 '24 07:07 XiShanYongYe-Chang

/retest

XiShanYongYe-Chang avatar Jul 15 '24 11:07 XiShanYongYe-Chang

Hi guys, at the moment, I'm inclined towards Plan One, as it is more straightforward and convenient for users, without the need for learning new APIs. If we introduce a new API to describe the pause strategy for resources, it might seem a bit limited. Later on, if we provide a Rollout API, we can then focus on thinking about adding new APIs.

How do you think? @CharlesQQ @chaunceyjiang @whitewindmills @RainbowMango @a7i @Monokaix @Vacant2333

XiShanYongYe-Chang avatar Jul 16 '24 09:07 XiShanYongYe-Chang

I vote for solution one and sent the draft API on https://github.com/karmada-io/karmada/pull/5118#issuecomment-2226864919.

RainbowMango avatar Jul 16 '24 09:07 RainbowMango

That sounds good!

Once approved/finalized, would you be open to contributions or did you want to lead this change?

a7i avatar Jul 16 '24 22:07 a7i

Once approved/finalized, would you be open to contributions or did you want to lead this change?

Of course, I will create a new issue to track the tasks that need to be completed, and I look forward to your participation in some of these tasks.

XiShanYongYe-Chang avatar Jul 17 '24 01:07 XiShanYongYe-Chang

Hi Amir, After we make the decision here, we can get started from #4838(the one you sent 3 months ago, sorry again for the delay). As @XiShanYongYe-Chang said above, he will help to open an umbrella issue to track all the tasks there, including code, testing, etc, feel free to take any tasks you are interested in, even all of them, just leave a comment on that umbrella issue would be fine. Thank you in advance!

RainbowMango avatar Jul 17 '24 02:07 RainbowMango

I vote for solution one and sent the draft API on https://github.com/karmada-io/karmada/pull/5118#issuecomment-2226864919.

+1

CharlesQQ avatar Jul 17 '24 06:07 CharlesQQ

/retest

RainbowMango avatar Jul 19 '24 09:07 RainbowMango

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RainbowMango

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

karmada-bot avatar Jul 19 '24 09:07 karmada-bot