
Adding proposal for workload affinity and anti-affinity support

Open mszacillo opened this issue 3 months ago • 10 comments

What type of PR is this? /kind feature

What this PR does / why we need it: This PR submits a proposal for adding workload affinity and workload anti-affinity as rules that the user can define on the PropagationPolicy. These rules would allow Karmada to support more complex scheduling challenges such as:

  • Spreading duplicate FlinkDeployment pipelines across clusters to guarantee HA (sketched below)
  • Scheduling many smaller training jobs to be co-located in the same cluster
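
For a rough flavor of what such a rule could look like on a PropagationPolicy (illustrative only; the concrete API shape is iterated on later in this thread):

apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: flink-ha
spec:
  resourceSelectors:
    - apiVersion: flink.apache.org/v1beta1
      kind: FlinkDeployment
  placement:
    workloadAffinity:
      affinityLabelKey: karmada.io/group
      antiAffinity:
        rejectSameGroup: true   # spread workloads sharing the same group across clusters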

mszacillo avatar Aug 24 '25 23:08 mszacillo

:warning: Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 46.61%. Comparing base (00d5f45) to head (21ea014). :warning: Report is 298 commits behind head on master. :exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6685      +/-   ##
==========================================
+ Coverage   45.37%   46.61%   +1.24%     
==========================================
  Files         688      698      +10     
  Lines       56567    48003    -8564     
==========================================
- Hits        25666    22377    -3289     
+ Misses      29298    23941    -5357     
- Partials     1603     1685      +82     
Flag Coverage Δ
unittests 46.61% <ø> (+1.24%) :arrow_up:

Flags with carried forward coverage won't be shown.

codecov-commenter avatar Aug 24 '25 23:08 codecov-commenter

Thank you @mszacillo. We can try to start it in release-1.16. /milestone v1.16

RainbowMango avatar Aug 25 '25 01:08 RainbowMango

Hi @kevin-wangzefeng, we went over this proposal (specifically going over the use-cases and the API design) during the community meeting today. I think the next step here is agreeing on the API changes, and once that is solidified, starting to determine an implementation strategy. Was hoping you could take a look, given your contributions to this feature in Kubernetes. Thank you!!

mszacillo avatar Sep 16 '25 17:09 mszacillo

Thanks for the review @RainbowMango! I wasn't sure whether to be more thorough with the implementation details - perhaps we can discuss this more in the next community meeting. I can start to draft some iteration tasks in the meantime.

mszacillo avatar Oct 07 '25 02:10 mszacillo

I was focusing on the API changes during the last review. I will leave my comments here about the code changes. I also hope to explore how Kubernetes implements its affinity/anti-affinity feature.

RainbowMango avatar Oct 09 '25 09:10 RainbowMango

Hi @XiShanYongYe-Chang @RainbowMango, are there any comments / thoughts on the proposal or implementation details?

mszacillo avatar Oct 23 '25 02:10 mszacillo

My apologies for the delay. I'm currently focused on delivering the planned features for this release cycle, along with some mentor projects that are approaching their deadlines. I will get back on this as soon as possible.

RainbowMango avatar Oct 23 '25 04:10 RainbowMango

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please ask for approval from rainbowmango. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

karmada-bot avatar Nov 24 '25 04:11 karmada-bot

Hi @RainbowMango @kevin-wangzefeng, I've made several changes to the proposal based on our conversation during the last community meeting.

  1. As Kevin mentioned, I believe we should provide a new configuration option for the scheduler that defines the labels we must index over when validating affinity rules. Other labels will simply be ignored.
  2. I've decreased the complexity of the API and provided two options. The first gives the user more control by exposing a LabelSelector as well as MatchLabelKeys (a rough sketch of this option follows below). The other option simplifies the API entirely and just turns affinity into something the user can enable; the default behavior would be to match workloads on the values of the label keys provided to the scheduler.
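
For reference, the first option could look roughly like this sketch (field names are illustrative, not the wording from the proposal; metav1 refers to k8s.io/apimachinery/pkg/apis/meta/v1):

// WorkloadAffinityTerm (option 1 sketch): the user selects peer workloads
// with a LabelSelector and/or requires specific label keys to match.
type WorkloadAffinityTerm struct {
	// LabelSelector selects the peer workloads this term matches against.
	// +optional
	LabelSelector *metav1.LabelSelector `json:"labelSelector,omitempty"`

	// MatchLabelKeys lists label keys whose values on this workload must match
	// the values on the selected peer workloads for them to be treated as peers.
	// +optional
	MatchLabelKeys []string `json:"matchLabelKeys,omitempty"`
}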

Let me know what you think.

mszacillo avatar Dec 01 '25 02:12 mszacillo

Thanks @mszacillo, I'm looking at it and will get back to you soon.

RainbowMango avatar Dec 02 '25 12:12 RainbowMango

@mszacillo @kevin-wangzefeng Could you please take a look at this API design? The design idea is explained in the comments.

// WorkloadAffinity defines inter-workload affinity and anti-affinity rules.
type WorkloadAffinity struct {
	// AffinityLabelKey declares the label key on the workload template that exposes
	// its affinity identity. The label value under this key represents the
	// workload's group identifier that other workloads can match against.
	//
	// Only a single key is supported initially. If multiple keys are needed in
	// the future, a new field (e.g., labelKeys) can be added without breaking
	// compatibility.
	//
	// The key must be a valid Kubernetes label key.
	//
	// +required
	AffinityLabelKey string `json:"affinityLabelKey"`

	// Affinity represents the inter-workload affinity scheduling rules.
	// These are hard requirements - workloads will only be scheduled to clusters that
	// satisfy the affinity term if specified.
	//
	// +optional
	Affinity *WorkloadAffinityTerm `json:"affinity,omitempty"`

	// AntiAffinity represents the inter-workload anti-affinity scheduling rules.
	// These are hard requirements - workloads will be scheduled to avoid clusters
	// where matching workloads are already scheduled.
	//
	// +optional
	AntiAffinity *WorkloadAntiAffinityTerm `json:"antiAffinity,omitempty"`

	// Note: The Affinity and AntiAffinity terms are both required to be met during scheduling.
	// If we need more flexible rules (e.g., preferred scheduling), we can consider adding
	// PreferredAffinity and PreferredAntiAffinity fields in the future.
}

// WorkloadAffinityTerm defines affinity rules for co-locating with specific workload groups.
type WorkloadAffinityTerm struct {
	// RequireSameGroup indicates the workload should co-locate with workloads
	// that share the same affinity group. When enabled, the scheduler will only
	// permit clusters where workloads with the same affinity group already exist.
	//
	// This field is required currently as more expressive selectors are not yet
	// supported. When additional selector fields are introduced in the future,
	// this field will be made optional to allow alternative affinity matching strategies.
	//
	// +kubebuilder:validation:Enum=true
	// +required
	RequireSameGroup *bool `json:"requireSameGroup"`

	// AllowBootstrap controls whether this workload can be scheduled when no workloads
	// with the same affinity group currently exist in the system.
	//
	// It defaults to true, which means the affinity requirement is relaxed for the first workload of
	// a group - if no existing workloads with the same affinity group are found, the scheduler will not
	// block scheduling. This allows bootstrapping new workload groups without
	// encountering scheduling deadlocks, providing the best user experience out of the box.
	//
	// When set to false, strict affinity enforcement is applied - the workload
	// will only be scheduled if workloads with the same affinity group already exist in the system.
	//
	// +kubebuilder:default=true
	// +optional
	AllowBootstrap *bool `json:"allowBootstrap,omitempty"`
}

// WorkloadAntiAffinityTerm defines anti-affinity rules for separating from specific workload groups.
type WorkloadAntiAffinityTerm struct {
	// RejectSameGroup indicates the workload should reject co-locating with workloads
	// that share the same affinity group. When enabled, the scheduler will reject
	// clusters where workloads with the same affinity group already exist.
	// This is a hard constraint - workloads will not be scheduled to such clusters.
	//
	// This field is required currently as more expressive selectors are not yet
	// supported. When additional selector fields are introduced in the future,
	// this field will be made optional to allow alternative anti-affinity matching strategies.
	//
	// +kubebuilder:validation:Enum=true
	// +required
	RejectSameGroup *bool `json:"rejectSameGroup"`
}

RainbowMango avatar Dec 16 '25 03:12 RainbowMango

Could you please take a look at this API design? The design idea is explained in the comments.

@RainbowMango Thanks for the API update! Reading through it just now, it seems like it covers my requirements quite well and is simple enough for new users to be able to test the feature themselves without having to alter any of the scheduler configurations.

The one potentially confusing thing for users might be the RejectSameGroup and RequireSameGroup settings, since they essentially just serve the function of "toggling on" the Affinity type. I understand why this was designed this way though, since we expect more expressive selectors to potentially be supported in the future. Perhaps we could just have these default to true if the user has already defined Affinity or AntiAffinity, so that the user doesn't have to set this additional field themselves? That's more a nit than anything though.

By the way, I mocked up an internal implementation of the feature using a simplified API that ended up quite similar to yours, actually - my WorkloadAntiAffinityTerm is:

type WorkloadAntiAffinityTerm struct {
	// Determines if the Karmada scheduler will take affinity terms into account for this workload.
	// Affinity labels are controlled via the --affinity-label-key configuration defined on the Karmada Scheduler.
	// Affinity and anti-affinity peers are determined based on matching values of the keys specified in the scheduler configuration.
	// +required
	AffinityEnabled bool `json:"affinityEnabled,omitempty"`
}

I used the original idea of a scheduler config: the scheduler parses the --affinity-label-key flag (karmada.io/group in this case) and uses it when generating the ResourceBinding, which includes the label in the ObjectReference. I reuse this label's value during scheduling via the updated cache + snapshot, which I described a bit in the proposal, and then filter out clusters using a custom plugin. I'll share a link to the implementation tomorrow morning - I need to port it over to my GitHub repo, since it's currently saved on an internal repo which I can't link to.
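
As a minimal sketch of that stamping step (illustrative only, not the POC code): extractAffinityGroupLabel is a hypothetical helper, and in the POC its result ends up on the ResourceBinding's resource reference as affinityGroupLabel, as seen in the output later in this thread.

// extractAffinityGroupLabel returns the affinity group entry (label key and
// value) to record on the ResourceBinding's resource reference, or nil if the
// workload template does not carry the configured label.
func extractAffinityGroupLabel(workloadLabels map[string]string, affinityLabelKey string) map[string]string {
	if affinityLabelKey == "" {
		return nil
	}
	if value, ok := workloadLabels[affinityLabelKey]; ok {
		return map[string]string{affinityLabelKey: value}
	}
	return nil
}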

mszacillo avatar Dec 16 '25 04:12 mszacillo

@RainbowMango Went ahead and updated my POC with your API suggestion. Did some E2E testing to confirm that this will work, which it did! :)

My truncated PropagationPolicy with the new fields:

spec:
  placement:
    workloadAffinity:
      affinityLabelKey: karmada.io/group
      antiAffinity:
        rejectSameGroup: true

When scheduling my initial FlinkDeployment which has been labeled with the affinityLabelKey:

apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  labels:
    karmada.io/group: aggregator-job   # <---- affinity label
  name: a-22-ticker-demo
  namespace: s-spaaseng
...

This generates a ResourceBinding that has parsed out the relevant affinity label key:

spec:
  clusters:
  - name: spaas-kaas-ob-dev
    replicas: 2
  placement:
    workloadAffinity:
      affinityLabelKey: karmada.io/group
      antiAffinity:
        rejectSameGroup: true
  resource:
    affinityGroupLabel:
      karmada.io/group: aggregator-job   # <---- affinity label
    apiVersion: flink.apache.org/v1beta1
    kind: FlinkDeployment
    name: a-22-ticker-demo
    namespace: s-spaaseng
    resourceVersion: "10794181"
    uid: fc39af17-d5a2-4c9b-94ee-a30fdd8edf32

During the scheduling phase, I updated the snapshot object and scheduler cache to index these ResourceBindings by their affinity label keys. Once the snapshot is generated, we reuse it in our filter plugin, which filters out clusters that violate the affinity term for the resource being scheduled. I rescheduled two more applications with the same label key to test this out, and saw the scheduler filtering out clusters that had a conflicting RB!
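
To make the cluster-filtering idea concrete, here is a minimal sketch under the same assumptions (type and function names are illustrative, not the actual POC code):

// groupIndex maps an affinity group value (e.g. "aggregator-job") to the set
// of cluster names that already host a ResourceBinding in that group.
type groupIndex map[string]map[string]bool

// violatesAntiAffinity reports whether scheduling onto clusterName would
// co-locate this binding with another binding from the same affinity group.
func violatesAntiAffinity(index groupIndex, groupValue, clusterName string) bool {
	return index[groupValue][clusterName]
}

// satisfiesAffinity reports whether clusterName already hosts a binding from
// the same affinity group; when the group is empty, allowBootstrap decides
// whether the first workload of the group may schedule anywhere.
func satisfiesAffinity(index groupIndex, groupValue, clusterName string, allowBootstrap bool) bool {
	if len(index[groupValue]) == 0 {
		return allowBootstrap
	}
	return index[groupValue][clusterName]
}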

Second application:

I1216 22:29:20.123054       1 scheduler.go:437] Start to schedule ResourceBinding(s-spaaseng/a-4-ticker-demo-flinkdeployment) as placement changed
I1216 22:29:20.123189       1 scheduler.go:605] "Begin scheduling resource binding with ClusterAffinity" resourceBinding="s-spaaseng/a-4-ticker-demo-flinkdeployment"
I1216 22:29:20.123555       1 generic_scheduler.go:142] Cluster "spaas-kaas-ob-dev" is not fit, reason: cluster violates this resource bindings anti-affinity term
I1216 22:29:20.123821       1 generic_scheduler.go:142] Cluster "spaas-kaas-tt-dev" is not fit, reason: cluster(s) had untolerated taint {cluster.karmada.io/maintenance:NoSchedule}
I1216 22:29:20.123863       1 generic_scheduler.go:89] Feasible clusters found: [spaas-kaas-pw-dev]

Third application (scheduling unsuccessful):

I1216 22:30:41.408245       1 scheduler.go:437] Start to schedule ResourceBinding(s-spaaseng/a-5-ticker-demo-flinkdeployment) as placement changed
I1216 22:30:41.408299       1 scheduler.go:605] "Begin scheduling resource binding with ClusterAffinity" resourceBinding="s-spaaseng/a-5-ticker-demo-flinkdeployment"
I1216 22:30:41.408667       1 generic_scheduler.go:142] Cluster "spaas-kaas-ob-dev" is not fit, reason: cluster violates this resource bindings anti-affinity term
I1216 22:30:41.408706       1 generic_scheduler.go:142] Cluster "spaas-kaas-pw-dev" is not fit, reason: cluster violates this resource bindings anti-affinity term
I1216 22:30:41.408785       1 generic_scheduler.go:142] Cluster "spaas-kaas-tt-dev" is not fit, reason: cluster(s) had untolerated taint {cluster.karmada.io/maintenance:NoSchedule}
I1216 22:30:41.408818       1 scheduler.go:623] ResourceBinding(s-spaaseng/a-5-ticker-demo-flinkdeployment) scheduled to clusters []
I1216 22:30:41.421980       1 scheduler.go:724] Patch schedule to ResourceBinding(s-spaaseng/a-5-ticker-demo-flinkdeployment) succeed
I1216 22:30:41.422066       1 event_handler.go:181] Ignore update event of resourceBinding s-spaaseng/a-5-ticker-demo-flinkdeployment as specification no change
I1216 22:30:41.422066       1 scheduler.go:629] "End scheduling resource binding with ClusterAffinity" resourceBinding="s-spaaseng/a-5-ticker-demo-flinkdeployment"
I1216 22:30:41.422227       1 event.go:389] "Event occurred" object="s-spaaseng/a-5-ticker-demo" fieldPath="" kind="FlinkDeployment" apiVersion="flink.apache.org/v1beta1" type="Warning" reason="ScheduleBindingFailed" message="0/3 clusters are available: 1 cluster(s) had untolerated taint {cluster.karmada.io/maintenance:NoSchedule}, 2 cluster violates this resource bindings anti-affinity term."

Changing the group value allows scheduling to succeed.

Obviously some optimizations will be necessary, but I thought this would give us some confidence that this API works quite well. Please take a look at the POC branch I've linked below; I've already started drafting some iteration tasks based on the POC implementation.

Here is the branch for reference.

mszacillo avatar Dec 16 '25 23:12 mszacillo

Perhaps we could just have these default to true if the user has already defined Affinity or AntiAffinity, so that the user doesn't have to set this additional field themselves?

Thanks for the feedback! I understand the concern about the seemingly redundant field. However, the current design prioritizes future extensibility and backward compatibility.

The key consideration is what happens when we introduce more expressive selectors in the future. If we default RequireSameGroup and RejectSameGroup to true now, we would need to change that default behavior later when alternative matching strategies become available. This could break existing user configurations in subtle ways.

Here's the scenario I'm concerned about:

  1. Today: User creates a policy with affinity: {}, and RequireSameGroup defaults to true
  2. Future: We add new selector fields such as targetGroups, and would probably need to change or remove the default.
  3. Problem: The old policy still carries the implicit RequireSameGroup: true default, which may conflict with or override the new selector semantics (see the sketch below)
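
As a sketch of that concern (targetGroups is a hypothetical future field, not part of this proposal):

# A policy written today, relying on an implicit RequireSameGroup=true default:
workloadAffinity:
  affinityLabelKey: karmada.io/group
  affinity: {}
---
# If the user later adds the new selector to the same policy, the implicit
# RequireSameGroup=true would still be in effect and could conflict with it:
workloadAffinity:
  affinityLabelKey: karmada.io/group
  affinity:
    targetGroups: ["payments"]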

By requiring users to explicitly set RequireSameGroup: true and RejectSameGroup: true now, we:

  • Make the current behavior explicit and clear in the configuration
  • Reserve the flexibility to make this field optional later without changing defaults
  • Ensure that when we add new selectors, existing policies won't have hidden implicit behaviors that interfere with the new fields

When we introduce additional selector options, we can simply make RequireSameGroup and RejectSameGroup optional, and users will be able to choose between different matching strategies without any migration concerns. This is a fully backward-compatible evolution path.

I agree it's a bit verbose for the initial use case, but I believe it's a worthwhile trade-off for long-term API stability and extensibility.

RainbowMango avatar Dec 17 '25 02:12 RainbowMango