Adding proposal for workload affinity and anti-affinity support
What type of PR is this? /kind feature
What this PR does / why we need it: This PR submits a proposal for adding workload affinity and workload anti-affinity as rules that the user can define on the PropagationPolicy. These rules would allow Karmada to support more complex scheduling challenges such as:
- Spreading duplicate FlinkDeployment pipelines across clusters to guarantee HA
- Scheduling many smaller training jobs to be co-located in same cluster
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 46.61%. Comparing base (00d5f45) to head (21ea014).
:warning: Report is 298 commits behind head on master.
@@ Coverage Diff @@
## master #6685 +/- ##
==========================================
+ Coverage 45.37% 46.61% +1.24%
==========================================
Files 688 698 +10
Lines 56567 48003 -8564
==========================================
- Hits 25666 22377 -3289
+ Misses 29298 23941 -5357
- Partials 1603 1685 +82
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 46.61% <ø> (+1.24%) :arrow_up: | |
Flags with carried forward coverage won't be shown.
Thank you @mszacillo. We can try to start it in release-1.16. /milestone v1.16
Hi @kevin-wangzefeng, we went over this proposal (specifically going over the use-cases and the API design) during the community meeting today. I think the next step here is agreeing on the API changes, and once that is solidified, starting to determine an implementation strategy. Was hoping you could take a look, given your contributions to this feature in Kubernetes. Thank you!!
Thanks for the review @RainbowMango! I wasn't sure whether to be more thorough with the implementation details - perhaps we can discuss this more in the next community meeting. I can start to draft some iteration tasks in the meantime.
I was focusing on the API changes during the last review; I will leave my comments on the code changes here. I also hope to explore how Kubernetes implements the affinity/anti-affinity feature.
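For reference, Kubernetes expresses inter-pod affinity/anti-affinity through PodAffinityTerm. Below is an abridged sketch of its long-standing fields (from k8s.io/api/core/v1); newer fields such as matchLabelKeys are omitted here.

package v1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// PodAffinityTerm (abridged) selects a set of pods (via labelSelector over namespaces)
// that this pod should be co-located with (affinity) or kept away from (anti-affinity),
// relative to the topology defined by topologyKey.
type PodAffinityTerm struct {
	LabelSelector *metav1.LabelSelector `json:"labelSelector,omitempty"`
	Namespaces    []string              `json:"namespaces,omitempty"`
	TopologyKey   string                `json:"topologyKey"`
}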
Hi @XiShanYongYe-Chang @RainbowMango, are there any comments / thoughts on the proposal or implementation details?
My apologies for the delay. I'm currently focused on delivering the planned features for this release cycle, along with some mentor projects that are approaching their deadlines. I will get back on this as soon as possible.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from rainbowmango. For more information see the Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Hi @RainbowMango @kevin-wangzefeng, I've made several changes to the proposal based on our conversation during the last community meeting.
- As Kevin mentioned, I believe we should provide a new config option that can be passed to the scheduler to define the labels we must index over when validating affinity rules; other labels would simply be ignored (see the sketch after this list).
- I've reduced the complexity of the API and provided two options. The first gives the user more control to define a LabelSelector as well as MatchLabelKeys. The other simplifies the API entirely and just turns affinity into something the user can enable; the default behavior would be to match workloads on the values of the label keys provided to the scheduler.
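Purely as an illustration, here is a minimal sketch of how such a scheduler option might be wired up with pflag; the Options struct and usage text are hypothetical, and the --affinity-label-key flag name is borrowed from the POC described later in this thread.

package options

import "github.com/spf13/pflag"

// Options holds the hypothetical scheduler-level configuration for workload affinity.
type Options struct {
	// AffinityLabelKey is the workload label key the scheduler indexes over when
	// evaluating workload affinity/anti-affinity rules; other labels are ignored.
	AffinityLabelKey string
}

// AddFlags registers the affinity-related flag on the given flag set.
func (o *Options) AddFlags(fs *pflag.FlagSet) {
	fs.StringVar(&o.AffinityLabelKey, "affinity-label-key", "",
		"Label key used to group workloads for affinity/anti-affinity matching.")
}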
Let me know what you think.
Thanks @mszacillo , I'm looking at it and will get back to you soon.
@mszacillo @kevin-wangzefeng Could you please take a look at this API design? The design idea is explained in the comments.
// WorkloadAffinity defines inter-workload affinity and anti-affinity rules.
type WorkloadAffinity struct {
// AffinityLabelKey declares the label key on the workload template that exposes
// its affinity identity. The label value under this key represents the
// workload's group identifier that other workloads can match against.
//
// Only a single key is supported initially. If multiple keys are needed in
// the future, a new field (e.g., labelKeys) can be added without breaking
// compatibility.
//
// The key must be a valid Kubernetes label key.
//
// +required
AffinityLabelKey string `json:"affinityLabelKey"`
// Affinity represents the inter-workload affinity scheduling rules.
// These are hard requirements - workloads will only be scheduled to clusters that
// satisfy the affinity term if specified.
//
// +optional
Affinity *WorkloadAffinityTerm `json:"affinity,omitempty"`
// AntiAffinity represents the inter-workload anti-affinity scheduling rules.
// These are hard requirements - workloads will be scheduled to avoid clusters
// where matching workloads are already scheduled.
//
// +optional
AntiAffinity *WorkloadAntiAffinityTerm `json:"antiAffinity,omitempty"`
// Note: The Affinity and AntiAffinity terms are both required to be met during scheduling.
// If we need more flexible rules (e.g., preferred scheduling), we can consider adding
// PreferredAffinity and PreferredAntiAffinity fields in the future.
}
// WorkloadAffinityTerm defines affinity rules for co-locating with specific workload groups.
type WorkloadAffinityTerm struct {
// RequireSameGroup indicates the workload should co-locate with workloads
// that share the same affinity group. When enabled, the scheduler will only
// permit clusters where workloads with the same affinity group already exist.
//
// This field is required currently as more expressive selectors are not yet
// supported. When additional selector fields are introduced in the future,
// this field will be made optional to allow alternative affinity matching strategies.
//
// +kubebuilder:validation:Enum=true
// +required
RequireSameGroup *bool `json:"requireSameGroup"`
// AllowBootstrap controls whether this workload can be scheduled when no workloads
// with the same affinity group currently exist in the system.
//
// It defaults to true, which means the affinity requirement is relaxed for the first workload of
// a group - if no existing workloads with the same affinity group are found, the scheduler will not
// block scheduling. This allows bootstrapping new workload groups without
// encountering scheduling deadlocks, providing the best user experience out of the box.
//
// When set to false, strict affinity enforcement is applied - the workload
// will only be scheduled if workloads with the same affinity group already exist in the system.
//
// +kubebuilder:default=true
// +optional
AllowBootstrap *bool `json:"allowBootstrap,omitempty"`
}
// WorkloadAntiAffinityTerm defines anti-affinity rules for separating from specific workload groups.
type WorkloadAntiAffinityTerm struct {
// RejectSameGroup indicates the workload should reject co-locating with workloads
// that share the same affinity group. When enabled, the scheduler will reject
// clusters where workloads with the same affinity group already exist.
// This is a hard constraint - workloads will not be scheduled to such clusters.
//
// This field is required currently as more expressive selectors are not yet
// supported. When additional selector fields are introduced in the future,
// this field will be made optional to allow alternative anti-affinity matching strategies.
//
// +kubebuilder:validation:Enum=true
// +required
RejectSameGroup *bool `json:"rejectSameGroup"`
}
@RainbowMango Thanks for the API update! Reading through it just now, it seems like it covers my requirements quite well and is simple enough for new users to be able to test the feature themselves without having to alter any of the scheduler configurations.
The one potentially confusing thing for users might be the RejectSameGroup and RequireSameGroup settings, since they essentially just serve to "toggle on" the Affinity type. I understand why this was designed this way though, since we expect more expressive selectors to potentially be supported in the future. Perhaps we could just have these default to true if the user has already defined Affinity or AntiAffinity, so that the user doesn't have to set this additional field themselves? That's more of a nit than anything though.
By the way, I mocked up an internal implementation of the feature using a simplified API that actually ended up quite similar to yours - my WorkloadAntiAffinityTerm is:
type WorkloadAntiAffinityTerm struct {
// Determines if Karmada scheduler will take affinity terms into account for this workload.
// Affinity labels are controlled via the --affinity-label-key configuration defined on the Karmada Scheduler.
// Affinity and anti-affinity peers are determined based off matching values of the keys specified in the scheduler configuration.
// +required
AffinityEnabled bool `json:"affinityEnabled,omitempty"`
}
I used the original idea of a scheduler config: the scheduler parses the --affinity-label-key flag (karmada.io/group in this case) and uses it when generating the ResourceBinding, which includes the label in the ObjectReference. I reuse this label's value during scheduling via the updated cache + snapshot, which I described a bit in the proposal, and then filter out clusters using a custom plugin. I'll share a link to the implementation tomorrow morning - I need to port it over to my GitHub repo since it's currently saved in an internal repo which I can't link to.
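As a rough sketch of that label-extraction step under the same assumptions (the helper name below is hypothetical, not taken from the POC branch):

package util

// ExtractAffinityGroupLabel returns the affinity group label (key plus value) to record
// on the ResourceBinding's resource reference, based on the scheduler's configured
// --affinity-label-key. It returns nil when the workload does not carry that key.
func ExtractAffinityGroupLabel(workloadLabels map[string]string, affinityLabelKey string) map[string]string {
	if affinityLabelKey == "" || workloadLabels == nil {
		return nil
	}
	value, ok := workloadLabels[affinityLabelKey]
	if !ok {
		return nil
	}
	return map[string]string{affinityLabelKey: value}
}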
@RainbowMango Went ahead and updated my POC with your API suggestion. Did some E2E testing to confirm that this works - and it did! :)
My truncated PropagationPolicy with the new fields:
spec:
placement:
workloadAffinity:
affinityLabelKey: karmada.io/group
antiAffinity:
rejectSameGroup: true
When scheduling my initial FlinkDeployment which has been labeled with the affinityLabelKey:
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
labels:
karmada.io/group: aggregator-job <---- affinity label
name: a-22-ticker-demo
namespace: s-spaaseng
...
This generates a ResourceBinding that has parsed out the relevant affinity label key:
spec:
clusters:
- name: spaas-kaas-ob-dev
replicas: 2
placement:
workloadAffinity:
affinityLabelKey: karmada.io/group
antiAffinity:
rejectSameGroup: true
resource:
affinityGroupLabel:
karmada.io/group: aggregator-job <---- affinity label
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
name: a-22-ticker-demo
namespace: s-spaaseng
resourceVersion: "10794181"
uid: fc39af17-d5a2-4c9b-94ee-a30fdd8edf32
During the scheduling phase, I updated the snapshot object and scheduler cache to index these ResourceBindings by their affinity label keys. Once the snapshot is generated, it is reused by our filter plugin, which filters out clusters that violate the affinity term for the resource being scheduled. I then scheduled two more applications with the same label key to test this out, and saw the scheduler filter out clusters that had a conflicting RB! (A minimal sketch of this index-and-filter logic follows the logs below.)
Second application:
I1216 22:29:20.123054 1 scheduler.go:437] Start to schedule ResourceBinding(s-spaaseng/a-4-ticker-demo-flinkdeployment) as placement changed
I1216 22:29:20.123189 1 scheduler.go:605] "Begin scheduling resource binding with ClusterAffinity" resourceBinding="s-spaaseng/a-4-ticker-demo-flinkdeployment"
I1216 22:29:20.123555 1 generic_scheduler.go:142] Cluster "spaas-kaas-ob-dev" is not fit, reason: cluster violates this resource bindings anti-affinity term
I1216 22:29:20.123821 1 generic_scheduler.go:142] Cluster "spaas-kaas-tt-dev" is not fit, reason: cluster(s) had untolerated taint {cluster.karmada.io/maintenance:NoSchedule}
I1216 22:29:20.123863 1 generic_scheduler.go:89] Feasible clusters found: [spaas-kaas-pw-dev]
Third application (scheduling unsuccessful):
I1216 22:30:41.408245 1 scheduler.go:437] Start to schedule ResourceBinding(s-spaaseng/a-5-ticker-demo-flinkdeployment) as placement changed
I1216 22:30:41.408299 1 scheduler.go:605] "Begin scheduling resource binding with ClusterAffinity" resourceBinding="s-spaaseng/a-5-ticker-demo-flinkdeployment"
I1216 22:30:41.408667 1 generic_scheduler.go:142] Cluster "spaas-kaas-ob-dev" is not fit, reason: cluster violates this resource bindings anti-affinity term
I1216 22:30:41.408706 1 generic_scheduler.go:142] Cluster "spaas-kaas-pw-dev" is not fit, reason: cluster violates this resource bindings anti-affinity term
I1216 22:30:41.408785 1 generic_scheduler.go:142] Cluster "spaas-kaas-tt-dev" is not fit, reason: cluster(s) had untolerated taint {cluster.karmada.io/maintenance:NoSchedule}
I1216 22:30:41.408818 1 scheduler.go:623] ResourceBinding(s-spaaseng/a-5-ticker-demo-flinkdeployment) scheduled to clusters []
I1216 22:30:41.421980 1 scheduler.go:724] Patch schedule to ResourceBinding(s-spaaseng/a-5-ticker-demo-flinkdeployment) succeed
I1216 22:30:41.422066 1 event_handler.go:181] Ignore update event of resourceBinding s-spaaseng/a-5-ticker-demo-flinkdeployment as specification no change
I1216 22:30:41.422066 1 scheduler.go:629] "End scheduling resource binding with ClusterAffinity" resourceBinding="s-spaaseng/a-5-ticker-demo-flinkdeployment"
I1216 22:30:41.422227 1 event.go:389] "Event occurred" object="s-spaaseng/a-5-ticker-demo" fieldPath="" kind="FlinkDeployment" apiVersion="flink.apache.org/v1beta1" type="Warning" reason="ScheduleBindingFailed" message="0/3 clusters are available: 1 cluster(s) had untolerated taint {cluster.karmada.io/maintenance:NoSchedule}, 2 cluster violates this resource bindings anti-affinity term."
Changing the group value allows scheduling to succeed.
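To make the cache/snapshot plus filter idea concrete, here is a minimal, self-contained sketch; the type and function names are illustrative and do not mirror Karmada's actual scheduler framework interfaces or the POC code.

package plugin

// affinityIndex maps an affinity group value to the set of cluster names that already
// host a ResourceBinding belonging to that group; it is rebuilt whenever the scheduler
// snapshot is taken.
type affinityIndex map[string]map[string]bool

// violatesAntiAffinity reports whether placing a workload of the given group on the
// given cluster would break a rejectSameGroup anti-affinity term, i.e. whether a peer
// of the same group is already bound to that cluster.
func violatesAntiAffinity(idx affinityIndex, group, cluster string) bool {
	peers, ok := idx[group]
	if !ok {
		return false // first workload of the group: nothing to conflict with
	}
	return peers[cluster]
}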
Obviously some optimizations will be necessary, but I thought this would give us some confidence that the API works quite well. Please take a look at the POC branch linked below; I've already started drafting some iteration tasks based on the POC implementation.
Here is the branch for reference.
Perhaps we could just have these default to true if the user has already defined Affinity or AntiAffinity, so that the user doesn't have to set this additional field themselves?
Thanks for the feedback! I understand the concern about the seemingly redundant field. However, the current design prioritizes future extensibility and backward compatibility.
The key consideration is what happens when we introduce more expressive selectors in the future. If we default RequireSameGroup and RejectSameGroup to true now, we would need to change that default behavior later when alternative matching strategies become available. This could break existing user configurations in subtle ways.
Here's the scenario I'm concerned about:
- Today: a user creates a policy with affinity: {}, and RequireSameGroup defaults to true.
- Future: we add new selector fields like targetGroups, and we would probably change or remove the default setting.
- Problem: the old policy still carries the implicit RequireSameGroup: true default, which may conflict with or override the new selector semantics.
By requiring users to explicitly set RequireSameGroup: true and RejectSameGroup: true now, we:
- Make the current behavior explicit and clear in the configuration
- Reserve the flexibility to make this field optional later without changing defaults
- Ensure that when we add new selectors, existing policies won't have hidden implicit behaviors that interfere with the new fields
When we introduce additional selector options, we can simply make RequireSameGroup and RejectSameGroup optional, and users will be able to choose between different matching strategies without any migration concerns. This is a fully backward-compatible evolution path.
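As an illustration of that evolution path (a sketch only, not a committed design; targetGroups is just the hypothetical example from the scenario above):

// WorkloadAffinityTerm as it could look after the evolution: RequireSameGroup becomes
// optional, and an alternative selector such as TargetGroups can be chosen instead.
type WorkloadAffinityTerm struct {
	// RequireSameGroup keeps today's explicit semantics for existing policies.
	// +optional
	RequireSameGroup *bool `json:"requireSameGroup,omitempty"`

	// TargetGroups would match against explicitly listed affinity group values
	// instead of the workload's own group.
	// +optional
	TargetGroups []string `json:"targetGroups,omitempty"`

	// +kubebuilder:default=true
	// +optional
	AllowBootstrap *bool `json:"allowBootstrap,omitempty"`
}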
I agree it's a bit verbose for the initial use case, but I believe it's a worthwhile trade-off for long-term API stability and extensibility.