Stateful Failover Proposal
What type of PR is this? Proposal for stateful failover
What this PR does / why we need it: Explained in the doc
Which issue(s) this PR fixes: Fixes #5006, #4969
Special notes for your reviewer: N/A
Does this PR introduce a user-facing change?: NONE
Welcome @Dyex719! It looks like this is your first PR to karmada-io/karmada 🎉
:warning: Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 40.91%. Comparing base (2271a41) to head (fd35fb4). Report is 680 commits behind head on master.
@@            Coverage Diff             @@
##           master    #5116       +/-   ##
===========================================
+ Coverage   28.21%   40.91%   +12.69%
===========================================
  Files         632      650       +18
  Lines       43556    55182    +11626
===========================================
+ Hits        12291    22575    +10284
- Misses      30368    31170      +802
- Partials      897     1437      +540

| Flag | Coverage Δ | |
|---|---|---|
| unittests | 40.91% <ø> (+12.69% :arrow_up:) | |
Thanks @RainbowMango! One assumption that we wanted to discuss (I wrote this in the proposal too):
If all the replicas of the stateful application are not migrated together, it is not clear when the state needs to be restored. In this proposal, we focus on the use case where all the replicas of a stateful application are migrated together.
Do you think this is a valid assumption? This narrows our scope a little, but defines the problem more clearly.
I 100% agree.
- Partial replica migration is not technically failover; it is more like elastic scaling of replicas.
- The conditions under which failover triggers are fully configurable. If people can tolerate the failure of part of the replicas, then migration should not be triggered; if they can't, the application should be rebuilt by leveraging failover.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from rainbowmango. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Hi @Dyex719, @mszacillo, I've been thinking about this feature recently and came up with some ideas. This feature consists of 3 parts:
The first part is how to declare which state data (fields) should be preserved during failover; I posted my idea at https://github.com/karmada-io/karmada/pull/5116#discussion_r1685365859. Please take a look.
The second part is how to store the preserved state data. One approach is to store them in the history items (this was our first idea); the challenging thing is that it's hard to maintain the history item, especially figuring out what the destination cluster is.
I think it is worth considering storing them in the GracefulEvictionTask. During the failover process, just before creating the eviction task, it makes sense to grab some state (fields) or a snapshot of the scheduled cluster list. With this grabbed data, the controller would know which cluster is the destination by comparing the snapshot with the newly scheduled clusters.
[edit] I don't mean I don't like the first approach; I'm just raising another idea. Maybe we can also add a snapshot of the scheduled clusters to the history item. The most challenging thing for this part is distinguishing which field should be managed by which component (scheduler or controller), and they should be decoupled.
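To make the GracefulEvictionTask idea concrete, here is a rough, hypothetical sketch of what an eviction task carrying grabbed state might look like inside the ResourceBinding spec (the preservedLabelState and clustersBeforeFailover field names are illustrative assumptions, not a settled API):

spec:
  gracefulEvictionTasks:
    - fromCluster: member1          # cluster the workload is being evicted from
      reason: ApplicationFailure
      producer: resource-binding-application-failover-controller
      # Hypothetical: state (fields) grabbed just before the eviction task is created
      preservedLabelState:
        cztest-replicas: "2"
      # Hypothetical: snapshot of the scheduled cluster list; comparing it with the
      # newly scheduled clusters tells the controller which cluster is the destination
      clustersBeforeFailover:
        - member1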
The third part is how to feed (or inject) the preserved state data into the destination cluster. This shouldn't be that complex as long as the controller can figure out the destination cluster and only feeds the state data when creating the application.
Hi @RainbowMango,
To address your comment:
The second part is how to store the preserved state data. One approach is to store them in the history items (this was our first idea); the challenging thing is that it's hard to maintain the history item, especially figuring out what the destination cluster is.
One thing we were thinking about is to store only the cluster that the workload failed over from. This would help keep the logic simple without involving multiple components. The cluster the workload is currently scheduled on can always be inferred from the ResourceBinding.
With this we would achieve:
spec:
  clusters:
    - name: member3
      replicas: 2
  ...
  failoverHistory:
    - failoverTime: "2024-07-18T19:03:06Z"
      originCluster: member1
      reason: "applicationFailover"
    - failoverTime: "2024-07-18T19:08:45Z"
      originCluster: member2
      reason: "clusterFailover"
The entire trace of the workload could still be inferred from the current state plus failoverHistory: the workload was originally on member1, then it migrated to member2, and then it moved to member3, which can be inferred from spec.clusters in the ResourceBinding where it is currently scheduled.
Work in progress implementation is available here: https://github.com/karmada-io/karmada/compare/master...Dyex719:karmada:stateful-failover-flag.
Let me know what you think about this! I will get back to your idea at https://github.com/karmada-io/karmada/pull/5116#discussion_r1685365859 in a bit. Thanks!
The third part is how to feed (or inject) the preserved state data into the destination cluster. This shouldn't be that complex as long as the controller can figure out the destination cluster and only feeds the state data when creating the application.
Our idea was to preserve only the metadata of the state in the ResourceBinding, rather than the actual state, which may be large and come in different formats.
As you know, we use Kyverno for fetching the actual state using the metadata persisted in the ResourceBinding. My understanding is that Kyverno is well supported by Karmada, so we could use Kyverno to do this injection. It also allows a lot of customization that users can perform to suit their needs.
Is your idea to have another component (maybe the scheduler) do this injection? My only concern is that customization may become difficult.
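For illustration, a minimal sketch of the kind of Kyverno mutate policy we have in mind, assuming the preserved metadata arrives as a label on the propagated resource (the policy name, label key, and target annotation are hypothetical placeholders, not part of the proposal):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: inject-preserved-state
spec:
  rules:
    - name: restore-state-from-label
      match:
        any:
          - resources:
              kinds:
                - Deployment
      mutate:
        patchStrategicMerge:
          metadata:
            annotations:
              # Hypothetical: surface the preserved metadata where the application
              # (or an operator) can pick it up and fetch the actual state
              restore.example.com/replicas: '{{ request.object.metadata.labels."cztest-replicas" }}'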
@mszacillo @Dyex719 I tried to design the API and asked @XiShanYongYe-Chang for a demo; the demo shows it is feasible. I hope we can demonstrate it at the next meeting, and I also hope you can help take a look at it.
API Design: https://github.com/karmada-io/karmada/compare/master...RainbowMango:karmada:api_draft_application_failover
Demo: https://github.com/karmada-io/karmada/compare/master...XiShanYongYe-Chang:karmada:api_draft_application_failover
@XiShanYongYe-Chang I see @mszacillo and @Dyex719 added an agenda to this week's community meeting, I wonder if you can show a demo during the meeting, or share the test report here.
Okay, let me show you the demo during the meeting tonight.
Let me describe the demo in words.
Create each resource according to the following configuration file:
# deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - image: nginx
          name: nginx
===================================
# pp.yaml
# Propagate the target Deployment to the member1 and member2 clusters,
# restricting it to a single cluster via spreadConstraints
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-propagation
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  placement:
    clusterAffinity:
      clusterNames:
        - member1
        - member2
    spreadConstraints:
      - spreadByField: cluster
        maxGroups: 1
        minGroups: 1
  propagateDeps: true
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 60
      purgeMode: Graciously
      statePreservation:
        rules:
          - aliasLabelName: cztest-replicas
            jsonPath: ".updatedReplicas"
===================================
# op.yaml
# Modify the Deployment image distributed to the member1 cluster via
# OverridePolicy to simulate an application failure on member1
apiVersion: policy.karmada.io/v1alpha1
kind: OverridePolicy
metadata:
  name: nginx-op
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  overrideRules:
    - targetCluster:
        clusterNames:
          - member1
      overriders:
        imageOverrider:
          - component: "Registry"
            operator: replace
            value: "fake"
Execute the following commands to deploy the above resources to the karmada control plane:
kubectl --context karmada-apiserver apply -f op.yaml
kubectl --context karmada-apiserver apply -f pp.yaml
kubectl --context karmada-apiserver apply -f deploy.yaml
The Deployment will be propagated to the member1 cluster first; due to the image error, after the 1-minute toleration period, application failover will propagate it to the member2 cluster.
Then execute the following command to watch the deployment resource on the member2 cluster:
kubectl --kubeconfig /root/.kube/members.config --context member2 get deployments.apps -oyaml -w
You will observe that the target label cztest-replicas: "2" appears in the labels of the nginx Deployment on the member2 cluster.
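For illustration, the relevant portion of the Deployment observed on member2 would look roughly like this (a sketch reconstructed from the description above, not captured output):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
    cztest-replicas: "2"   # injected from the state preserved before the failover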
Hi Hongcai and Chang,
Thanks for the heads up! I'm a little surprised that there has been a different branch created for the change - as this is something we had proposed and put up for review (for reference we currently use the failover feature in our own DEV setup), but thank you for putting so much thought into this feature.
Perhaps we can discuss the implementation process more during the meeting?
Cheers.
I'm a little surprised that there has been a different branch created for the change - as this is something we had proposed and put up for review (for reference we currently use the failover feature in our own DEV setup), but thank you for putting so much thought into this feature.
Thanks for letting me know! We discussed this proposal a lot. I want to help move forward with the feature, so I tried to refine the API design based on your idea. The new branch was just used to explain my thoughts. I'm looking forward to hearing your thoughts, and if we reach a consensus, they should be put in this proposal.
Hi @Dyex719, @mszacillo I'm reviewing this feature to identify any remaining tasks that need to be completed, and then determine where to start from. I wonder if you still want to refresh this proposal according to what we have done in the last release.
My two cents:
- This proposal should focus on application failover instead of cluster failover, which we can start another feature/proposal for.
- As for the failover history item part, I think it can also be included in the cluster failover feature.
By the way, @XiShanYongYe-Chang and I are currently working on the cluster failover feature enhancement; we will keep you updated once we work out some ideas.
Hi @RainbowMango,
I'll go ahead and make edits to the proposal to match the API changes that have been made as part of #5788. I'm happy to keep this proposal application-failover specific; in the meantime, we'll maintain a small internal commit that allows us to re-use the graceful eviction task logic for cluster failover. The commit is primarily changes to the related test files; the actual code change is gated behind the failover feature flag and is ~20 lines.
And sounds good! Please keep me updated about cluster failover design, as we are planning on using the feature.
Application failover should be part of cluster failover. We will revisit the entire feature in the near future, at which time we will re-describe it in the design doc or website documentation. So I think we can close this PR for now; however, if it can be updated to align with the current implementation, it can be reopened and merged at any time.
/close
@RainbowMango: Closed this PR.
In response to this:
Application failover should be part of cluster failover. We will revisit the entire feature in the near future, at which time we will re-describe it in the design doc or website documentation. So I think we can close this PR for now; however, if it can be updated to align with the current implementation, it can be reopened and merged at any time.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.