Stateful Failover Proposal
What type of PR is this? Proposal for stateful failover
What this PR does / why we need it: Explained in the doc
Which issue(s) this PR fixes: Fixes #5006, #4969
Special notes for your reviewer: N/A
Does this PR introduce a user-facing change?: NONE
Welcome @Dyex719! It looks like this is your first PR to karmada-io/karmada 🎉
:warning: Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 40.91%. Comparing base (2271a41) to head (fd35fb4). Report is 680 commits behind head on master.
@@            Coverage Diff             @@
##           master    #5116       +/-   ##
===========================================
+ Coverage   28.21%   40.91%   +12.69%
===========================================
  Files         632      650       +18
  Lines       43556    55182    +11626
===========================================
+ Hits        12291    22575    +10284
- Misses      30368    31170      +802
- Partials      897     1437      +540

| Flag | Coverage Δ | |
|---|---|---|
| unittests | 40.91% <ø> (+12.69% :arrow_up:) | |
Thanks @RainbowMango! One assumption that we wanted to discuss (I wrote this in the proposal too):
If all the replicas of the stateful application are not migrated together, it is not clear when the state needs to be restored. In this proposal, we focus on the use case where all the replicas of a stateful application are migrated together.
Do you think this is a valid assumption? This narrows our scope a little, but defines the problem more clearly.
I 100% agree.
- Partial replica migration is not technically failover; it is more like elastic scaling of replicas.
- The conditions under which failover triggers are fully configurable. If people can tolerate the failure of part of the replicas, then migration should not be triggered; if they can't, the application should be rebuilt by leveraging failover.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from rainbowmango. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Hi @Dyex719, @mszacillo, I've been thinking about this feature recently and came up with some ideas. This feature consists of 3 parts:
The first part is how to declare which state data (fields) should be preserved during failover; I posted my idea at https://github.com/karmada-io/karmada/pull/5116#discussion_r1685365859. Please take a look.
The second part is how to store the preserved state data. One approach is to store them in the history items (this was our first idea); the challenging thing is that it's hard to maintain the history item, especially figuring out what the destination cluster is.
I think it is worth considering storing them in the GracefulEvictionTask. During the failover process, just before creating the eviction task, it makes sense to grab some state (fields) or a snapshot of the scheduled cluster list. With this grabbed data, the controller would know which cluster is the destination by comparing the snapshot with the newly scheduled clusters.
[edit] I don't mean I don't like the first approach; I'm just raising another idea. Maybe we can also add a snapshot of the scheduled clusters to the history item. The most challenging thing for this part is distinguishing which field should be managed by which component (scheduler or controller), and they should be decoupled.
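To make the GracefulEvictionTask idea concrete, here is a rough, hypothetical sketch of what an eviction task carrying grabbed state might look like inside the ResourceBinding spec (the preservedLabelState and clustersBeforeFailover field names are illustrative assumptions, not a settled API):

spec:
  gracefulEvictionTasks:
    - fromCluster: member1          # cluster the workload is being evicted from
      reason: ApplicationFailure
      producer: resource-binding-application-failover-controller
      # Hypothetical: state (fields) grabbed just before the eviction task is created
      preservedLabelState:
        cztest-replicas: "2"
      # Hypothetical: snapshot of the scheduled cluster list; comparing it with the
      # newly scheduled clusters tells the controller which cluster is the destination
      clustersBeforeFailover:
        - member1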
The third part is how to feed (or inject) the preserved state data into the destination cluster. This shouldn't be that complex as long as the controller can figure out the destination cluster and only feeds the state data when creating the application.
Hi @RainbowMango,
To address your comment:
The second part is how to store the preserved state data. One approach is to store them in the history items (this was our first idea); the challenging thing is that it's hard to maintain the history item, especially figuring out what the destination cluster is.
One thing we were thinking about is to store only the cluster that the workload failed over from. This would help keep the logic simple without involving multiple components. The cluster the workload is currently scheduled on can always be inferred from the ResourceBinding.
With this we would achieve:
spec:
  clusters:
    - name: member3
      replicas: 2
  ...
  failoverHistory:
    - failoverTime: "2024-07-18T19:03:06Z"
      originCluster: member1
      reason: "applicationFailover"
    - failoverTime: "2024-07-18T19:08:45Z"
      originCluster: member2
      reason: "clusterFailover"
The entire trace of the workload could still be inferred from the current state plus failoverHistory: the workload was originally on member1, then it migrated to member2, and then it moved to member3, which can be inferred from spec.clusters in the ResourceBinding where it is currently scheduled.
Work in progress implementation is available here: https://github.com/karmada-io/karmada/compare/master...Dyex719:karmada:stateful-failover-flag.
Let me know what you think about this! I will get back to your idea at https://github.com/karmada-io/karmada/pull/5116#discussion_r1685365859 in a bit. Thanks!
The third part is how to feed (or inject) the preserved state data into the destination cluster. This shouldn't be that complex as long as the controller can figure out the destination cluster and only feeds the state data when creating the application.
Our idea was to preserve only the metadata of the state in the ResourceBinding, rather than the actual state, which may be large and come in different formats.
As you know, we use Kyverno for fetching the actual state using the metadata persisted in the ResourceBinding. My understanding is that Kyverno is well supported by Karmada, so we could use Kyverno to do this injection. It also allows a lot of customization that users can perform to suit their needs.
Is your idea to have another component (maybe the scheduler) do this injection? My only concern is that customization may become difficult.
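For illustration, a minimal sketch of the kind of Kyverno mutate policy we have in mind, assuming the preserved metadata arrives as a label on the propagated resource (the policy name, label key, and target annotation are hypothetical placeholders, not part of the proposal):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: inject-preserved-state
spec:
  rules:
    - name: restore-state-from-label
      match:
        any:
          - resources:
              kinds:
                - Deployment
      mutate:
        patchStrategicMerge:
          metadata:
            annotations:
              # Hypothetical: surface the preserved metadata where the application
              # (or an operator) can pick it up and fetch the actual state
              restore.example.com/replicas: '{{ request.object.metadata.labels."cztest-replicas" }}'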
@mszacillo @Dyex719 I tried to design the API and asked @XiShanYongYe-Chang for a demo; the demo shows it is feasible. I hope we can demonstrate it at the next meeting, and I also hope you can help take a look at it.
API Design: https://github.com/karmada-io/karmada/compare/master...RainbowMango:karmada:api_draft_application_failover
Demo: https://github.com/karmada-io/karmada/compare/master...XiShanYongYe-Chang:karmada:api_draft_application_failover
@XiShanYongYe-Chang I see @mszacillo and @Dyex719 added an agenda to this week's community meeting, I wonder if you can show a demo during the meeting, or share the test report here.
Okay, let me show you the demo during the meeting tonight.
Let me describe the demo in words.
Create each resource according to the following configuration file:
# deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - image: nginx
          name: nginx
===================================
# pp.yaml
# Propagate the target Deployment to the member1 and member2 clusters,
# restricting it to a single cluster via spreadConstraints
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-propagation
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  placement:
    clusterAffinity:
      clusterNames:
        - member1
        - member2
    spreadConstraints:
      - spreadByField: cluster
        maxGroups: 1
        minGroups: 1
  propagateDeps: true
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 60
      purgeMode: Graciously
      statePreservation:
        rules:
          - aliasLabelName: cztest-replicas
            jsonPath: ".updatedReplicas"
===================================
# op.yaml
# Modify the Deployment image distributed to the member1 cluster via
# OverridePolicy to simulate an application failure on member1
apiVersion: policy.karmada.io/v1alpha1
kind: OverridePolicy
metadata:
  name: nginx-op
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  overrideRules:
    - targetCluster:
        clusterNames:
          - member1
      overriders:
        imageOverrider:
          - component: "Registry"
            operator: replace
            value: "fake"
Execute the following commands to deploy the above resources to the karmada control plane:
kubectl --context karmada-apiserver apply -f op.yaml
kubectl --context karmada-apiserver apply -f pp.yaml
kubectl --context karmada-apiserver apply -f deploy.yaml
The Deployment will be propagated to the member1 cluster first; due to the image error, after the 1-minute toleration period, application failover will propagate it to the member2 cluster.
Then execute the following command to watch the deployment resource on the member2 cluster:
kubectl --kubeconfig /root/.kube/members.config --context member2 get deployments.apps -oyaml -w
You will observe that the target label cztest-replicas: "2" appears in the labels of the nginx Deployment on the member2 cluster.
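For illustration, the relevant portion of the Deployment observed on member2 would look roughly like this (a sketch reconstructed from the description above, not captured output):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
    cztest-replicas: "2"   # injected from the state preserved before the failover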
Hi Hongcai and Chang,
Thanks for the heads up! I'm a little surprised that there has been a different branch created for the change - as this is something we had proposed and put up for review (for reference we currently use the failover feature in our own DEV setup), but thank you for putting so much thought into this feature.
Perhaps we can discuss the implementation process more during the meeting?
Cheers.
I'm a little surprised that there has been a different branch created for the change - as this is something we had proposed and put up for review (for reference we currently use the failover feature in our own DEV setup), but thank you for putting so much thought into this feature.
Thanks for letting me know! We discussed this proposal a lot. I want to help move forward with the feature, so I tried to refine the API design based on your idea. The new branch was just used to explain my thoughts. I'm looking forward to hearing your thoughts, and if we reach a consensus, they should be put in this proposal.
Hi @Dyex719, @mszacillo I'm reviewing this feature to identify any remaining tasks that need to be completed, and then determine where to start from. I wonder if you still want to refresh this proposal according to what we have done in the last release.
My two cents:
- This proposal should focus on application failover instead of cluster failover, which we can start another feature/proposal for.
- As for the failover history item part, I think it can also be included in the cluster failover feature.
By the way, @XiShanYongYe-Chang and I are currently working on the cluster failover feature enhancement; we will keep you updated once we work out some ideas.
Hi @RainbowMango,
I'll go ahead and make edits to the proposal to match the API changes that have been made as part of #5788. I'm happy to keep this proposal application-failover specific; in the meantime, we'll maintain a small internal commit that allows us to re-use the graceful eviction task logic for cluster failover. The commit is primarily changes to the related test files; the actual code change is gated behind the failover feature flag and is ~20 lines.
And sounds good! Please keep me updated about cluster failover design, as we are planning on using the feature.
Application failover should be part of cluster failover. We will revisit the entire feature in the near future, at which time we will re-describe it in the design doc or website documentation. So I think we can close this PR for now; however, if it can be updated to align with the current implementation, it can be reopened and merged at any time.
/close
@RainbowMango: Closed this PR.
In response to this:
Application failover should be part of cluster failover. We will revisit the entire feature in the near future, at which time we will re-describe it in the design doc or website documentation. So I think we can close this PR for now; however, if it can be updated to align with the current implementation, it can be reopened and merged at any time.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.