Add failover history information

Dyex719 opened this pull request 1 year ago · 9 comments

What type of PR is this? /kind feature

What this PR does / why we need it: Adding failover history information so that applications can keep a record of what failovers happened in the past. Stateful applications can use this information to detect failures and continue processing from a particular state

Which issue(s) this PR fixes: Fixes #5116 #4969

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

Dyex719 avatar Jul 25 '24 23:07 Dyex719

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign rainbowmango for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

karmada-bot avatar Jul 25 '24 23:07 karmada-bot

Hi All, I have implemented and tested the failover information feature and wanted to request a review from the community on what changes might be required. I notice that there are some minor linter issues that I will fix. I have also addressed the failing test case in the comment above.

Thank you!

Dyex719 avatar Jul 26 '24 13:07 Dyex719

Hi @RainbowMango! Could we get a review of this PR when you get a chance? Failures are due to the test case described above in the comments. We can decide how to tackle this moving forward. Greatly appreciate it!

mszacillo avatar Sep 06 '24 17:09 mszacillo

Sure. And sorry for letting this sit again!

RainbowMango avatar Sep 07 '24 08:09 RainbowMango

Hi @RainbowMango,

I've taken a closer look at your demo branch (https://github.com/karmada-io/karmada/compare/master...XiShanYongYe-Chang:karmada:api_draft_application_failover). I don't have any major complaints from the API side. It seems that logically this effort can be divided into two parts:

  1. Adding FailoverHistoryItem to the resource binding API (which is handled in this PR).
  2. Adding StatePreservation to the propagationpolicy API (which has been done in your branch).

Would you be open to dividing up the work as such? The StatePreservation work can be merged after the FailoverHistoryItem is added. I've already updated the FailoverHistoryItem API to align better with the proposed changes in your branch, but I've left the StatePreservation out as Chang seems to have implemented that already and I wouldn't consider it fair to copy his work under our PR. :)
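For reference, a rough sketch of the shape being discussed (illustrative only; the field names are inferred from the YAML examples later in this thread and from Karmada's existing TargetCluster type, and are not the final API):

package v1alpha2

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// FailoverHistoryItem records a single failover of the workload referenced by
// a ResourceBinding. Sketch only; not the final API.
type FailoverHistoryItem struct {
	// Reason records why the failover was triggered, e.g. ApplicationFailure.
	Reason string `json:"reason"`
	// StartTime is when the failover was triggered.
	StartTime metav1.Time `json:"startTime"`
	// ClustersBeforeFailover is the placement before the failover.
	ClustersBeforeFailover []TargetCluster `json:"clustersBeforeFailover,omitempty"`
	// ClustersAfterFailover is the placement after rescheduling.
	ClustersAfterFailover []TargetCluster `json:"clustersAfterFailover,omitempty"`
}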

Note: I haven't updated the codegen for this PR yet so there are some failing checks. Will get that sorted.

mszacillo avatar Oct 16 '24 22:10 mszacillo

Follow-up question related to the proposed changes for state preservation: are we planning on supporting this only for applications that fail over gracefully?

In our use-case we use Immediately, since some of our users have strict only-once processing requirements. So when a failover occurs, we want to remove the application as fast as possible to reduce the likelihood of duplicate applications. Perhaps the state preservation could be kept as part of the status of the resourcebinding and referenced during the scheduling step, similar to how we handle the FailoverHistoryItem in our PR.

mszacillo avatar Oct 16 '24 22:10 mszacillo

Codecov Report

Attention: Patch coverage is 65.00000% with 21 lines in your changes missing coverage. Please review.

Project coverage is 42.35%. Comparing base (1cd75f4) to head (2d434bd). Report is 45 commits behind head on master.

Files with missing lines | Patch % | Lines
pkg/scheduler/scheduler.go | 55.00% | 9 Missing
pkg/controllers/utils/common.go | 75.00% | 5 Missing, 2 Partials
...tionfailover/rb_application_failover_controller.go | 37.50% | 5 Missing

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5251      +/-   ##
==========================================
+ Coverage   41.65%   42.35%   +0.69%     
==========================================
  Files         653      657       +4     
  Lines       55533    55921     +388     
==========================================
+ Hits        23135    23686     +551     
+ Misses      30894    30715     -179     
- Partials     1504     1520      +16     
Flag | Coverage Δ
unittests | 42.35% <65.00%> (+0.69%)


codecov-commenter avatar Oct 19 '24 03:10 codecov-commenter

It seems that logically this effort can be divided into two parts: 1. Adding FailoverHistoryItem to the resource binding API (which is handled in this PR). 2. Adding StatePreservation to the propagationpolicy API (which has been done in your branch).

It makes sense to me.

I've already updated the FailoverHistoryItem API to align better with the proposed changes in your branch, but I've left the StatePreservation out as Chang seems to have implemented that already and I wouldn't consider it fair to copy his work under our PR. :)

As I explained at https://github.com/karmada-io/karmada/pull/5116#issuecomment-2418694001, the code in my branch is just a demo demonstrating the general idea, and it's all based on your idea. So we would be happy for it to be included in this proposal.

RainbowMango avatar Oct 21 '24 07:10 RainbowMango

Follow-up question related to the proposed changes for state preservation: are we planning on supporting this only for applications that fail over gracefully?

I'm still thinking about this; I believe cluster failover also requires this configuration.

In our use-case we use Immediately, since some of our users have strict only-once processing requirements. So when a failover occurs, we want to remove the application as fast as possible to reduce the likelihood of duplicate applications.

Glad to know the use case. When designing PurgeMode, there was a discussion about whether any user would want to remove the old application immediately; now I think this is a clear use case for it.

RainbowMango avatar Oct 21 '24 08:10 RainbowMango

/retest Seems the failing test is unrelated. https://github.com/karmada-io/karmada/actions/runs/11465282263/job/31903724746?pr=5251

STEP: Creating Remedy(remedy-8rs8j) @ 10/22/24 17:50:43.218
  STEP: update Cluster(member1) ServiceDomainNameResolutionReady condition to false @ 10/22/24 17:50:43.225
  STEP: wait Cluster(member1) status has TrafficControl RemedyAction @ 10/22/24 17:50:43.246
  STEP: remove Remedy resource @ 10/22/24 17:50:43.26
  STEP: wait Cluster(member1) status doesn't has TrafficControl RemedyAction @ 10/22/24 17:50:43.266
  [FAILED] in [It] - /home/runner/work/karmada/karmada/test/e2e/framework/cluster.go:308 @ 10/22/24 17:57:43.267
  << Timeline

  [FAILED] Timed out after 420.001s.
  Expected
      <bool>: false
  to equal
      <bool>: true
  In [It] at: /home/runner/work/karmada/karmada/test/e2e/framework/cluster.go:308 @ 10/22/24 17:57:43.267

  Full Stack Trace
    github.com/karmada-io/karmada/test/e2e/framework.WaitClusterFitWith({0x583bfc0, 0xc000c29c20}, {0xc000726fc9, 0x7}, 0x5358650)
    	/home/runner/work/karmada/karmada/test/e2e/framework/cluster.go:308 +0x23e
    github.com/karmada-io/karmada/test/e2e.init.func10.1.3.4()
    	/home/runner/work/karmada/karmada/test/e2e/remedy_test.go:121 +0x78
    github.com/karmada-io/karmada/test/e2e.init.func10.1.3()
    	/home/runner/work/karmada/karmada/test/e2e/remedy_test.go:120 +0x3f1

RainbowMango avatar Oct 23 '24 07:10 RainbowMango

Hi @RainbowMango, I updated the FailoverHistory API definition to remove the fromCluster definition as it's a bit redundant. We can use the clustersBeforeFailover field as is. I've also removed the failover label from this PR and will open a follow-up PR with that implementation after this one.

The failover history should now cover various use-cases. I did some e2e tests:

  1. With replicas failing over together, from cluster B to A and then from cluster A to B.
  failoverHistory:
  - clustersAfterFailover:
    - name: cluster-A
      replicas: 3
    clustersBeforeFailover:
    - name: cluster-B
      replicas: 3
    reason: ApplicationFailure
    startTime: "2024-10-25T22:25:39Z"
  - clustersAfterFailover:
    - name: cluster-B
      replicas: 3
    clustersBeforeFailover:
    - name: cluster-A
      replicas: 3
    reason: ApplicationFailure
    startTime: "2024-10-25T22:26:10Z"
  2. Divided replicas failing over (reschedule fails since we filter out clusters from before failover, and I was only using 2 member clusters for tests).
  failoverHistory:
  - clustersAfterFailover: []
    clustersBeforeFailover:
    - name: spaas-ob-qa02
      replicas: 2
    - name: spaas-pw-qa02
      replicas: 1
    reason: ApplicationFailure
    startTime: "2024-10-25T23:19:16Z"

mszacillo avatar Oct 26 '24 00:10 mszacillo

Will need to check the failing e2e test "reschedule when policy is dynamic weight schedule type"; it seems the filtering step I added might have broken this.

mszacillo avatar Oct 26 '24 00:10 mszacillo

I updated the FailoverHistory API definition to remove the fromCluster definition as it's a bit redundant. We can use the clustersBeforeFailover field as is.

Fair enough. From ClustersBeforeFailover and ClustersAfterFailover we can easily figure out the failover direction.

For example, we can have a helper method on FailoverHistoryItem like this:

// FailoverFrom returns the clusters that are present before the failover but
// absent afterwards, i.e. the clusters the workload failed over from.
func (history *FailoverHistoryItem) FailoverFrom() []string {
	before := sets.Set[string]{}
	after := sets.Set[string]{}

	for _, cluster := range history.ClustersBeforeFailover {
		before.Insert(cluster.Name)
	}

	for _, cluster := range history.ClustersAfterFailover {
		after.Insert(cluster.Name)
	}

	return before.Difference(after).UnsortedList()
}

Note that the FailoverFrom method above returns a slice of clusters, which supports the case of failing over from multiple clusters at the same time.
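And a hypothetical usage of that helper (the binding variable and the FailoverHistory/Reason/StartTime fields are assumptions for illustration, not the final API):

for _, item := range binding.Status.FailoverHistory {
	// Print which clusters the workload failed over from, when, and why.
	fmt.Printf("failed over from %v at %v (reason: %s)\n",
		item.FailoverFrom(), item.StartTime, item.Reason)
}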

RainbowMango avatar Oct 26 '24 03:10 RainbowMango

Divided replicas failing over (reschedule fails since we filter out clusters from before failover, and I was only using 2 member clusters for tests).

My two cents: I think the history should only hold successful failovers. Failed failovers should not have a substantial impact on the system and also should not be included in the history. What do you think?

If this is the case, we may need to consider recording the history when cleaning the eviction task. More work to do, such as:

  • Ensure all types of failover start with an eviction task (no matter what the PurgeMode is).
  • Record the history only when cleaning the eviction task, and only record the successful ones.

RainbowMango avatar Oct 26 '24 04:10 RainbowMango

My two cents: I think the history should only hold successful failovers. Failed failovers should not have a substantial impact on the system and also should not be included in the history. What do you think?

I think that's fair; we can apply the full failover history once scheduling has been decided.

If this is the case, we may need to consider recording the history when cleaning the eviction task. More work to do, such as:

  • Ensure all types of failover start with an eviction task (no matter what the PurgeMode is).
  • Record the history only when cleaning the eviction task, and only record the successful ones.

I could use the eviction task, but it would need to become more general to encompass Immediately in addition to the already supported Gracefully PurgeMode. Given that the eviction tasks themselves are called GracefulEvictionTasks, this might need more thought and refactoring which I'd argue is starting to move out of scope of this PR. The other issue is that if the gracePeriod is too low, the eviction task will be cleared before scheduling takes place, meaning we won't be able to construct the failoverHistory.

What I would suggest is we could use a read-write safe failover cache which stores the ResourceBinding's Name/Namespace as a key, with the value being [][]TargetCluster (a rough sketch follows the list below). How it would work:

  • When a failover is triggered, add the ResourceBinding to the cache and append []TargetCluster representing the current evicted clusters.
  • During scheduling, check the cache for the ResourceBinding and take the most recent []TargetCluster to represent the clustersBeforeFailover. Clear the entry of the cache if that is the last item in the slice.
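A minimal sketch of such a cache, just to make the idea concrete (the type and method names are assumptions, not part of this PR):

import (
	"sync"

	"k8s.io/apimachinery/pkg/types"

	workv1alpha2 "github.com/karmada-io/karmada/pkg/apis/work/v1alpha2"
)

// failoverCache is a read-write safe map from a ResourceBinding's
// namespace/name to the stack of cluster snapshots evicted by failovers.
type failoverCache struct {
	mu      sync.RWMutex
	entries map[types.NamespacedName][][]workv1alpha2.TargetCluster
}

// Append records the clusters evicted by a newly triggered failover.
func (c *failoverCache) Append(key types.NamespacedName, evicted []workv1alpha2.TargetCluster) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = append(c.entries[key], evicted)
}

// Latest returns the most recent evicted-cluster snapshot for the binding,
// clearing the entry once the last snapshot has been consumed.
func (c *failoverCache) Latest(key types.NamespacedName) ([]workv1alpha2.TargetCluster, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	snaps := c.entries[key]
	if len(snaps) == 0 {
		return nil, false
	}
	latest := snaps[len(snaps)-1]
	if len(snaps) == 1 {
		delete(c.entries, key)
	} else {
		c.entries[key] = snaps[:len(snaps)-1]
	}
	return latest, true
}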

I still have to think over edge cases, but let me know your ideas.

mszacillo avatar Oct 26 '24 07:10 mszacillo

Will need to check the failing e2e test "reschedule when policy is dynamic weight schedule type"; it seems the filtering step I added might have broken this.

Maybe you can try rebasing the PR; I see this PR was based on code from 2 weeks ago, and the test cases change over time.

RainbowMango avatar Oct 26 '24 07:10 RainbowMango

I could use the eviction task, but it would need to become more general to encompass Immediately in addition to the already supported Gracefully PurgeMode. Given that the eviction tasks themselves are called GracefulEvictionTasks, this might need more thought and refactoring which I'd argue is starting to move out of scope of this PR.

Yes, it is beyond the scope of this PR. I'm still investigating it. Inviting @XiShanYongYe-Chang here to confirm if there are any other concerns regarding ensuring all types of failover start with an eviction task (no matter what the PurgeMode is).

The other issue is that if the gracePeriod is too low, the eviction task will be cleared before scheduling takes place, meaning we won't be able to construct the failoverHistory.

Good point!

RainbowMango avatar Oct 26 '24 08:10 RainbowMango

What I would suggest is we could use a read-write safe failover cache which stores the ResourceBinding's Name/Namespace as a key, with the value being [][]TargetCluster.

  1. Compared to the cache, I prefer using the API to record the state, because it's a simple pattern and we don't have to worry about cases like component restarts and reloading. Let's see if we can address these concerns before changing direction.
  2. Firstly, since the cache is built in karmada-controller-manager's memory, there is no way to share it with the karmada-scheduler. Secondly, I'm hesitant to get the karmada-scheduler involved too much.

RainbowMango avatar Oct 26 '24 08:10 RainbowMango

Firstly, since the cache is built in karmada-controller-manager's memory, there is no way to share it with the karmada-scheduler.

Oof…good point. :(

mszacillo avatar Oct 26 '24 08:10 mszacillo

@RainbowMango Would you be opposed to making the clustersAfterFailover optional for now, while we determine the best way to update that field? The other fields (clustersBeforeFailover included) are all pretty clear and easily set during the initial failover step.

Otherwise, I've been thinking about workarounds to accurately capture the clustersAfterFailover without relying on the scheduler, and without having to keep track of the order of failoverHistoryItems. Perhaps instead of combining clustersBeforeFailover and clustersAfterFailover, we could introduce the concept of a FailoverEvent which can take two forms (names WIP):

  1. FailoverTriggered event -> (clustersBeforeFailover, startTime, reason)
  2. FailoverCompleted event -> (clustersAfterFailover, endTime).

FailoverTriggered could be controlled by the existing failover controller, whereas FailoverCompleted could be controlled by a new failover controller which tracks changes in RB placement.

The benefits include that the scheduler does not need to update the failover history status. It would also decouple the clustersBeforeFailover and clustersAfterFailover fields, which are difficult to track in order in cases of divided replicas. The drawback is that having different events could make readability more difficult. It would also require altering one of the existing failover controllers or adding a new controller to track failover completion.
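A very rough sketch of those two event shapes (names WIP as noted above; the types are illustrative only, assuming Karmada's workv1alpha2.TargetCluster and metav1.Time):

// FailoverTriggeredEvent would be recorded by the existing failover
// controller when a failover fires. Illustrative only.
type FailoverTriggeredEvent struct {
	ClustersBeforeFailover []workv1alpha2.TargetCluster
	StartTime              metav1.Time
	Reason                 string
}

// FailoverCompletedEvent would be recorded once the RB placement has settled
// after the failover. Illustrative only.
type FailoverCompletedEvent struct {
	ClustersAfterFailover []workv1alpha2.TargetCluster
	EndTime               metav1.Time
}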

Just throwing out an idea ~

mszacillo avatar Oct 28 '24 03:10 mszacillo

Would you be opposed to making the clustersAfterFailover optional for now, while we determine the best way to update that field? The other fields (clustersBeforeFailover included) are all pretty clear and easily set during the initial failover step.

I've been investigating this. It seems we can generate the failover history in the resource-binding-graceful-eviction-controller when clearing the eviction tasks.

Regarding the concern above:

The other issue is that if the gracePeriod is too low, the eviction task will be cleared before scheduling takes place, meaning we won't be able to construct the failoverHistory.

The resource-binding-graceful-eviction-controller guarantees that the eviction task is cleared only after the scheduler has reacted to it.

@XiShanYongYe-Chang We are going to figure out when to generate the failover history. Both application failover and cluster failover end in the resource-binding-graceful-eviction-controller, so this might be the best place to generate the history. What do you think?
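To make that ordering concrete, here is a hypothetical sketch (not actual controller code; the FailoverHistoryItem fields are assumed from this PR) of recording the history at the point where an eviction task is cleaned up:

// recordFailoverHistory is a hypothetical helper that would run in the
// resource-binding-graceful-eviction-controller just before an eviction task
// is removed, i.e. after the scheduler has already reacted to the eviction.
func recordFailoverHistory(binding *workv1alpha2.ResourceBinding, task workv1alpha2.GracefulEvictionTask) {
	item := workv1alpha2.FailoverHistoryItem{
		Reason: task.Reason,
		// StartTime should ideally come from when the eviction was created;
		// metav1.Now() is just a placeholder to keep the sketch short.
		StartTime:              metav1.Now(),
		ClustersBeforeFailover: []workv1alpha2.TargetCluster{{Name: task.FromCluster}},
		// By this point the binding's current placement already reflects the
		// post-failover scheduling result.
		ClustersAfterFailover: binding.Spec.Clusters,
	}
	binding.Status.FailoverHistory = append(binding.Status.FailoverHistory, item)
}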

RainbowMango avatar Oct 29 '24 09:10 RainbowMango

Both application failover and cluster failover end in the resource-binding-graceful-eviction-controller, so this might be the best place to generate the history. What do you think?

If we're interested in generating the historyItem in one place, then I think that's a good idea. Additionally, we rely on a failover label being appended to the workload in our use-case, so if we can guarantee that the eviction task is not cleaned up before scheduling, we can check the eviction reason and append the label if necessary.

mszacillo avatar Oct 29 '24 17:10 mszacillo

@RainbowMango The e2e test failure is due to interference between the existing cluster filtering step that checks clustersBeforeFailover and the workload rebalancer. Since the failoverHistory is not ephemeral, it will always filter out clusters in clustersBeforeFailover until there is a new failover. This doesn't work alongside the workload rebalancer.

Seems we should keep the existing filter logic that only checks eviction tasks (as long as we make sure these are cleaned up post scheduling).
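For illustration, that kind of eviction-task-based filter could look roughly like this (an assumed helper, not the PR's actual code):

// filterOutEvictedClusters drops candidate clusters that still have a pending
// graceful eviction task on the binding, so the scheduler avoids placing the
// workload back onto a cluster it is currently being evicted from.
func filterOutEvictedClusters(candidates []string, binding *workv1alpha2.ResourceBinding) []string {
	evicted := sets.New[string]()
	for _, task := range binding.Spec.GracefulEvictionTasks {
		evicted.Insert(task.FromCluster)
	}
	filtered := make([]string, 0, len(candidates))
	for _, c := range candidates {
		if !evicted.Has(c) {
			filtered = append(filtered, c)
		}
	}
	return filtered
}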

mszacillo avatar Oct 29 '24 17:10 mszacillo

Seems we should keep the existing filter logic that only checks eviction tasks (as long as we make sure these are cleaned up post scheduling).

If that's the case, we can narrow down the scope of this PR and focus on generating the history.

Note that, currently, for the purge modes Graciously and Never, we can ensure the eviction task exists before scheduling, as we will build an eviction task for them. However, we still need to take care of the Immediately mode. We can iterate on it in a separate PR.

RainbowMango avatar Oct 30 '24 08:10 RainbowMango

/retest

Seems agents were interrupted.

mszacillo avatar Oct 30 '24 21:10 mszacillo

@mszacillo: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Seems agents were interrupted.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

karmada-bot avatar Oct 30 '24 21:10 karmada-bot

/retest

RainbowMango avatar Oct 31 '24 01:10 RainbowMango

@RainbowMango Thanks for putting together the stateful application support checklist, it's very comprehensive!

I've just updated this PR to remove the last bit of the failover flag implementation, and have this PR solely focus on the API change. I ran some tests for both divided and grouped replicas, and the results looked as we expect:

Single replica

failoverHistory:
  - clustersAfterFailover:
    - cluster-A
    clustersBeforeFailover:
    - cluster-B
    reason: ApplicationFailure
    startTime: "2024-11-05T19:04:14Z"
  - clustersAfterFailover:
    - cluster-B
    clustersBeforeFailover:
    - cluster-A
    reason: ApplicationFailure
    startTime: "2024-11-05T19:04:45Z"

Divided replicas

failoverHistory:
  - clustersAfterFailover:
    - cluster-A
    - cluster-B
    clustersBeforeFailover:
    - cluster-A
    - cluster-B
    reason: ApplicationFailure
    startTime: "2024-11-05T19:00:07Z"
  - clustersAfterFailover:
    - cluster-A
    - cluster-B
    clustersBeforeFailover:
    - cluster-B
    - cluster-A
    reason: ApplicationFailure
    startTime: "2024-11-05T19:00:38Z"

Note: The purgeMode for this resource was set to Immediately, and since we've decided to go with the evictionTask strategy for filtering out clusters, the Karmada scheduler rescheduled the replicas to the same clusters. This should be resolved in the future. Please let me know if anything else needs to be addressed in this PR.

Cheers!

mszacillo avatar Nov 05 '24 19:11 mszacillo

Additional comment before I forget - we still append the clustersAfterFailover within the scheduler for now. As discussed, this logic should be moved into the evictionTask controller before the eviction task gets cleared. We can tackle that in an additional PR.

If you'd like to remove the setting of the failoverHistory and keep this as simply an API change, I could do that as well. Let me know.

mszacillo avatar Nov 05 '24 19:11 mszacillo

Hi @Dyex719 @mszacillo @RainbowMango

Regarding the specific implementation details of the plan, I have combined previous discussions and preliminary conclusions into some analysis and organization, which should aid in implementing the feature. Can you help take a look?

https://docs.google.com/document/d/1Np67nVXokSCAfzrI8P8JwQgvqJF64068Yu6leDn2Kjs/edit?tab=t.0#heading=h.eyvom3e0lj8w

XiShanYongYe-Chang avatar Nov 06 '24 02:11 XiShanYongYe-Chang