Add a label/annotation to the resource being rescheduled during failover
What would you like to be added: We propose adding a label/annotation during failover so that webhooks like Kyverno can perform the necessary checks/changes before the job is rescheduled. We are also open to discussing other ideas and contributing back to the community.
Why is this needed: Stateful applications may need to read the last saved state to resume processing after failover. This may involve a change in the spec so that the path to read from can be specified. It would be useful to know when a failover happened so that stateful applications can perform the necessary checks/changes before restarting.
Our particular use case involves migrating Flink applications using Karmada. The step-by-step process would be:
- Karmada deploys the FlinkDeployment CR to a member cluster
- The CR fails over due to a cluster failover / application failover
- The label/annotation would get added during this process (for example: "failover": true)
- When Karmada reschedules the application to a different member cluster, a webhook like Kyverno could mutate the spec by checking for this label ("failover": true) and restart the application from the last state only if the label exists (see the policy sketch after this list)
- Resume processing of the stateful application from the last state
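To make the webhook side concrete, here is a minimal sketch of what such a Kyverno mutation policy could look like. The policy name, the "failover: true" label key, and the savepoint path are illustrative placeholders (the label is only the example from this proposal, not an existing Karmada label); the field paths follow the Flink Kubernetes operator's spec.job section.

```yaml
# Hypothetical Kyverno policy: mutate FlinkDeployments that carry the example
# "failover: true" label so they resume from the last saved state.
# The label key and savepoint path are placeholders, not existing Karmada APIs.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: resume-flink-from-last-state
spec:
  rules:
    - name: set-initial-savepoint-path
      match:
        any:
          - resources:
              kinds:
                - FlinkDeployment
              selector:
                matchLabels:
                  failover: "true"
      mutate:
        patchStrategicMerge:
          spec:
            job:
              upgradeMode: savepoint
              # Placeholder path; in practice this would point at the last
              # checkpoint/savepoint of the failed-over job.
              initialSavepointPath: "s3p://example-bucket/savepoints/last-known-state"
              state: running
```

If the label is absent, the policy does not match and the FlinkDeployment starts normally.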
It seems this issue is very similar to the pause feature we have been discussing before. /cc @XiShanYongYe-Chang
https://github.com/karmada-io/karmada/issues/4688
https://github.com/karmada-io/karmada/issues/4421
Hi @Dyex719, thanks for your feedback.
I have a few questions. Are you referring to the labels being automatically added by the system? When Karmada reschedules, how does the system trigger the webhook or Kyverno to run?
Hi @XiShanYongYe-Chang, Thanks for getting back to us!
> Are you referring to the labels being automatically added by the system?
Yes, Karmada would add a label after failover so that the next time it is rescheduled it would have the label.
> How does the system trigger the webhook or Kyverno to run?
Kyverno would be deployed on the member clusters and is therefore invoked as part of the scheduling process. Kyverno would check whether the label exists on the spec and, if it does, read the last state accordingly.
I'm sorry, I still don't understand the whole process.
Hi @chaunceyjiang do you understand the requirement?
My understanding is that @Dyex719 wants a label to indicate that the current resource is undergoing failover. This is because he expects this transition state to be recognized by other third-party software.
If I remember correctly, when a resource is transitioning, there will be a GracefulEvictionTasks entry in the derived ResourceBinding to indicate the tasks currently being transferred.
> If I remember correctly, when a resource is transitioning, there will be a GracefulEvictionTasks entry in the derived ResourceBinding to indicate the tasks currently being transferred.
Yes, the cluster to be removed will be placed here:
https://github.com/karmada-io/karmada/blob/d676996b240f43ca0430d492b8c54eb0b4986c2d/pkg/apis/work/v1alpha2/binding_types.go#L97-L111
This should be more specific than a label.
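For context, a graceful eviction task recorded in the derived ResourceBinding looks roughly like the sketch below. The field names follow the linked binding_types.go, while the cluster name, reason, producer, and timestamp are illustrative values only.

```yaml
# Rough shape of a graceful eviction task in a derived ResourceBinding;
# all values below are illustrative examples, not output from a real cluster.
apiVersion: work.karmada.io/v1alpha2
kind: ResourceBinding
metadata:
  name: example-flinkdeployment
spec:
  gracefulEvictionTasks:
    - fromCluster: member-a               # the cluster the workload is being evicted from
      reason: ApplicationFailure          # illustrative eviction reason
      producer: resource-binding-application-failover-controller  # illustrative producer
      creationTimestamp: "2024-05-07T10:00:00Z"
```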
Hi @XiShanYongYe-Chang and @chaunceyjiang, Thank you for the comments.
As far as I understand, using GracefulEvictionTasks would not work in some scenarios, for example when the eviction mode is immediate (https://github.com/karmada-io/karmada/blob/d676996b240f43ca0430d492b8c54eb0b4986c2d/pkg/controllers/applicationfailover/rb_application_failover_controller.go#L176).
To decouple this, maybe we can add a field to the ResourceBindingStatus? I am including the time here as well, as that could be useful, but maybe it is not necessary.
```go
type ResourceBindingStatus struct {
	...
	// LastFailoverTime represents the latest timestamp when a workload was failed over.
	// It is represented in RFC3339 form (like '2006-01-02T15:04:05Z') and is in UTC.
	// +optional
	LastFailoverTime *metav1.Time `json:"lastFailoverTime,omitempty"`
	...
}
```
Thanks, I understand what you mean by this field.
One more question,
> before the job is rescheduled.
Do we need to pause and wait for the verification to complete before rescheduling? Or can the rescheduling logic be executed concurrently, regardless of whether validation is being performed?
Verification/Mutation is done on the member clusters and is performed after Karmada has rescheduled the job. The flow is:
- Karmada reschedules the job from Cluster A to Cluster B
- Verification of the label and reading from the last state is done by Kyverno/webhook on Cluster B
- The job is restored from the last state on Cluster B.
So the pausing is done by Kyverno/the webhook on the member cluster, which I believe is blocking in nature.
Thanks for your explanation @Dyex719
It looks like the Karmada control plane doesn't need to be aware of the check or pause.
Previously I thought that labels would be added during rescheduling and cleaned up after rescheduling. In your description, when will these labels be cleaned up?
We didn't think about a cleanup strategy; the idea was mainly to update the label with the latest time in case of multiple failures.
The label is only added on rescheduling due to failure, and would not be added if the job is being scheduled for the first time. What are your reasons for needing a cleanup operation?
Thanks, I get it. Let me invite more folks to help take a look. cc @chaunceyjiang @RainbowMango @whitewindmills @chaosi-zju
Hi @XiShanYongYe-Chang, @chaunceyjiang, thanks for taking a look at this issue!
At the moment we've been able to make a quick fix for this by altering the ResourceBindingStatus as previously suggested. We added a method updateFailoverStatus to the rb_application_failover_controller (it will also be added to the cluster failover controller), which appends a new condition to the ResourceBindingStatus indicating a previous failover.
In pkg/controllers/binding/common.go we check for the failover condition, and if it is present we append a label to the Work being created during rescheduling. We would prefer using an actual field in the status rather than a condition, but for some reason I had difficulty getting the LastFailoverTime status to update correctly; we will need to investigate this.
You can view the code here: https://github.com/karmada-io/karmada/compare/master...mszacillo:karmada:failover-label, but please note this is not PR ready and was mostly just a quick fix for testing on our end!
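For readers following the thread, the net effect of that quick fix is roughly sketched below. The condition type and label key are placeholders chosen for illustration, not necessarily the exact names used in the linked branch.

```yaml
# Sketch of the quick-fix behavior described above; the condition type and
# label key are placeholders, not necessarily the names used in the branch.
# 1. updateFailoverStatus appends a condition to the ResourceBinding status:
status:
  conditions:
    - type: Failover                      # placeholder condition type
      status: "True"
      reason: ApplicationFailover
      lastTransitionTime: "2024-05-07T10:00:00Z"
---
# 2. When the Work is (re)created during rescheduling, the condition is detected
#    and a label is stamped onto the propagated workload, e.g. the example label
#    from this issue:
metadata:
  labels:
    failover: "true"
```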
Hi @mszacillo @Dyex719, thanks.
If you don't mind my asking, which company are you from? Have you started using Karmada?
No problem. We're from Bloomberg - at the moment we aren't using Karmada in production environments, but we are investigating the work necessary to get Karmada working for stateful application failover. Perhaps this is something we can discuss more during the community meeting.
@mszacillo Thanks a lot~
I'm interested in this use case that helps stateful workloads, like FlinkDeployment, to resume processing.
> When Karmada reschedules the application to a different member cluster, a webhook like Kyverno could mutate the spec by checking for this label ("failover": true) and restart the application from the last state only if this label exists
@Dyex719 Can you share with us exactly which FlinkDeploymentSpec field Kyverno would mutate?
In addition, as far as I know, there will be checkpoint storage; do you need to migrate the data across clusters?
Hi @RainbowMango,
Please refer to this section and the upgrade modes. There are a few fields that would be mutated by Kyverno (a sketch of the mutated spec follows below):
- initialSavepointPath: "desired checkpoint path to be resumed from (s3p://)"
- upgradeMode: savepoint
- state: running
The checkpoint storage will need to be replicated across data centers so that the state can be resumed from the last checkpoint. If this is supported no migration is needed. Essentially, the checkpoint storage should be independent of cluster failure.
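Putting the fields above into context, the relevant part of the FlinkDeployment spec after mutation would look roughly like this. The savepoint path is a placeholder, and the field paths follow the Flink Kubernetes operator's spec.job section.

```yaml
# Sketch of the FlinkDeployment job section after Kyverno's mutation on the
# new member cluster; the savepoint path is a placeholder.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-flink-job
spec:
  job:
    upgradeMode: savepoint                # resume from the referenced savepoint/checkpoint
    initialSavepointPath: "s3p://example-bucket/checkpoints/last-known-checkpoint"  # placeholder
    state: running                        # desired job state after restore
```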
Thanks, the idea is impressive! Then Kyverno won't be involved in complex logic, like checking or syncing checkpoints and so on (that was my concern). Kyverno just needs to be aware that the FlinkDeployment being created is migrating from another cluster, and subsequently it will adjust accordingly based on the configuration (like labels) of the FlinkDeployment itself.
Yup, that's exactly correct! There are a few small issues with how Flink works, though:
Flink creates a new jobID every time a new job is scheduled, so when migration happens another new ID is created. This is a problem because the checkpoint to restore from is of the form s3p://
Since this previous jobID is not stored anywhere, we will need Karmada to carry this field over. If you think Karmada should support stateful applications like this, we can talk about how to handle such cases. I created https://github.com/karmada-io/karmada/issues/5006 to discuss a generic framework for such stateful applications.
@RainbowMango Since there are a lot of moving parts here with varying degrees of urgency, would it be recommended for us to create a proposal document with all the requirements for supporting FlinkDeployment (and other stateful) failover?
@Dyex719 Yeah, I can see the jobId (.status.jobStatus.jobId) on the FlinkDeployment, but I don't know much about it. Each time FlinkDeployment launches a job, it will assign a unique ID for it, which happens to be used as the checkpoint storage path, resulting in this checkpoint being usable only by the current job. Please correct me if I'm wrong. Are there any improvements that could be made to the Flink operator? Like allowing the user to specify a job ID or a specific checkpoint path. Just a guess. What do you think?
@mszacillo Sure! That would be great to have a proposal to address all these things!
I believe that the migration of a stateful application is a very challenging task, but it's something of great value. What I'm thinking is that perhaps there are tasks that can be done before the migration, such as extensible mechanisms to allow users to do some preparatory work, and some tasks that can be done after the migration, just like the approach we are talking about here. Thank you in advance. Here is the proposal template you might need.
@RainbowMango
> Each time FlinkDeployment launches a job, it will assign a unique ID for it, which happens to be used as the checkpoint storage path, resulting in this checkpoint being usable only by the current job.
This is correct.
> Yeah, I can see the jobId (.status.jobStatus.jobId) on the FlinkDeployment, but I don't know much about it. (https://github.com/karmada-io/karmada/issues/4905#issuecomment-2097921854)
As you mentioned, the job ID is only present in the status so this cannot be accessed by Kyverno as the job is not running yet. This is the main problem.
> Are there any improvements that could be made to the Flink operator? Like allowing the user to specify a job ID or a specific checkpoint path?
It is technically possible to add a static job ID to create a specific checkpoint path, but this is not ideal. Any sort of user error, like creating a new job or restarting from a checkpoint, may result in overwriting previous checkpoints, which is dangerous.
There is some interest in the community in making the process of reading from a checkpoint easier, for example: https://issues.apache.org/jira/browse/FLINK-9043. However, this issue has been open for a long time.
We will work on the proposal and can then hopefully talk about this later!
/close
This requirement has been addressed by #5788 and the feature has been included in release-1.12.
@RainbowMango: Closing this issue.
In response to this:
> /close
> This requirement has been addressed by #5788 and the feature has been included in release-1.12.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.