Proposal: More pause options for disaster recovery control

Open chlunde opened this issue 3 years ago • 3 comments

Just writing down two related ideas here

What problem are you facing?

Disaster recovery or migrating resources to other clusters is hard and scary.

How could Crossplane help solve your problem?

During migration or disaster recovery, it will be difficult to set "pause" on all resources. It would be nice to be able to pause an entire provider, for example with a CLI argument such as --pause.

It would also be nice to have a pause option which would Observe but not Create/Update/Delete. This would give an operator confidence in what kinds of actions would run when the cluster is unpaused. This might be a different CLI option or annotation.

chlunde avatar Oct 06 '22 09:10 chlunde

Another way to completely disable a provider is to set replicas to 0 in the provider's ControllerConfig.

bobh66 avatar Oct 06 '22 15:10 bobh66

@chlunde so we have two options for disaster recovery use-cases:

  1. As @bobh66 mentioned, setting the replicas to 0 for the ControllerConfig.
  2. Setting the pause annotation for specific resources: https://crossplane.io/docs/v1.10/concepts/managed-resources.html#pausing-reconciliations.
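As a rough illustration of the two options, here is a minimal Python sketch that builds the JSON merge patches one might apply with `kubectl patch --type merge`. The helper names are made up for this example. It assumes the ControllerConfig exposes `spec.replicas` and that the pause annotation is `crossplane.io/paused`, as described in the linked docs.

```python
import json

# Pause annotation as described in the Crossplane managed resources docs.
PAUSE_ANNOTATION = "crossplane.io/paused"


def scale_down_patch():
    """Merge patch for option 1: set a ControllerConfig's replicas to 0."""
    return {"spec": {"replicas": 0}}


def pause_patch(paused=True):
    """Merge patch for option 2: set or clear the pause annotation."""
    # In a JSON merge patch, a null value removes the key.
    value = "true" if paused else None
    return {"metadata": {"annotations": {PAUSE_ANNOTATION: value}}}


if __name__ == "__main__":
    # Print the patch bodies; applying them to real resources is left out.
    print(json.dumps(scale_down_patch()))
    print(json.dumps(pause_patch(True)))
    print(json.dumps(pause_patch(False)))
```

Option 1 stops every controller in the provider at once, while option 2 has to be applied to each resource individually.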

Would that be sufficient for your use-cases? If not, would you mind elaborating why not?

luebken avatar Nov 03 '22 11:11 luebken

@luebken my main worry with use cases such as

  • restoring a cluster (recreate, partial restore, go back in time for a namespace) with thousands of managed resources
  • restore an external resource from backup and then restore and re-attach it to a managed resource

would be that due to some unforeseen issue:

  • many resources are created twice, for example because generateName gives us role-HASH2 when we already had role-HASH1. This could happen when restoring only a claim whose composition does not render a predictable name/external-name.
  • resources are garbage collected and then deleted, if we only restore managed resources without their claims

So I would like to pause Create/Update/Delete but not Observe to ensure everything is as expected. Pause (as implemented today) would not give the kind of comfort a terraform plan does, but this might.
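To make the idea concrete, here is a toy Python sketch of what such a plan-like mode could report. It is not how Crossplane works internally: the `plan` function, the `deleting` flag, and the dictionary shapes are all hypothetical, chosen only to show "observe, then list the writes that would happen, without executing them".

```python
def plan(managed, external):
    """List the writes a reconcile would perform, without performing them.

    managed:  hypothetical map of external name to
              {"spec": ..., "deleting": bool} for each managed resource.
    external: hypothetical map of external name to the observed spec.
    """
    actions = []
    for name, resource in sorted(managed.items()):
        observed = external.get(name)
        if resource.get("deleting"):
            # Only an existing external resource can be deleted.
            if observed is not None:
                actions.append(("delete", name))
        elif observed is None:
            actions.append(("create", name))
        elif observed != resource["spec"]:
            actions.append(("update", name))
    return actions


if __name__ == "__main__":
    # The restored claim rendered role-HASH2, but role-HASH1 already exists.
    managed = {"role-HASH2": {"spec": {"policy": "read"}}}
    external = {"role-HASH1": {"policy": "read"}}
    for action, name in plan(managed, external):
        print(action, name)
```

In the example, the plan lists a create for role-HASH2 even though role-HASH1 already exists, which is the duplicate an operator would want to catch before unpausing.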

chlunde avatar Nov 09 '22 21:11 chlunde

Now that we have Observe only resources, I don't think this is an issue. In this case we can achieve the same thing, for example with something like this:

  • Set all replicas to 0 for providers
  • Update compositions to observe only https://github.com/crossplane/crossplane/issues/1722
  • Restore from backup
  • Enable providers by setting replicas
  • Verify managed resources
  • Remove observe only from compositions
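The two composition steps could be scripted along these lines. This is only a sketch under assumptions: it treats a composition as a plain dictionary with templates under `spec.resources[].base`, and it assumes the provider version in use supports `spec.managementPolicies: ["Observe"]` on managed resources. The function names are made up for this example.

```python
import copy

# Assumed management policy value for observe-only managed resources.
OBSERVE_ONLY = ["Observe"]


def set_observe_only(composition):
    """Return a copy where every composed resource template is observe-only."""
    updated = copy.deepcopy(composition)
    for resource in updated.get("spec", {}).get("resources", []):
        base_spec = resource.setdefault("base", {}).setdefault("spec", {})
        base_spec["managementPolicies"] = list(OBSERVE_ONLY)
    return updated


def remove_observe_only(composition):
    """Return a copy with the observe-only policy removed from every template."""
    updated = copy.deepcopy(composition)
    for resource in updated.get("spec", {}).get("resources", []):
        resource.get("base", {}).get("spec", {}).pop("managementPolicies", None)
    return updated
```

Both helpers return a modified copy and leave the input untouched, so the original composition can be kept around and compared against after the verification step.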

chlunde avatar Sep 02 '24 09:09 chlunde