backup-restore-operator

Restore renders downstream cluster unusable if it resides in a non-`fleet-default` namespace

Open hwo-wd opened this issue 5 months ago • 3 comments

Rancher Server Setup

  • Rancher version: 2.9.1
  • Installation option (Docker install/Helm Chart): helm, 104.0.1+up5.0.1
  • Kubernetes Version and Engine: v1.28.12, rke1

Describe the bug

Creating a provisioning.v2 cluster (e.g., via GitOps) in a namespace other than fleet-default, creating a backup, pruning all Rancher resources, and then restoring leaves that cluster in an irrecoverable (?) state:

Rkecontrolplane was already initialized but no etcd machines exist that have plans, indicating the etcd plane has been entirely replaced. Restoration from etcd snapshot is required.

To Reproduce

Steps to reproduce the behavior:

  1. Create a cluster not residing in the fleet-default namespace and let it be provisioned using CAPI
  2. Create a backup
  3. Clean up any Rancher resources: kubectl apply -f https://raw.githubusercontent.com/rancher/rancher-cleanup/main/deploy/rancher-cleanup.yaml; note that this will delete the machine-plan secret, even though it resides in a non-Rancher-default namespace
  4. Restore from the backup created in 2. above
  5. Observe the error: the downstream cluster's system agent (which is working just fine) is no longer able to connect to upstream; after investigation, this stems from the -machine-plan$ secret not residing in the fleet-default namespace.
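
Before running the cleanup in step 3, it can help to know which secrets are at risk. The following is a minimal Python sketch, assuming the input is JSON in the shape produced by `kubectl get secrets -A -o json`; the function name and the sample data are hypothetical and only illustrate the filtering:

```python
import json
import re

# Name suffix taken from the issue text; fleet-default is the namespace
# the issue describes as the Rancher default.
MACHINE_PLAN = re.compile(r"-machine-plan$")
DEFAULT_NAMESPACE = "fleet-default"


def machine_plans_outside_default(secret_list_json: str) -> list[tuple[str, str]]:
    """Return (namespace, name) pairs of machine-plan secrets outside fleet-default."""
    items = json.loads(secret_list_json).get("items", [])
    found = []
    for item in items:
        meta = item.get("metadata", {})
        name = meta.get("name", "")
        namespace = meta.get("namespace", "")
        if MACHINE_PLAN.search(name) and namespace != DEFAULT_NAMESPACE:
            found.append((namespace, name))
    return found


# Hypothetical sample shaped like a Kubernetes secret list.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"namespace": "fleet-default", "name": "c1-pool-abc-machine-plan"}},
        {"metadata": {"namespace": "team-a", "name": "c2-pool-def-machine-plan"}},
        {"metadata": {"namespace": "team-a", "name": "unrelated-secret"}},
    ]
})

if __name__ == "__main__":
    for namespace, name in machine_plans_outside_default(SAMPLE):
        print(f"{namespace}/{name}")
```

Any pair this reports is a secret that, per the observation in step 5, would not come back after the restore.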

A possible fix, which is tough to maintain over time until #487 becomes a thing, is to broaden the backup of the machine-plan secret by creating a new ResourceSet and extending the existing one with the following:

- apiVersion: v1
  kindsRegexp: ^secrets$
  namespaceRegexp: "^.*"
  resourceNameRegexp: machine-plan$|rke-state$|machine-state$|machine-driver-secret$|machine-provision$|^harvesterconfig|^registryconfig-auth

This way, the important machine-plan secret is part of the backup and gets restored, thus the downstream cluster system agent can connect just fine.
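
The reach of this selector can be checked offline. The sketch below copies the two regular expressions from the snippet above and assumes they are applied as unanchored searches against the namespace and the secret name; it uses Python's `re` module rather than the operator's own regex engine, and the function name and sample secrets are hypothetical:

```python
import re

# Regexes copied verbatim from the proposed ResourceSet entry.
NAMESPACE_REGEXP = re.compile(r"^.*")
RESOURCE_NAME_REGEXP = re.compile(
    r"machine-plan$|rke-state$|machine-state$|machine-driver-secret$"
    r"|machine-provision$|^harvesterconfig|^registryconfig-auth"
)


def selected(namespace: str, name: str) -> bool:
    """Assumed semantics: a secret is selected when both regexes find a match."""
    return bool(NAMESPACE_REGEXP.search(namespace)) and bool(
        RESOURCE_NAME_REGEXP.search(name)
    )


if __name__ == "__main__":
    # Hypothetical secrets in a non-default namespace.
    for ns, name in [
        ("team-a", "c2-pool-def-machine-plan"),
        ("team-a", "c2-pool-def-machine-state"),
        ("team-a", "unrelated-secret"),
    ]:
        print(f"{ns}/{name}: {selected(ns, name)}")
```

Under these assumptions, the match-all namespace pattern only widens where the operator looks; which secrets get picked up is still decided by the name patterns.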

Expected behavior

  • machine-plan secrets are essential and should be backed up independent of the namespace they reside in

Note: I'd be happy to contribute a PR; I just don't know whether namespaceRegexp: "^.*" might be too generic for your taste, although the resource selectors are still quite specific.

hwo-wd · Sep 09 '24 11:09