Allow customizing restore order for Kubernetes controllers and their managed resources

Open DanielXiao opened this issue 3 years ago • 3 comments

Describe the problem/challenge you have

When restore targets contain Kubernetes controllers, it is possible to hit the following issues:

  1. Velero is not aware of dependencies among Custom Resources and restores them in alphabetical order. E.g., "invalid memory address or nil pointer dereference".
  2. There is a race condition between Velero and a controller when both operate on the same resource. See the following errors from an Antrea restore:

time="2021-08-10T16:41:04Z" level=info msg="Attempting to restore Tier: securityops" logSource="pkg/restore/restore.go:1070" restore=velero/restore-48c089d0-03ed-4f30-8532-a2e9837bea94
time="2021-08-10T16:41:04Z" level=info msg="error restoring securityops: admission webhook "tiervalidator.antrea.tanzu.vmware.com" denied the request: tier securityops priority 50 overlaps with existing Tier" logSource="pkg/restore/restore.go:1133" restore=velero/restore-48c089d0-03ed-4f30-8532-a2e9837bea9

error restoring application: admission webhook "tiervalidator.antrea.tanzu.vmware.com" denied the request: tier application priority 250 overlaps with existing Tier"
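To illustrate the first problem in the list above, here is a minimal sketch (not Velero code) that contrasts alphabetical ordering with dependency-aware ordering. The dependency map is an assumption made for illustration: it supposes that an Antrea ClusterNetworkPolicy refers to a Tier and therefore has to be restored after it.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each kind lists the kinds that must be
# restored before it. The ClusterNetworkPolicy -> Tier edge is an
# assumption for illustration, not something Velero knows about.
depends_on = {
    "ClusterNetworkPolicy": {"Tier"},
    "Tier": set(),
}

# Alphabetical order puts the dependent kind first.
alphabetical = sorted(depends_on)

# A topological sort places every kind after the kinds it depends on.
dependency_aware = list(TopologicalSorter(depends_on).static_order())

print(alphabetical)
print(dependency_aware)
```

With alphabetical ordering, ClusterNetworkPolicy is attempted before the Tier it depends on. The topological order reverses the two.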

Describe the solution you'd like

The default restore order shows that controller Pods are restored before the Custom Resources they manage, so we may solve this problem by:

  1. Allowing users to define a restore order for Custom Resources on a per-restore basis.
  2. Marking controller Pods/Deployments with labels, removing them from the ordered list, and appending them to the end (before any managed resources).

For an Antrea cluster, the order should be: default restore order -> Tier CR -> other Antrea CRs -> Antrea controller Pod -> Antrea controller ReplicaSet and Deployment -> Antrea MutatingWebhookConfiguration and ValidatingWebhookConfiguration
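The proposal can be sketched as a sort over the resources in a backup. This is only an illustration of the idea, not Velero's implementation. The label key `example.io/restore-last`, the function `restore_order`, the abbreviated kind lists, and every resource name except the `securityops` Tier are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical label marking a controller workload; not an existing Velero label.
CONTROLLER_LABEL = "example.io/restore-last"


@dataclass
class Resource:
    kind: str
    name: str
    labels: dict = field(default_factory=dict)


def restore_order(resources, default_kinds, custom_kinds, controller_kinds):
    """Order resources: default kinds, then user-defined kinds, then labeled controllers."""
    kind_rank = {k: i for i, k in enumerate(list(default_kinds) + list(custom_kinds))}
    ctrl_rank = {k: i for i, k in enumerate(controller_kinds)}
    unlisted = len(kind_rank)  # kinds not listed anywhere sort after the listed ones

    def key(res):
        # Labeled controller resources are pulled out and appended at the end.
        if res.labels.get(CONTROLLER_LABEL) == "true" and res.kind in ctrl_rank:
            return (1, ctrl_rank[res.kind])
        return (0, kind_rank.get(res.kind, unlisted))

    return sorted(resources, key=key)  # sorted() is stable, so ties keep input order


# Assumed, abbreviated lists following the Antrea order described above.
default_kinds = ["Namespace", "Pod", "ReplicaSet", "Deployment"]
custom_kinds = ["Tier", "ClusterNetworkPolicy"]
controller_kinds = ["Pod", "ReplicaSet", "Deployment",
                    "MutatingWebhookConfiguration", "ValidatingWebhookConfiguration"]

marked = {CONTROLLER_LABEL: "true"}
backup = [
    Resource("Deployment", "antrea-controller", marked),
    Resource("Tier", "securityops"),
    Resource("Pod", "web"),
    Resource("ValidatingWebhookConfiguration", "antrea-webhook", marked),
    Resource("Pod", "antrea-controller-pod", marked),
    Resource("ClusterNetworkPolicy", "example-policy"),
    Resource("Namespace", "kube-system"),
]

ordered = restore_order(backup, default_kinds, custom_kinds, controller_kinds)
for res in ordered:
    print(res.kind, res.name)
```

In this sketch the unlabeled `web` Pod keeps its place in the default order, while the labeled Antrea controller Pod, Deployment, and webhook configuration move behind the Tier and the other Custom Resources.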

Anything else you would like to add:

Many workloads today consist of controllers and operators, so both disaster recovery and migration may hit this issue.

Environment:

  • Velero version (use velero version): all
  • Kubernetes version (use kubectl version): all
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration: all
  • OS (e.g. from /etc/os-release): all

DanielXiao, Aug 18 '21 10:08