
Backup with CSI partially fails (PartiallyFailed) with a large number of PVCs

Open eemcmullan opened this issue 2 years ago • 5 comments

What steps did you take and what happened: While running a larger backup with the CSI plugin (1000 PVCs), the backup ended PartiallyFailed with the following error:

time="2022-09-21T19:36:03Z" level=error msg="fail to recreate VolumeSnapshotContent snapcontent-72e8f17f-bd98-42cf-b039-7d353057e440: fail to retrieve VolumeSnapshotContent snapcontent-72e8f17f-bd98-42cf-b039-7d353057e440 info: timed out waiting for the condition" backup=openshift-adp/backuptest logSource="pkg/controller/backup_controller.go:974"

What did you expect to happen: The backup to complete successfully.

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]

After some investigation, I suspect the issue is here: https://github.com/vmware-tanzu/velero/blob/main/pkg/controller/backup_controller.go#L1014

A new image was built with this value changed to 5 minutes, and the backup progressed further than the initial backup before partially failing (from ~500 VSCs to ~900 VSCs).

The image was then updated to a 10-minute timeout, and the backup completed. This was tested several times to verify.

The timeout here may need to be configurable. CSISnapshotTimeout could possibly be re-used for this case.
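For illustration, here is a minimal sketch of what parameterizing that wait could look like. The helper name, client wiring, poll interval, and import versions are assumptions for the sketch, not Velero's actual code:

```go
// Sketch only: the hard-coded wait replaced by a caller-supplied timeout.
package backup

import (
	"context"
	"time"

	snapshotclient "github.com/kubernetes-csi/external-snapshotter/client/v6/clientset/versioned"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForVSCDeletion polls until the named VolumeSnapshotContent no longer
// exists (so it can safely be recreated with a static name). The timeout is
// a parameter rather than a constant -- the change tested above at 5 and
// then 10 minutes.
func waitForVSCDeletion(ctx context.Context, c snapshotclient.Interface, name string, timeout time.Duration) error {
	return wait.PollImmediate(5*time.Second, timeout, func() (bool, error) {
		_, err := c.SnapshotV1().VolumeSnapshotContents().Get(ctx, name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // gone; safe to recreate
		}
		// err == nil means the VSC is still present: keep polling.
		// Any other error aborts the wait.
		return false, err
	})
}
```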

Environment:

  • Velero version (use velero version):
  • Velero features (use velero client config get features):
  • Kubernetes version (use kubectl version):
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.

  • :+1: for "I would like to see this bug fixed as soon as possible"
  • :-1: for "There are more important bugs to focus on right now"

eemcmullan avatar Sep 30 '22 15:09 eemcmullan

The error message "fail to retrieve VolumeSnapshotContent" at https://github.com/vmware-tanzu/velero/blob/main/pkg/controller/backup_controller.go#L1035 seems misleading. The code above is trying to validate that the VolumeSnapshotContent resource is indeed deleted, so if the resource is still present after the timeout, the error should really say something like "VolumeSnapshotContent still exists".

Apart from that, it seems deletion of this resource is taking longer than expected in your case. This could simply be due to load on the kube-apiserver. While I agree that a configurable timeout would be useful, I don't think CSISnapshotTimeout is appropriate, as it has a very specific meaning.
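To make the distinction concrete, a hypothetical helper along those lines (not existing Velero code) could report "still exists" when the object is still retrievable after the wait expires, and keep the retrieval-failure wording only for genuine API errors:

```go
// Hypothetical error reporting after the deletion wait has timed out.
package backup

import (
	"context"
	"fmt"
	"time"

	snapshotclient "github.com/kubernetes-csi/external-snapshotter/client/v6/clientset/versioned"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// deletionTimeoutError checks the VSC one more time and words the error
// according to what it finds.
func deletionTimeoutError(ctx context.Context, c snapshotclient.Interface, name string, timeout time.Duration) error {
	_, err := c.SnapshotV1().VolumeSnapshotContents().Get(ctx, name, metav1.GetOptions{})
	switch {
	case err == nil:
		// The object is still there: say so, instead of "fail to retrieve".
		return fmt.Errorf("timed out after %s: VolumeSnapshotContent %s still exists", timeout, name)
	case apierrors.IsNotFound(err):
		// It disappeared just after the last poll; not an error.
		return nil
	default:
		return fmt.Errorf("fail to retrieve VolumeSnapshotContent %s info: %w", name, err)
	}
}
```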

draghuram avatar Sep 30 '22 19:09 draghuram

@draghuram Sounds like you're agreeing that we should make it configurable, but that we ought to use a different (new) timeout parameter rather than reuse an existing one -- is that correct? The relevant question here is whether the cluster scenarios in which we'd want to increase or decrease CSISnapshotTimeout are the same ones that would indicate this timeout needs changing, or whether they're affected by completely different factors -- which would mean we might want them to have different values.

sseago avatar Sep 30 '22 19:09 sseago

Correct. In this particular case, the resource didn't disappear within the timeout, and that could have happened for many reasons. CSISnapshotTimeout checks whether a VolumeSnapshot has moved to the ready state, so it doesn't sound appropriate here.

draghuram avatar Sep 30 '22 20:09 draghuram

Velero involves a lot of interaction with the api-server; it's not realistic to make each timeout configurable.

I think we should investigate why deleting a VSC can take so long.
Is it expected that removing one VSC takes longer when a large number of VSCs exist? Or is it because there are too many concurrent deletions happening here: https://github.com/vmware-tanzu/velero/blob/main/pkg/controller/backup_controller.go#L957

If this is the root cause, we should consider grouping the deletions to lower the level of concurrency.
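If over-concurrency turns out to be the cause, one simple way to group the deletions is to bound them with a semaphore. A rough sketch, assuming the deletions are currently issued one goroutine per VSC (function and parameter names are illustrative, not existing Velero code):

```go
// Sketch: cap concurrent VolumeSnapshotContent deletions with a semaphore.
package backup

import (
	"context"
	"sync"

	snapshotclient "github.com/kubernetes-csi/external-snapshotter/client/v6/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// deleteVSCsBounded deletes the named VSCs with at most maxConcurrent
// requests in flight at any time, and collects any errors.
func deleteVSCsBounded(ctx context.Context, c snapshotclient.Interface, names []string, maxConcurrent int) []error {
	sem := make(chan struct{}, maxConcurrent)
	var (
		mu   sync.Mutex
		errs []error
		wg   sync.WaitGroup
	)
	for _, name := range names {
		name := name // capture loop variable for the goroutine
		wg.Add(1)
		sem <- struct{}{} // blocks once maxConcurrent deletions are in flight
		go func() {
			defer wg.Done()
			defer func() { <-sem }()
			if err := c.SnapshotV1().VolumeSnapshotContents().Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
				mu.Lock()
				errs = append(errs, err)
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return errs
}
```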

reasonerjt avatar Oct 10 '22 06:10 reasonerjt

I'm not sure it's realistic to expect that hard-coded timeouts will never fail us, either. Having a few more (optionally) configurable timeouts might allow users to succeed in backing things up in cases where the current code would fail them. Obviously we should investigate any case where actions take too long for the default timeouts -- we may want to add to the docs that any time a user has to increase a timeout they should open an issue so we can investigate. At the same time, it's not reasonable for a slower-than-average apiserver in a cluster to absolutely prevent a user from backing up workloads because we didn't want to add a config value.

sseago avatar Oct 10 '22 13:10 sseago

@sseago Thanks for the comment.

At the same time, it's not reasonable for a slower-than-average apiserver in a cluster to absolutely prevent a user from backing up workloads because we didn't want to add a config value.

That's a valid point. To handle the general issue of a slower-than-average apiserver, how about we just allow the user to configure the "default timeout", rather than exposing too many different configuration values?

What I hope we can avoid is exposing many different values like backup-deletion-timeout, pvb-deletion-timeout, pvb-update-timeout, ... when they could be replaced by one single setting.

For this particular case, it does seem more like a scalability issue.

It's OK to add one timeout configuration, but in the future, we should check if this configuration can be reused in other scenarios.
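As a sketch of that single shared setting, assuming it is exposed as one server flag and reused by every generic "wait on the API server" call site (the flag name and field here are assumptions for the sketch):

```go
// Sketch: one shared, user-configurable timeout instead of many flags.
package server

import (
	"time"

	"github.com/spf13/pflag"
)

type serverConfig struct {
	// resourceTimeout is the single shared timeout applied to generic waits
	// on API resources (e.g. waiting for a VolumeSnapshotContent to be
	// deleted), instead of a separate flag per call site.
	resourceTimeout time.Duration
}

// bindServerFlags wires the shared timeout to a command-line flag.
func bindServerFlags(fs *pflag.FlagSet, cfg *serverConfig) {
	fs.DurationVar(&cfg.resourceTimeout, "resource-timeout", 10*time.Minute,
		"how long to wait on API resources during backup/restore before timing out")
}
```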

reasonerjt avatar Oct 12 '22 10:10 reasonerjt

The problem is that we're not using the same "default timeout" for everything right now. We were using 1 minute for some things, 10 minutes for others, etc. I think we're mostly in agreement here in that we don't really want a separate timeout for everything. We had already talked about grouping related wait situations so that the same timeout could be used for multiple things. But, for example, we were at one point waiting only one minute for restic repositories to be ready (I think we bumped that up to 1), but an hour for restic backup/restore, etc.

So yes, I think we all agree we want to reuse timeout configuration where possible. If all of the remaining hard-coded timeouts can be handled with one value, that's great. If anything that's currently set to 5 or 10 minutes, etc. can use a shared (configurable) timeout, that's fine with me. The main thing is that we don't want any of these values to be hard-coded and completely unmodifiable by users without building a custom Velero image.

sseago avatar Oct 12 '22 16:10 sseago

Hi @sseago @reasonerjt ,

I'm facing exactly the same issue.

Also, when I run velero backup describe on my backup, I get:

Total items to be backed up:  12445
Items backed up:              12445

Is this sufficient to consider my backup complete, or is the PartiallyFailed status (due to the "fail to retrieve VolumeSnapshotContent" issue) still relevant?

cguermi avatar Jan 12 '23 09:01 cguermi