velero icon indicating copy to clipboard operation
velero copied to clipboard

Add design for Extending VolumePolicies to support more actions

Open shubham-pampattiwar opened this issue 1 year ago • 8 comments

This PR adds design for https://github.com/vmware-tanzu/velero/issues/6640

Please indicate you've done the following:

  • [x] Accepted the DCO. Commits without the DCO will delay acceptance.
  • [x] Created a changelog file or added /kind changelog-not-required as a comment on this pull request.
  • [ ] Updated the corresponding documentation in site/content/docs/main.

shubham-pampattiwar avatar Oct 13 '23 19:10 shubham-pampattiwar

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 61.88%. Comparing base (e65ef28) to head (bec3425). Report is 160 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6956      +/-   ##
==========================================
+ Coverage   61.76%   61.88%   +0.12%     
==========================================
  Files         262      266       +4     
  Lines       28440    29456    +1016     
==========================================
+ Hits        17567    18230     +663     
- Misses       9642     9934     +292     
- Partials     1231     1292      +61     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Oct 13 '23 19:10 codecov[bot]

It sounds like we need to completely step back. @shubham-pampattiwar has been designing for one set of requirements, but the other reviews seem to have a completely different set of requirements in mind. So I think we have to step away from design for a bit and go back to requirements. I think we have a couple of options here:

  1. Provide an alternative to the existing per-pod-volume annotations to determine when to use fs-backup. This is what the current design is based on -- workflow is unchanged from current annotations, we're just aiming to provide a more user-friendly means of selection.

  2. Provide a completely different selection mechanism that relies 100% on VolumePolicies and doesn't interact with defaultVolumesToFsBackup or the annotations at all. This is what the other reviewers seem to want here.

  3. will be the least risky to implement with the least amount of effort. 2) Might result in a more flexible approach, but it's completely incompatible with the legacy methods, so we would need validation to ensure that only one approach is used.

To expand on the validation needs -- backup.Spec.DefaultVolumesToFsBackup is a bool pointer. If this is set to true or false, then we're using the legacy opt-in/opt-out approach, and any volume policies referencing fs backup or snapshots should result in a validation error -- if this is true or false, then existing workflows will be used. If this is nil (unspecified), then volume policies are used to determine what to snapshot and what to use fsbackup for. We'll need to modify install/server to make this value a bool pointer instead of a plain bool -- currently, not specifying this on install/server results in the backup getting a false rather than a nil value here.

@shubham-pampattiwar what do you think?

sseago avatar Oct 31 '23 13:10 sseago

@reasonerjt @qiuming-best @sseago I have re-done the design, please take another look, Thanks !

shubham-pampattiwar avatar Jan 16 '24 23:01 shubham-pampattiwar

Some notes from latest community call: https://hackmd.io/Jq6F5zqZR7S80CeDWUklkA#Discussion-topics

My personal understanding from the call last night is that:

  1. VolumePolicies take precedence over legecy opt-in/opt-out labeling approaches
  2. If there is no matching VolumePolicy on a volume, the velero defers to the opt-in/opt-out approach
  3. Finally if there is no match for a volumePolicy and no match for opt-in/out method, the velero server install default (FSBackupOrSnapShot) will be authoritative.

@shubham-pampattiwar I'm off base I can edit or delete this comment.

weshayutin avatar Jan 31 '24 14:01 weshayutin

Updated the PR based on the discussions in community call.

shubham-pampattiwar avatar Jan 31 '24 20:01 shubham-pampattiwar

@weshayutin "3. Finally if there is no match for a volumePolicy and no match for opt-in/out method, the velero server install default (FSBackupOrSnapShot) will be authoritative."

This is not quite correct. the defaultVolumesToFsBackup field is part of the opt-in/opt-out approach described in 2. So your first two bullets are the entirety of it. "defaultVolumesToFsBackup=true" -> "opt-out", "defaultVolumesToFsBackup=false" -> "opt-in"

sseago avatar Feb 01 '24 13:02 sseago

@Lyndon-Li @reasonerjt @sseago Updated design sections (High-level, Detailed Design and Implementation) to reflect changes needed to account for snapshot action for csi snapshot PVCs, PTAL ! Thank you !

shubham-pampattiwar avatar Feb 28 '24 19:02 shubham-pampattiwar

@reasonerjt Updated design implementation to use a volumehelper package, PTAL, Thanks !

shubham-pampattiwar avatar Mar 08 '24 19:03 shubham-pampattiwar

@shubham-pampattiwar I see there are still some opening comments, could you help to double check them? For the ones that are done, please resolve them; for the open ones, let's discuss in tomorrow's community meeting. Thanks!

Lyndon-Li avatar Mar 26 '24 11:03 Lyndon-Li

@Lyndon-Li Done, updated the design and resolved the comments. @reasonerjt Latest iteration of design updates the implementation parts and introduces a VolumeHelper interface and a VolumePolicyHelper struct which implements the VolumeHelper interface.

shubham-pampattiwar avatar Mar 26 '24 20:03 shubham-pampattiwar

Thanks for working on this (and Velero). I greatly appreciate the backups Velero is managing for me. I came here via the issue referenced above.

For my use cases the fs-backup feature is fine for many use cases (and I appreciate that it's explicit and the configuration is close to the workload configuration).

I use snapshot backups for some workloads where I need a "as if the computer crashed" consistency of the file system and I use data mover to move the data off the infrastructure the cluster is running in. This PR doesn't seem to support that use case.

The feature I'm missing currently (and I can't figure out from the resource filtering documentation is doing volume snapshots (and data mover) on some PVCs in a namespace instead of all.

For example I have a namespace running clickhouse databases and it's a mix of 1TB data that gets replaced within a day and about 100GB data that changes much less (and is more valuable). I'd like to do snapshots and data mover on the 100GB data without also having to copy the 1TB other data. (I can't easily divide them between namespaces as clickhouse or the CH operator has features that helps me generate and update the small data set from the large data set within the cluster).

abh avatar Mar 29 '24 06:03 abh

@abh Do you mean you want to specify data mover as the backup type for some of the volumes? I think this has been discussed --- only fs-backup vs. snapshot could be specified per volume; then if snapshot is specified, you need to use the existing per backup option --snapshot-move-data to make it for data mover. For sure, this would miss one functionality --- in a backup, among the volumes going with snapshot, I want some volumes to go with data mover, some not. This may be covered by future extension of this design.

@shubham-pampattiwar Please correct me if my understanding is wrong.

Lyndon-Li avatar Mar 29 '24 08:03 Lyndon-Li

@asaf-erlich If your goal is to back up with a mixture of fs-backup and datamover, that is absolutely supported by this design. Policies will identify volumes for fs-backup and snapshot, and setting --snapshot-move-data=true tells velero to use datamover for all snapshotted volumes. The only use case that's not supported in this design is if you want to use CSI snapshots without data movement for some volumes (leaving VolumeSnapshotContents in the cluster) and CSI snapshots with data movement for others, but from the use case you described, that doesn't seem relevant here.

sseago avatar Mar 29 '24 13:03 sseago

@reasonerjt PTAL, Thanks !

shubham-pampattiwar avatar Apr 01 '24 22:04 shubham-pampattiwar

Do you mean you want to specify data mover as the backup type for some of the volumes?

For me the important part is to be able to choose which volumes are being backed up. How I understand the current settings I can only choose by namespace when I want snapshot+data-mover.

In my use case I want all the snapshots to be with data mover (and I don't think there are any scenarios where fs-backup is better if snapshot+data mover is possible?).

For my use cases having the label on the pod (or on the PVC) is useful in many scenarios because it's so explicit. If the configuration is more complicated I'd worry I mistype a glob match or a label selector and accidentally don't have backups.

abh avatar Apr 02 '24 00:04 abh

@reasonerjt Thanks for the review, need a re-ack, fixed CI errors related to codespell. @sseago Please take another look, Thank you !

shubham-pampattiwar avatar Apr 02 '24 16:04 shubham-pampattiwar