snapscheduler
Retention policy removes last valid snapshot, leaving no possibility of recovery
Describe the bug
A VolumeSnapshot has the .status.readyToUse flag, which indicates whether the snapshot is ready to be used to restore a volume.
snapscheduler does not take this flag into account when deciding whether the maxCount retention limit has been reached.
This results in the loss of the last opportunity for recovery.
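For reference, the flag can be checked per snapshot with a plain kubectl query (custom-columns is standard kubectl; the namespace matches the reproduction steps below):

```console
$ kubectl -n default get volumesnapshot \
    -o custom-columns=NAME:.metadata.name,READY:.status.readyToUse
```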
Steps to reproduce in GKE (v1.28.11 in my case) with snapscheduler (v3.4.0) installed:
- create a PVC:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: snapscheduler-test
  namespace: default
  labels:
    snapscheduler-test: "true"
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: standard-rwo
```

- run some pod with the new PVC in order to create the volume:

```console
$ kubectl -n default run -it --rm snapscheduler-test --image=gcr.io/distroless/static-debian12 --overrides='{"spec": {"restartPolicy": "Never", "volumes": [{"name": "pvc", "persistentVolumeClaim":{"claimName": "snapscheduler-test"}}]}}' -- sh
```

- create a SnapshotSchedule:
```yaml
apiVersion: snapscheduler.backube/v1
kind: SnapshotSchedule
metadata:
  name: snapscheduler-test
  namespace: default
spec:
  claimSelector:
    matchLabels:
      snapscheduler-test: "true"
  retention:
    maxCount: 3
  schedule: "*/5 * * * *"
```
- wait 5-10 minutes and make sure that VolumeSnapshots are being created successfully:
```console
$ kubectl -n default get volumesnapshot
NAME                                                  READYTOUSE   SOURCEPVC            SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS   SNAPSHOTCONTENT                                    CREATIONTIME   AGE
snapscheduler-test-snapscheduler-test-202408301525   true         snapscheduler-test                           1Gi           p2p-csi         snapcontent-4f748e4d-80d8-4353-8819-a6efb2836821   87s            2m6s
```

- remove the compute disk in GCP (via the web UI or a gcloud command), simulating a human error:

```console
$ pv=$(kubectl -n default get pvc snapscheduler-test -ojsonpath='{.spec.volumeName}')
$ zone=$(gcloud --project=$GCP_PROJECT compute disks list --filter="name=($pv)"|grep pvc|awk '{print $2}')
$ gcloud --project p2p-data-warehouse compute disks delete $pv --zone $zone
```

- after 10 minutes there are two VolumeSnapshots with READYTOUSE=false:

```console
$ kubectl -n default get volumesnapshot
NAME                                                  READYTOUSE   SOURCEPVC            SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS   SNAPSHOTCONTENT                                    CREATIONTIME   AGE
snapscheduler-test-snapscheduler-test-202408301525   true         snapscheduler-test                           1Gi           p2p-csi         snapcontent-4f748e4d-80d8-4353-8819-a6efb2836821   10m            11m
snapscheduler-test-snapscheduler-test-202408301530   false        snapscheduler-test                                         p2p-csi         snapcontent-cec59c70-c186-44fd-99f8-9226192d7a6a                  6m38s
snapscheduler-test-snapscheduler-test-202408301535   false        snapscheduler-test                                         p2p-csi         snapcontent-d81644f4-eb28-4da9-94b5-d57f1972aeb3                  98s
```

- after 15 minutes we don't have any valid snapshot anymore (maxCount: 3 retention policy):

```console
$ kubectl -n default get volumesnapshot
NAME                                                  READYTOUSE   SOURCEPVC            SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS   SNAPSHOTCONTENT                                    CREATIONTIME   AGE
snapscheduler-test-snapscheduler-test-202408301530   false        snapscheduler-test                                         p2p-csi         snapcontent-cec59c70-c186-44fd-99f8-9226192d7a6a                  13m
snapscheduler-test-snapscheduler-test-202408301535   false        snapscheduler-test                                         p2p-csi         snapcontent-d81644f4-eb28-4da9-94b5-d57f1972aeb3                  8m6s
snapscheduler-test-snapscheduler-test-202408301540   false        snapscheduler-test                                         p2p-csi         snapcontent-b6113f79-3219-435d-8321-812ddc096154                  3m6s
```
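To double-check the end state, a jsonpath filter (standard kubectl syntax; this verification step was not part of the original reproduction) can list only the snapshots that are still usable, and at this point it prints nothing:

```console
$ kubectl -n default get volumesnapshot \
    -o jsonpath='{range .items[?(@.status.readyToUse==true)]}{.metadata.name}{"\n"}{end}'
```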
Expected behavior
❗ the retention policy must not take VolumeSnapshots with .status.readyToUse==false into account (see the sketch below)
❔ if possible, create a new snapshot only after the previous one has entered the ready state
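For the first expectation, a minimal Go sketch of what "ignore unready snapshots" could look like, assuming the retention code already has the list of VolumeSnapshots for a schedule. The types come from the external-snapshotter client API; the function name is hypothetical and this is not snapscheduler's actual code:

```go
package retention

import (
	snapv1 "github.com/kubernetes-csi/external-snapshotter/client/v6/apis/volumesnapshot/v1"
)

// readyOnly filters a snapshot list down to those with .status.readyToUse
// set to true, so unready snapshots neither count toward maxCount nor get
// selected for deletion.
func readyOnly(snaps []snapv1.VolumeSnapshot) []snapv1.VolumeSnapshot {
	var ready []snapv1.VolumeSnapshot
	for _, s := range snaps {
		if s.Status != nil && s.Status.ReadyToUse != nil && *s.Status.ReadyToUse {
			ready = append(ready, s)
		}
	}
	return ready
}
```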
Actual results
The retention policy removes the last valid snapshot, leaving no possibility of recovery.
Additional context
I agree... that's not good. I'm happy to have thoughts/suggestions on a good fix.
A few ideas:
- Only count readyToUse snapshots when implementing the cleanup policy. This runs the risk of creating an unbounded number of (unready) snapshots, potentially consuming all available space (or incurring excessive expense).
- Skip the next snapshot if the previous one is not ready. This will cause problems for environments where it takes a long time for a snapshot to become ready (e.g., AWS), causing SnapScheduler to miss intervals.
- If the policy determines that a snapshot should be deleted, delete unready snapshots (starting with the oldest) before ready ones (a rough sketch follows this list). This has the same problem as (2) in being unable to handle intervals that are shorter than the time it takes for a snapshot to become ready.
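A rough sketch of idea (3), again using the external-snapshotter client types. The helper names are hypothetical and only illustrate the deletion ordering; this is not snapscheduler's implementation:

```go
package retention

import (
	"sort"

	snapv1 "github.com/kubernetes-csi/external-snapshotter/client/v6/apis/volumesnapshot/v1"
)

func isReady(s snapv1.VolumeSnapshot) bool {
	return s.Status != nil && s.Status.ReadyToUse != nil && *s.Status.ReadyToUse
}

// pruneCandidates returns the snapshots to delete so that at most maxCount
// remain, preferring unready snapshots (oldest first) over ready ones.
func pruneCandidates(snaps []snapv1.VolumeSnapshot, maxCount int) []snapv1.VolumeSnapshot {
	if len(snaps) <= maxCount {
		return nil
	}
	sort.SliceStable(snaps, func(i, j int) bool {
		ri, rj := isReady(snaps[i]), isReady(snaps[j])
		if ri != rj {
			return !ri // unready snapshots sort first, so they are deleted first
		}
		// within each group, older snapshots are deleted first
		return snaps[i].CreationTimestamp.Before(&snaps[j].CreationTimestamp)
	})
	return snaps[:len(snaps)-maxCount]
}
```

With this ordering, the one ready snapshot in the reproduction above would survive pruning as long as any unready snapshot is available to delete instead.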