snapscheduler
Retention policy removes last valid snapshot, leaving no possibility of recovery
Describe the bug
A VolumeSnapshot has the .status.readyToUse flag, which indicates whether the snapshot is ready to be used to restore a volume.
snapscheduler does not take this flag into account when deciding whether the maxCount retention limit has been reached.
This results in the loss of the last opportunity for recovery.
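For reference, the flag can be checked per snapshot with a plain kubectl query (custom-columns is standard kubectl; the namespace matches the reproduction steps below):

```console
$ kubectl -n default get volumesnapshot \
    -o custom-columns=NAME:.metadata.name,READY:.status.readyToUse
```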
Steps to reproduce in GKE (v1.28.11 in my case) with snapscheduler (v3.4.0) installed:
- create a PVC:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: snapscheduler-test
  namespace: default
  labels:
    snapscheduler-test: "true"
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: standard-rwo
```

- run some pod with the new PVC in order to create the volume:

```console
$ kubectl -n default run -it --rm snapscheduler-test --image=gcr.io/distroless/static-debian12 --overrides='{"spec": {"restartPolicy": "Never", "volumes": [{"name": "pvc", "persistentVolumeClaim":{"claimName": "snapscheduler-test"}}]}}' -- sh
```

- create a SnapshotSchedule:
```yaml
apiVersion: snapscheduler.backube/v1
kind: SnapshotSchedule
metadata:
  name: snapscheduler-test
  namespace: default
spec:
  claimSelector:
    matchLabels:
      snapscheduler-test: "true"
  retention:
    maxCount: 3
  schedule: "*/5 * * * *"
```
- wait 5-10 minutes and make sure that VolumeSnapshots are being created successfully:
```console
$ kubectl -n default get volumesnapshot
NAME                                                  READYTOUSE   SOURCEPVC            SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS   SNAPSHOTCONTENT                                    CREATIONTIME   AGE
snapscheduler-test-snapscheduler-test-202408301525   true         snapscheduler-test                           1Gi           p2p-csi         snapcontent-4f748e4d-80d8-4353-8819-a6efb2836821   87s            2m6s
```

- remove the compute disk in GCP (via the web UI or a gcloud command), simulating a human error:

```console
$ pv=$(kubectl -n default get pvc snapscheduler-test -ojsonpath='{.spec.volumeName}')
$ zone=$(gcloud --project=$GCP_PROJECT compute disks list --filter="name=($pv)"|grep pvc|awk '{print $2}')
$ gcloud --project p2p-data-warehouse compute disks delete $pv --zone $zone
```

- after 10 minutes there are two VolumeSnapshots with READYTOUSE=false:

```console
$ kubectl -n default get volumesnapshot
NAME                                                  READYTOUSE   SOURCEPVC            SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS   SNAPSHOTCONTENT                                    CREATIONTIME   AGE
snapscheduler-test-snapscheduler-test-202408301525   true         snapscheduler-test                           1Gi           p2p-csi         snapcontent-4f748e4d-80d8-4353-8819-a6efb2836821   10m            11m
snapscheduler-test-snapscheduler-test-202408301530   false        snapscheduler-test                                         p2p-csi         snapcontent-cec59c70-c186-44fd-99f8-9226192d7a6a                  6m38s
snapscheduler-test-snapscheduler-test-202408301535   false        snapscheduler-test                                         p2p-csi         snapcontent-d81644f4-eb28-4da9-94b5-d57f1972aeb3                  98s
```

- after 15 minutes we don't have any valid snapshot anymore (maxCount: 3 retention policy):

```console
$ kubectl -n default get volumesnapshot
NAME                                                  READYTOUSE   SOURCEPVC            SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS   SNAPSHOTCONTENT                                    CREATIONTIME   AGE
snapscheduler-test-snapscheduler-test-202408301530   false        snapscheduler-test                                         p2p-csi         snapcontent-cec59c70-c186-44fd-99f8-9226192d7a6a                  13m
snapscheduler-test-snapscheduler-test-202408301535   false        snapscheduler-test                                         p2p-csi         snapcontent-d81644f4-eb28-4da9-94b5-d57f1972aeb3                  8m6s
snapscheduler-test-snapscheduler-test-202408301540   false        snapscheduler-test                                         p2p-csi         snapcontent-b6113f79-3219-435d-8321-812ddc096154                  3m6s
```
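To double-check the end state, a jsonpath filter (standard kubectl syntax; this verification step was not part of the original reproduction) can list only the snapshots that are still usable, and at this point it prints nothing:

```console
$ kubectl -n default get volumesnapshot \
    -o jsonpath='{range .items[?(@.status.readyToUse==true)]}{.metadata.name}{"\n"}{end}'
```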
Expected behavior
❗ the retention policy must not take VolumeSnapshots with .status.readyToUse==false into account (see the sketch below)
❔ if possible, create a new snapshot only after the previous one has entered the ready state
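For the first expectation, a minimal Go sketch of what "ignore unready snapshots" could look like, assuming the retention code already has the list of VolumeSnapshots for a schedule. The types come from the external-snapshotter client API; the function name is hypothetical and this is not snapscheduler's actual code:

```go
package retention

import (
	snapv1 "github.com/kubernetes-csi/external-snapshotter/client/v6/apis/volumesnapshot/v1"
)

// readyOnly filters a snapshot list down to those with .status.readyToUse
// set to true, so unready snapshots neither count toward maxCount nor get
// selected for deletion.
func readyOnly(snaps []snapv1.VolumeSnapshot) []snapv1.VolumeSnapshot {
	var ready []snapv1.VolumeSnapshot
	for _, s := range snaps {
		if s.Status != nil && s.Status.ReadyToUse != nil && *s.Status.ReadyToUse {
			ready = append(ready, s)
		}
	}
	return ready
}
```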
Actual results
The retention policy removes the last valid snapshot, leaving no possibility of recovery.
Additional context
I agree... that's not good. I'm happy to have thoughts/suggestions on a good fix.
A few ideas:
- Only count readyToUse snapshots when implementing the cleanup policy. This runs the risk of creating an unbounded number of (unready) snapshots, potentially consuming all available space (or incurring excessive expense).
- Skip the next snapshot if the previous one is not ready. This will cause problems for environments where it takes a long time for a snapshot to become ready (e.g., AWS), causing SnapScheduler to miss intervals.
- If the policy determines that a snapshot should be deleted, delete unready snapshots (starting with the oldest) before ready ones (a rough sketch follows this list). This has the same problem as (2) in being unable to handle intervals that are shorter than the time it takes for a snapshot to become ready.
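A rough sketch of idea (3), again using the external-snapshotter client types. The helper names are hypothetical and only illustrate the deletion ordering; this is not snapscheduler's implementation:

```go
package retention

import (
	"sort"

	snapv1 "github.com/kubernetes-csi/external-snapshotter/client/v6/apis/volumesnapshot/v1"
)

func isReady(s snapv1.VolumeSnapshot) bool {
	return s.Status != nil && s.Status.ReadyToUse != nil && *s.Status.ReadyToUse
}

// pruneCandidates returns the snapshots to delete so that at most maxCount
// remain, preferring unready snapshots (oldest first) over ready ones.
func pruneCandidates(snaps []snapv1.VolumeSnapshot, maxCount int) []snapv1.VolumeSnapshot {
	if len(snaps) <= maxCount {
		return nil
	}
	sort.SliceStable(snaps, func(i, j int) bool {
		ri, rj := isReady(snaps[i]), isReady(snaps[j])
		if ri != rj {
			return !ri // unready snapshots sort first, so they are deleted first
		}
		// within each group, older snapshots are deleted first
		return snaps[i].CreationTimestamp.Before(&snaps[j].CreationTimestamp)
	})
	return snaps[:len(snaps)-maxCount]
}
```

With this ordering, the one ready snapshot in the reproduction above would survive pruning as long as any unready snapshot is available to delete instead.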