linstor-csi

Snapshots not deleted properly, causing orphaned snapshots and eventual system overload

Open boedy opened this issue 5 months ago • 4 comments

We’ve encountered a persistent issue where snapshots are not being properly deleted from the LINSTOR system, resulting in a large number of orphaned snapshots that are putting significant strain on our Kubernetes cluster. This problem has caused severe performance degradation and may have contributed to recent crashes in our LINSTOR controller.

Last week, our cluster went down, likely due to this issue. When the cluster came back online, the LINSTOR controller was unable to start as the datastore seemed to have been corrupted. This issue has persisted across multiple controller restarts. I initially reported this on the LINSTOR forum, where I also outlined the steps I took to get the controller running again.

Context

We are creating hourly snapshots via Velero, which are retained for 7 days. However, many snapshots are not being deleted correctly from LINSTOR, leading to a significant buildup of orphaned snapshots. Despite using a VolumeSnapshotClass with the deletion policy set to Delete, these snapshots remain in the LINSTOR system even after the corresponding VolumeSnapshotContent and PVC objects are deleted in Kubernetes.

Over time, a large number of snapshots (approximately 2500+) accumulated in the LINSTOR system, though the corresponding PVCs and VolumeSnapshotContent objects no longer existed.
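For anyone trying to reproduce the comparison, here is a minimal sketch of how orphaned snapshots could be identified. It is not the cleanup script mentioned in this report. It assumes that the LINSTOR snapshot name matches `status.snapshotHandle` on the VolumeSnapshotContent, which should be verified against your own cluster first. The function names are hypothetical, and how the list of LINSTOR snapshot names is obtained is left open.

```python
import json
from typing import Iterable, List, Set


def snapshot_handles(vsc_list_json: str) -> Set[str]:
    """Collect status.snapshotHandle values from the JSON printed by
    `kubectl get volumesnapshotcontent -o json`."""
    items = json.loads(vsc_list_json).get("items", [])
    handles = set()
    for item in items:
        # Objects that are not yet bound may have no status or no handle.
        handle = (item.get("status") or {}).get("snapshotHandle")
        if handle:
            handles.add(handle)
    return handles


def orphaned_snapshots(linstor_snapshots: Iterable[str],
                       known_handles: Iterable[str]) -> List[str]:
    """Return LINSTOR snapshot names that no VolumeSnapshotContent refers to.

    ASSUMPTION: a LINSTOR snapshot name equals the CSI snapshot handle.
    """
    return sorted(set(linstor_snapshots) - set(known_handles))
```

Reviewing the resulting list by hand before deleting anything is advisable, since a snapshot that is still being created may not have a handle in Kubernetes yet.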

Upon investigation, I found that our cluster had over 30,000 PropsContainer records related to these orphaned snapshots, which made operations slow and timeouts more frequent. This likely contributed to LINSTOR controller crashes and resource corruption. Running the command kubectl get propscontainers.internal.linstor.linbit.com | wc -l took more than 40 seconds to complete.

I eventually used a script to manually clean up the orphaned snapshots, which reduced the PropsContainer records to around 838. However, the root cause of the snapshot deletion failure persists. Today, one week later, the cluster is in the following state:

velero backup get | wc -l                                     -->     28
linstor snapshot list | wc -l                                 -->    912
k get propscontainers.internal.linstor.linbit.com | wc -l     -->  12235 
k get volumesnapshotcontent | wc -l                           -->    133

Context

  • Velero version: 1.13.1
  • LINSTOR CSI driver version: v1.6.3-24ffba67ea151a0276bb418e65fd795b91779428
  • Piraeus v1.28.0

apiVersion: snapshot.storage.k8s.io/v1
deletionPolicy: Delete
driver: linstor.csi.linbit.com
kind: VolumeSnapshotClass
metadata:
  annotations:
    snapshot.storage.kubernetes.io/is-default-class: "true"
  name: default

linstor-csi-controller logs and snapshot of resources and snapshots

Unfortunately, the LINSTOR controller restarted, which prevents me from fetching the error reports listed in the logs.

time="2024-09-02T09:13:19Z" level=info msg="deleting volume" linstorCSIComponent=client volume=pvc-23f48d11-f801-450e-af00-bc8a3c3174b1
time="2024-09-02T09:13:19Z" level=error msg="method failed" error="rpc error: code = Internal desc = failed to delete volume: Message: 'Node: h-fsn-ded1, Resource: pvc-23f48d11-f801-450e-af00-bc8a3c3174b1 preparing for deletion.'; Details: 'Node: h-fsn-ded1, Resource: pvc-23f48d11-f801-450e-af00-bc8a3c3174b1 UUID is: abf8bedc-375c-41d4-833d-10f33c534e25' next error: Message: 'Preparing deletion of resource on 'h-fsn-ded1'' next error: Message: '(Node: 'h-fsn-ded4') Failed to create meta-data for DRBD volume pvc-23f48d11-f801-450e-af00-bc8a3c3174b1/0'; Reports: '[66B38BF5-0194E-001818]' next error: Message: 'Deletion of resource 'pvc-23f48d11-f801-450e-af00-bc8a3c3174b1' on node 'h-fsn-ded1' failed due to an unknown exception.'; Details: 'Node: h-fsn-ded1, Resource: pvc-23f48d11-f801-450e-af00-bc8a3c3174b1'; Reports: '[66CE38AA-00000-009040]'" linstorCSIComponent=driver method=/csi.v1.Controller/DeleteVolume nodeID= provisioner=linstor.csi.linbit.com req="volume_id:\"pvc-23f48d11-f801-450e-af00-bc8a3c3174b1\" " resp="<nil>" version=v1.6.3-24ffba67ea151a0276bb418e65fd795b91779428
time="2024-09-02T09:13:38Z" level=error msg="method failed" error="rpc error: code = Internal desc = failed to delete temporary snapshot ID: Message: 'Exception thrown.'; Details: 'com.linbit.linstor.transaction.TransactionException: Error creating rollback entry'; Reports: '[66CE38AA-00000-009041]'" linstorCSIComponent=driver method=/csi.v1.Controller/CreateSnapshot nodeID= provisioner=linstor.csi.linbit.com req="source_volume_id:\"pvc-6608a2a1-0d6a-4548-95ad-ee02facd1a88\" name:\"snapshot-e3432c8d-cfd0-4a5d-9546-1c9d21cf628e\" " resp="<nil>" version=v1.6.3-24ffba67ea151a0276bb418e65fd795b91779428
time="2024-09-02T09:15:04Z" level=error msg="method failed" error="rpc error: code = Internal desc = failed to delete temporary snapshot ID: Message: 'Exception thrown.'; Details: 'com.linbit.linstor.transaction.TransactionException: Error creating rollback entry'; Reports: '[66CE38AA-00000-009042]'" linstorCSIComponent=driver method=/csi.v1.Controller/CreateSnapshot nodeID= provisioner=linstor.csi.linbit.com req="source_volume_id:\"pvc-6608a2a1-0d6a-4548-95ad-ee02facd1a88\" name:\"snapshot-e3432c8d-cfd0-4a5d-9546-1c9d21cf628e\" " resp="<nil>" version=v1.6.3-24ffba67ea151a0276bb418e65fd795b91779428
time="2024-09-02T09:15:06Z" level=error msg="method failed" error="rpc error: code = Internal desc = failed to delete temporary snapshot ID: Message: 'Exception thrown.'; Details: 'com.linbit.linstor.transaction.TransactionException: Error creating rollback entry'; Reports: '[66CE38AA-00000-009043]'" linstorCSIComponent=driver method=/csi.v1.Controller/CreateSnapshot nodeID= provisioner=linstor.csi.linbit.com req="source_volume_id:\"pvc-9e20d899-1b94-46fb-80bc-f7c0df1801ea\" name:\"snapshot-ea77054e-1912-4846-af23-221edca35b78\" " resp="<nil>" version=v1.6.3-24ffba67ea151a0276bb418e65fd795b91779428
time="2024-09-02T09:15:07Z" level=error msg="method failed" error="rpc error: code = Internal desc = failed to delete temporary snapshot ID: Message: 'Exception thrown.'; Details: 'com.linbit.linstor.transaction.TransactionException: Error creating rollback entry'; Reports: '[66CE38AA-00000-009044]'" linstorCSIComponent=driver method=/csi.v1.Controller/CreateSnapshot nodeID= provisioner=linstor.csi.linbit.com req="source_volume_id:\"pvc-9e20d899-1b94-46fb-80bc-f7c0df1801ea\" name:\"snapshot-ea77054e-1912-4846-af23-221edca35b78\" " resp="<nil>" version=v1.6.3-24ffba67ea151a0276bb418e65fd795b91779428
time="2024-09-02T09:15:09Z" level=error msg="method failed" error="rpc error: code = Internal desc = failed to delete temporary snapshot ID: Message: 'Exception thrown.'; Details: 'com.linbit.linstor.transaction.TransactionException: Error creating rollback entry'; Reports: '[66CE38AA-00000-009045]'" linstorCSIComponent=driver method=/csi.v1.Controller/CreateSnapshot nodeID= provisioner=linstor.csi.linbit.com req="source_volume_id:\"pvc-9e20d899-1b94-46fb-80bc-f7c0df1801ea\" name:\"snapshot-ea77054e-1912-4846-af23-221edca35b78\" " resp="<nil>" version=v1.6.3-24ffba67ea151a0276bb418e65fd795b91779428
time="2024-09-02T09:15:13Z" level=error msg="method failed" error="rpc error: code = Internal desc = failed to delete temporary snapshot ID: Message: 'Exception thrown.'; Details: 'com.linbit.linstor.transaction.TransactionException: Error creating rollback entry'; Reports: '[66CE38AA-00000-009046]'" linstorCSIComponent=driver method=/csi.v1.Controller/CreateSnapshot nodeID= provisioner=linstor.csi.linbit.com req="source_volume_id:\"pvc-9e20d899-1b94-46fb-80bc-f7c0df1801ea\" name:\"snapshot-ea77054e-1912-4846-af23-221edca35b78\" " resp="<nil>" version=v1.6.3-24ffba67ea151a0276bb418e65fd795b91779428
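To make a long log like this easier to triage, here is a rough sketch that tallies error lines by CSI method and collects the LINSTOR error report IDs. The regular expressions are written against the line format shown above and are an assumption about the rest of the log. The function name `summarize` is hypothetical.

```python
import re
from collections import Counter

# Patterns based on the log lines in this report (assumed to hold for the full log).
LEVEL_RE = re.compile(r"\blevel=(\w+)")
METHOD_RE = re.compile(r"\bmethod=(\S+)")
REPORT_RE = re.compile(r"Reports: '\[([0-9A-F-]+)\]'")


def summarize(log_text):
    """Return (Counter of error lines per CSI method, list of report IDs)."""
    errors_by_method = Counter()
    report_ids = []
    for line in log_text.splitlines():
        level = LEVEL_RE.search(line)
        if not level or level.group(1) != "error":
            continue  # skip info lines and anything without a level field
        method = METHOD_RE.search(line)
        errors_by_method[method.group(1) if method else "unknown"] += 1
        report_ids.extend(REPORT_RE.findall(line))
    return errors_by_method, report_ids
```

Applied to the excerpt above, this should show one failed DeleteVolume call and six failed CreateSnapshot calls. The collected IDs could then be looked up with `linstor error-reports show <id>` while the controller still has them.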

linstor-csi.log resources.txt snapshots.txt

boedy, Sep 04 '24 12:09