Trident SnapMirror does not allow cleanup of the last snapshot after a successful failover

Open ddandreaisp opened this issue 8 months ago • 7 comments

Describe the bug
We use the Trident CSI SnapMirror feature by managing tridentmirrorrelationship and tridentactionmirrorupdate custom resources from our own automation (no Trident Protect or Astra Control Center). When a graceful failover is performed from the primary site "openshift-cluster-A" to the secondary site "openshift-cluster-B", the following actions are taken (a sketch of the per-PVC sequence is shown after the list):

  • "openshift-cluster-A": application is stopped (PVCs are no more in use)

For each replicated PVC associated with a tridentmirrorrelationship:

  • "openshift-cluster-A": take last snapshot (create a volumesnapshot resource) and wait to be ReadyToUse and get its snapshot handle
  • "openshift-cluster-B": create a tridentactionmirrorupdate CR referencing the corresponding snapshot handle from "openshift-cluster-A"
  • "openshift-cluster-B": wait for tridentactionmirrorupdate to reach "Succeeded" state
  • "openshift-cluster-B": set tridentmirrorrelationship .spec.state to "promoted" and wait for .status.state to reach "promoted" value
  • "openshift-cluster-A": set tridentmirrorrelationship .spec.state to "reestablished" and wait for .status.state to become reach "reestablished" state
  • "openshift-cluster-A": deleting last volumesnapshot taken before the failover

The last operation works with versions 24.02 and 24.06 but fails with 24.10 and later releases, with this message in the trident-controller logs:

level=error msg="Unable to delete snapshot from backend." backendName=tbc-cvohatest05 backendUUID=6b78a294-1a4c-40a8-85df-7f06439b63b4 error="error deleting snapshot: API status: failed, Reason: This Snapshot copy is currently used as a reference Snapshot copy by one or more SnapMirror relationships. Deleting the Snapshot copy can cause future SnapMirror operations to fail. , Code: 13012" logLayer=core requestID=09443fb3-8f54-47bf-8c67-e0377a9fc203 requestSource=CSI snapshotID=pvc-2cf349a9-48b5-41a1-bc6d-be77c526b8be/snapshot-982f86fa-03f3-4de5-ad11-9c244f7d22da volume=pvc-2cf349a9-48b5-41a1-bc6d-be77c526b8be workflow="snapshot=delete"

We are aware that the latest snapshot in a Trident mirror relationship must not be deleted, but that restriction should only hold while the volume is acting as the primary. Once a complete graceful failover has been executed, the driver should allow the deletion of this "residual" snapshot. We need to delete this old snapshot to avoid volume space saturation.

Environment

  • Trident version: 24.10.x and next releases
  • Trident installation flags used: N/A (Trident installed via Helm chart + operator)
  • Container runtime: RuntimeName: cri-o RuntimeVersion: 1.27.8-8.rhaos4.14.git17cbe6d.el9 RuntimeApiVersion: v1
  • Kubernetes version: v1.27.16+03a907c
  • Kubernetes orchestrator: Openshift 4.14.38
  • Kubernetes enabled feature gates:
  • OS: Red Hat Enterprise Linux CoreOS release 4.14
  • NetApp backend types: ontap-nas
  • Other:

To Reproduce

  • Required environment: two OpenShift clusters pointing to two different, peered ontap-nas backends
  • create a valid tridentmirrorrelationship between a PVC on cluster A and cluster B (an example manifest is shown after this list)
  • populate some data in PVC on cluster A
  • perform a few sync actions from A to B (snapshot + tridentactionmirrorupdate, as described in the NetApp docs)
  • stop any pod using the PVC on site A
  • execute the failover operations:
      - Site A: take the last snapshot
      - Site B: last sync using a tridentactionmirrorupdate (TAMU) CR
      - Site B: promote using its TMR CR
      - Site A: reestablish using its TMR CR
      - Site A: delete the last snapshot: the snapshot remains in "pending delete" on site A and error messages are generated by the trident-controller pod
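
For the tridentmirrorrelationship creation step above, this is a minimal sketch of what we apply on the secondary cluster (name, namespace and the remote volume handle are placeholders; the fields follow the trident.netapp.io/v1 examples in the NetApp SnapMirror documentation, and the rest of the relationship setup is done as described there):

```sh
# cluster B (secondary): mirror the ONTAP volume that backs the PVC on cluster A
cat <<'EOF' | kubectl --context cluster-b apply -f -
apiVersion: trident.netapp.io/v1
kind: TridentMirrorRelationship
metadata:
  name: pvc-1-mirror
  namespace: app-ns
spec:
  state: established
  volumeMappings:
    - localPVCName: pvc-1                       # PVC on this (secondary) cluster
      remoteVolumeHandle: "svm_a:trident_pvc_x" # placeholder <source SVM>:<source FlexVol>
EOF
```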

Expected behavior
Allow the deletion of the "residual" snapshot(s) on the "old primary" site (openshift-cluster-A in the example).

Additional context
We opened case 2010360852, but so far no in-depth explanation has been given.

ddandreaisp avatar Apr 11 '25 13:04 ddandreaisp

@clintonk, since I noticed you are one of the main contributors, could you kindly advise on this? Thank you.

ddandreaisp avatar Apr 16 '25 12:04 ddandreaisp

@ddandreaisp Could you answer a couple of questions to help us diagnose this issue?

  1. What version of ONTAP are you using? The error on the snapshot delete is coming from the ONTAP API when Trident attempts the delete.
  2. Does the snapshot ever delete, or is it stuck indefinitely in the deleting state? It is possible that, with the "reestablish" on Site A, the newest snapshot taken during the reestablish isn't synced and ready yet, so the backend still thinks your snapshot is the latest.
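
If it helps, a rough way to check that on the ONTAP side would be something like the following (paths are placeholders, and the exact field names may vary slightly by ONTAP version):

```sh
# on the cluster hosting the old-primary volume (site A after the reestablish):
# which snapshot the reversed relationship currently treats as its newest/reference snapshot
snapmirror show -destination-path svm_a:trident_pvc_x -fields state,status,newest-snapshot

# on the volume where the pre-failover snapshot lives: is it still owned by SnapMirror?
volume snapshot show -vserver svm_a -volume trident_pvc_x -fields owners
```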

torirevilla avatar Apr 16 '25 16:04 torirevilla

Hi @torirevilla, I'm collaborating with @ddandreaisp, so I can provide the additional info:

  1. ONTAP version is 9.14.1
  2. The Kubernetes snapshot stays stuck in the deleting state (with deletionTimestamp set on the resource), while error messages are produced in the Trident CSI driver logs

In our tests we found no way to complete this delete action, even when new snapshots are propagated from the new primary site, which makes the deleting snapshot "obsolete" and therefore removable.
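
For reference, this is roughly what we look at to confirm it never completes (names are placeholders; the controller deployment and container names may differ depending on how Trident was installed):

```sh
# the VolumeSnapshot and its (cluster-scoped) VolumeSnapshotContent both keep a
# deletionTimestamp and their finalizers, and are never removed
kubectl --context cluster-a -n app-ns get volumesnapshot pvc-1-final-snap \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'
VSC=$(kubectl --context cluster-a -n app-ns get volumesnapshot pvc-1-final-snap \
  -o jsonpath='{.status.boundVolumeSnapshotContentName}')
kubectl --context cluster-a describe volumesnapshotcontent "$VSC"

# trident-controller keeps retrying and logging the same ONTAP 13012 error,
# even after later TridentActionMirrorUpdate syncs from the new primary have succeeded
kubectl --context cluster-a -n trident logs deploy/trident-controller -c trident-main \
  --since=1h | grep "Code: 13012"
```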

As described in the issue, we have been observing this behaviour in our automation since Trident 24.10.

rdemarinis avatar Apr 17 '25 07:04 rdemarinis

Hi @torirevilla any hints on this?

Thank you

ddandreaisp avatar Apr 23 '25 09:04 ddandreaisp

Hi @ddandreaisp, the support team is looking into the case you opened and is still identifying the root cause.

torirevilla avatar Apr 23 '25 14:04 torirevilla

@ddandreaisp The issue has been reproduced in house and an internal bug has been created.

praveene12 avatar Jun 19 '25 10:06 praveene12

Hello, I am following up in this thread because we waited, as requested, for version 25.10. However, our NetApp representative informed us at the beginning of November that "we are currently investigating with our engineering team because, although Trident 25.10 has been officially released, the fix for bug TRID-17670 does not appear in the release notes."

That said, we have requested updates on this matter but have not received any so far. Therefore, I kindly ask you to confirm whether this information is accurate and to let us know how long we should expect to wait for the fix to be implemented and made available.

Thank you for your time and cooperation.

ddandreaisp avatar Nov 28 '25 15:11 ddandreaisp