
TridentVolume resource lingering when PVC force-deleted.

uberspot opened this issue 3 years ago • 2 comments

Describe the bug We have observed a large number of orphaned tridentVolume resources in our k8s cluster, stuck in state==deleting, and they appear to persist forever (some date back two years). In the trident csi daemonset logs we also see that, for many of them, Trident could not find the corresponding PVC. We cross-checked all of the tridentvolumes and found no corresponding PVCs in the cluster. Our guess is that the PVCs were force deleted by removing the pvc finalizer:

finalizers:
  - kubernetes.io/pvc-protection

What is also odd is that all the tridentVolumes in state==deleting are always orphaned=false, unless I'm misunderstanding the meaning of 'orphaned' below?

  orphaned: false
  pool: aggr1_uk2pna01cn02
  state: deleting

oc get tridentvolume -o jsonpath='{.items[?(@.state=="deleting")].metadata.name}' reports 380+ such resources on our test cluster.
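For reference, this is roughly how we cross-checked the stuck volumes against existing PVs (a rough sketch of our own; it assumes the Trident CRs live in the trident namespace and that each tridentvolume is named after its PV, i.e. pvc-<uid>):

  # List tridentvolumes stuck in state=deleting and flag those with no matching PV.
  for tv in $(oc get tridentvolume -n trident -o jsonpath='{.items[?(@.state=="deleting")].metadata.name}'); do
    if ! oc get pv "$tv" >/dev/null 2>&1; then
      echo "tridentvolume $tv is in state=deleting and has no matching PV"
    fi
  done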

The Trident CSI daemonset constantly logs errors like these:

[trident-csi-74f995cfc7-9rfns trident-main] time="2022-07-08T12:39:30Z" level=error msg="GRPC error: rpc error: code = NotFound desc = source volume pvc-3839723a-71bf-40e4-94b5-a372f67a023b not found" requestID=de138361-2454-4919-a674-167bc5ef1059 requestSource=CSI 
[trident-csi-74f995cfc7-9rfns csi-snapshotter] E0708 12:33:55.061877       1 snapshot_controller_base.go:265] could not sync content "snapcontent-9d5d528a-274b-4c88-bf2d-324ad001be46": failed to take snapshot of the volume pvc-9dd8d2c8-b5f9-4fc3-aea6-556948aa6e43: "rpc error: code = NotFound desc = source volume pvc-9dd8d2c8-b5f9-4fc3-aea6-556948aa6e43 not found" 

Also, removing the finalizer on the PVC doesn't explain why the tridentVolume state is == "deleting". If it is in that state, shouldn't Trident just delete it even though the PVC is no longer there? Is there some other piece of the puzzle missing here, or is our usage wrong?

Environment

  • Trident version: 22.01.0
  • Trident installation flags used: no custom yaml/configs other than what's on github
  • Container runtime: cri-o://1.21.5-3.rhaos4.8.gitaf64931.el8
  • Kubernetes version: v1.21.8
  • Kubernetes orchestrator: Openshift 4.9.37
  • Kubernetes enabled feature gates:
  • OS: Red Hat Enterprise Linux CoreOS 48.84.202203140855-0
  • NetApp backend types: ONTAP NAS
  • Other:

To Reproduce Steps to reproduce the behavior:

  • We're not sure exactly, but we observed lingering tridentVolume resources for PVCs that had been forcefully deleted by removing the pvc finalizer (this is partly an assumption).

Expected behavior Trident surfaces these failures in metrics that we can alert on, and potentially auto-cleans the lingering resources, perhaps not by default but behind a flag (?). Or, if our usage is wrong here, it would help to update the docs to cover this.

uberspot • Jul 08 '22 13:07

Hi @uberspot,

You may want to contact NetApp support and open a support case so that we can help you further.

What seems to be happening is that Trident has received a delete request for a PVC/PV. Trident places the volume in a "deleting" state but snapshots exist so the PV cannot be deleted until the snapshots are also deleted.

At some point the PVC is force deleted, which does require someone to remove the finalizer. This does remove the PVC object from Kubernetes but doesn't change the state of the Trident volume since the snapshots haven't been deleted.

If the user had instead deleted the volume snapshots associated with the volume first, the volume would have been deleted successfully.
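A rough way to check (the resource names and namespaces here are assumptions, not exact steps; adjust for your cluster) is to look for snapshot objects that still reference one of the stuck volumes:

  # Placeholder: one of the stuck volumes from the logs above.
  VOL=pvc-3839723a-71bf-40e4-94b5-a372f67a023b
  # Dynamically provisioned VolumeSnapshotContents reference the CSI volume via spec.source.volumeHandle.
  oc get volumesnapshotcontent -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.source.volumeHandle}{"\n"}{end}' | grep "$VOL"
  # Trident's own snapshot CRs usually embed the volume name, so a grep on names is a reasonable heuristic.
  oc get tridentsnapshot -n trident -o name | grep "$VOL"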

All of this seems plausible except perhaps for the error message you are seeing related to the failed snapshot. It appears that something is trying to take a snapshot of the volume after the PVC was force deleted?

gnarl • Aug 02 '22 19:08

Check for VolumeSnapshots; these should have been deleted first. If they weren't, additional steps can be needed to clean everything up: https://kb.netapp.com/Advice_and_Troubleshooting/Cloud_Services/Astra_Trident/Trident__backend_removal_remain_in_deleting_state_due_to_volume_in_same_situation
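A minimal sketch of that cleanup order, with placeholder names (follow the KB for the exact steps):

  # Delete the VolumeSnapshot(s) that still reference the volume.
  oc delete volumesnapshot <snapshot-name> -n <app-namespace>
  # If a VolumeSnapshotContent is left behind, it may need to be removed as well.
  oc delete volumesnapshotcontent <snapshotcontent-name>
  # Once nothing references the volume, Trident should finish deleting the
  # tridentvolume that is stuck in state=deleting; verify with:
  oc get tridentvolume <pvc-xxxxxxxx> -n trident -o yaml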

crossond • Aug 05 '22 18:08

The KB article was provided and Trident volume deletion has been explained. Closing the issue since there are no open questions.

uppuluri123 • Dec 04 '23 03:12