PV Resize/Delete Stuck on Finalizer and Discrepancy Between Kubelet and NetApp Volume Usage
Describe the bug
Kubernetes kubelet volume metrics report significantly less usage for a PersistentVolume (PV) than what is actually consumed on the NetApp backend. As a result, the volume fills up on the NetApp side while the kubelet still reports only ~30-50% usage. When this happens, we are unable to resize or delete the volume/PVC; the resource gets stuck on its finalizer. The controller then logs errors indicating that it cannot find the volume mount:
time="2025-06-16T05:58:48Z" level=error msg="GRPC error: rpc error: code = NotFound desc = could not find volume mount at path: /var/lib/kubelet/pods/7d7e7cfe-41ab-4641-81bb-1a5f15908edb/volumes/kubernetes.io~csi/pvc-e25c7021-63c2-414a-a9f4-3b4beb22a4d6/mount; <nil>" logLayer=csi_frontend requestID=0b721dcd-20e5-4c06-9a75-5c57e259a3cf requestSource=CSI
Logs from kubectl describe pod:
Warning VolumeResizeFailed 10m (x12 over 20m) kubelet NodeExpandVolume.NodeExpandVolume failed for volume "pvc-ce47acbd-3850-4ba4-970e-252b232fa2f3" : Expander.NodeExpand failed to expand the volume : rpc error: code = Internal desc = unable to mount device; can't determine if directory /var/lib/kubelet/plugins/kubernetes.io/csi/csi.trident.netapp.io/d54d1f61157c9e1040d4c331f9419a80bacbfc10cafa2da929b02b8102444faa/globalmount/tmp_mnt exists; stat /var/lib/kubelet/plugins/kubernetes.io/csi/csi.trident.netapp.io/d54d1f61157c9e1040d4c331f9419a80bacbfc10cafa2da929b02b8102444faa/globalmount/tmp_mnt: input/output error
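For reference, these are roughly the checks used to confirm the stuck state. This is only a generic diagnostic sketch, not Trident-specific tooling; the object, namespace, node, and path names are placeholders.

```bash
# Check whether the PVC/PV are held in Terminating by finalizers
kubectl get pvc <pvc-name> -n <namespace> -o jsonpath='{.metadata.finalizers}{"\n"}'
kubectl get pv <pv-name> -o jsonpath='{.metadata.finalizers}{"\n"}'

# On the node that hosted the pod, check whether the mount path from the error still exists.
# An I/O error from stat (as in the NodeExpandVolume event above) points at a stale or broken mount.
ls -l /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount
mount | grep <pv-name>
```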
Environment
- Trident version: 23.10.0
- Container runtime: cri-o://1.28.8
- Kubernetes version: 1.28.12
- OS: RHEL 8
- NetApp backend types: ONTAP SAN
To Reproduce
- Create a PVC and mount it to a pod/service in K8s.
- Write data to fill the volume from within the pod.
- Monitor the NetApp backend, and see that the volume fills up.
- Observe that the kubelet metrics report the volume as only partly used, significantly lower than the actual usage on the backend (see the sketch after this list).
- Attempt to resize or delete the volume/PVC. It remains stuck on a finalizer.
- The controller logs errors about not finding the mount.
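To make the comparison in the steps above concrete, this is roughly how the discrepancy can be observed. Pod, PVC, and node names are placeholders; the kubelet stats summary and the kubelet_volume_stats_* Prometheus metrics come from the same source, so they show the same (too low) usage.

```bash
# Usage as seen from inside the pod (what the application experiences)
kubectl exec -n <namespace> <pod-name> -- df -h /<mount-path>

# Usage as reported by the kubelet for the same PVC
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" \
  | jq '.pods[].volume[]? | select(.pvcRef.name=="<pvc-name>")
        | {usedBytes, capacityBytes, availableBytes}'

# Compare with the backend; on ONTAP something along the lines of
# (exact fields vary by ONTAP version, so treat this as approximate):
#   volume show -volume <backing-volume> -fields size,used
```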
Expected behavior
- Kubelet volume usage metrics should accurately reflect the usage on the NetApp backend.
- Additionally, when the volume is full, it should be possible to resize or delete the PVC/PV without manually rebooting the node or force-deleting the PV/PVC.
Additional context
- The discrepancy between kubelet and NetApp volume metrics seems to affect services that rely on the reported free disk space.
- Inability to delete the PV/PVC causes workflow disruptions.
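For completeness, a rough sketch of the manual force-delete path mentioned under "Expected behavior" (clearing the finalizers by hand). This is a last resort, not a fix: it can leave the backing volume orphaned on the NetApp side, and the names below are placeholders.

```bash
# Clear the finalizers so the stuck PVC/PV can be deleted
kubectl patch pvc <pvc-name> -n <namespace> --type=merge -p '{"metadata":{"finalizers":null}}'
kubectl patch pv <pv-name> --type=merge -p '{"metadata":{"finalizers":null}}'

# The backing volume may then need to be cleaned up manually on the backend.
```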
Hello @shajia-deshaw, can this be reproduced in the latest release of Trident?
Sorry, we're still working on upgrading to 25.x. Is this a known issue in 23.x, or is there a link to a PR/patch that fixes it?
Hi @shajia-deshaw, are you using discard or trim mount options? This could get you closer to the usage metrics in ONTAP. There have been multiple bug fixes between 23.x and our most recent release.
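If it helps, a minimal sketch of the two usual ways to enable discard for block-backed volumes follows. The StorageClass name is a placeholder, and the mount option only takes effect for volumes (re)mounted after it is set; existing volumes can be trimmed on demand instead.

```bash
# Option 1: have the filesystem discard freed blocks automatically via the StorageClass
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ontap-san-discard        # placeholder name
provisioner: csi.trident.netapp.io
parameters:
  backendType: ontap-san
mountOptions:
  - discard
allowVolumeExpansion: true
EOF

# Option 2: trim an already-mounted volume on demand
# (requires fstrim/util-linux inside the container image)
kubectl exec -n <namespace> <pod-name> -- fstrim -v /<mount-path>
```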