PV Resize/Delete Stuck on Finalizer and Discrepancy Between Kubelet and NetApp Volume Usage
Describe the bug
Kubernetes kubelet volume metrics report significantly less usage for a PersistentVolume (PV) than what is actually consumed on the NetApp backend. As a result, the volume fills up on the NetApp side while the kubelet still reports only ~30-50% usage. When this happens, we are unable to resize or delete the volume/PVC; the resource gets stuck on its finalizer. The controller then logs errors indicating that it cannot find the volume mount:
time="2025-06-16T05:58:48Z" level=error msg="GRPC error: rpc error: code = NotFound desc = could not find volume mount at path: /var/lib/kubelet/pods/7d7e7cfe-41ab-4641-81bb-1a5f15908edb/volumes/kubernetes.io~csi/pvc-e25c7021-63c2-414a-a9f4-3b4beb22a4d6/mount; <nil>" logLayer=csi_frontend requestID=0b721dcd-20e5-4c06-9a75-5c57e259a3cf requestSource=CSI
Logs from kubectl describe pod:
Warning VolumeResizeFailed 10m (x12 over 20m) kubelet NodeExpandVolume.NodeExpandVolume failed for volume "pvc-ce47acbd-3850-4ba4-970e-252b232fa2f3" : Expander.NodeExpand failed to expand the volume : rpc error: code = Internal desc = unable to mount device; can't determine if directory /var/lib/kubelet/plugins/kubernetes.io/csi/csi.trident.netapp.io/d54d1f61157c9e1040d4c331f9419a80bacbfc10cafa2da929b02b8102444faa/globalmount/tmp_mnt exists; stat /var/lib/kubelet/plugins/kubernetes.io/csi/csi.trident.netapp.io/d54d1f61157c9e1040d4c331f9419a80bacbfc10cafa2da929b02b8102444faa/globalmount/tmp_mnt: input/output error
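For reference, these are roughly the checks used to confirm the stuck state. This is only a generic diagnostic sketch, not Trident-specific tooling; the object, namespace, node, and path names are placeholders.

```bash
# Check whether the PVC/PV are held in Terminating by finalizers
kubectl get pvc <pvc-name> -n <namespace> -o jsonpath='{.metadata.finalizers}{"\n"}'
kubectl get pv <pv-name> -o jsonpath='{.metadata.finalizers}{"\n"}'

# On the node that hosted the pod, check whether the mount path from the error still exists.
# An I/O error from stat (as in the NodeExpandVolume event above) points at a stale or broken mount.
ls -l /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount
mount | grep <pv-name>
```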
Environment
- Trident version: 23.10.0
- Container runtime: cri-o://1.28.8
- Kubernetes version: 1.28.12
- OS: RHEL 8
- NetApp backend types: ONTAP SAN
To Reproduce
- Create a PVC and mount it to a pod/service in K8s.
- Write data to fill the volume from within the pod.
- Monitor the NetApp backend, and see that the volume fills up.
- Observe that the kubelet metrics report the volume as only partly used, significantly lower than the actual usage on the backend (see the sketch after this list).
- Attempt to resize or delete the volume/PVC. It remains stuck on a finalizer.
- The controller logs errors about not finding the mount.
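To make the comparison in the steps above concrete, this is roughly how the discrepancy can be observed. Pod, PVC, and node names are placeholders; the kubelet stats summary and the kubelet_volume_stats_* Prometheus metrics come from the same source, so they show the same (too low) usage.

```bash
# Usage as seen from inside the pod (what the application experiences)
kubectl exec -n <namespace> <pod-name> -- df -h /<mount-path>

# Usage as reported by the kubelet for the same PVC
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" \
  | jq '.pods[].volume[]? | select(.pvcRef.name=="<pvc-name>")
        | {usedBytes, capacityBytes, availableBytes}'

# Compare with the backend; on ONTAP something along the lines of
# (exact fields vary by ONTAP version, so treat this as approximate):
#   volume show -volume <backing-volume> -fields size,used
```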
Expected behavior
- Kubelet volume usage metrics should accurately reflect the usage on the NetApp backend.
- Additionally, when the volume is full, it should be possible to resize or delete the PVC/PV without manually rebooting the node or force-deleting the PV/PVC.
Additional context
- The discrepancy between kubelet and NetApp volume metrics seems to affect services that rely on the reported free disk space.
- Inability to delete the PV/PVC causes workflow disruptions.
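For completeness, a rough sketch of the manual force-delete path mentioned under "Expected behavior" (clearing the finalizers by hand). This is a last resort, not a fix: it can leave the backing volume orphaned on the NetApp side, and the names below are placeholders.

```bash
# Clear the finalizers so the stuck PVC/PV can be deleted
kubectl patch pvc <pvc-name> -n <namespace> --type=merge -p '{"metadata":{"finalizers":null}}'
kubectl patch pv <pv-name> --type=merge -p '{"metadata":{"finalizers":null}}'

# The backing volume may then need to be cleaned up manually on the backend.
```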
Hello @shajia-deshaw, can this be reproduced in the latest release of Trident?
Sorry, we're still working on upgrading to 25.x. Is this a known issue in 23.x, or is there a link to a PR/patch that fixes it?
Hi @shajia-deshaw, are you using discard or trim mount options? This could get you closer to the usage metrics in ONTAP. There have been multiple bug fixes between 23.x and our most recent release.
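If it helps, a minimal sketch of the two usual ways to enable discard for block-backed volumes follows. The StorageClass name is a placeholder, and the mount option only takes effect for volumes (re)mounted after it is set; existing volumes can be trimmed on demand instead.

```bash
# Option 1: have the filesystem discard freed blocks automatically via the StorageClass
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ontap-san-discard        # placeholder name
provisioner: csi.trident.netapp.io
parameters:
  backendType: ontap-san
mountOptions:
  - discard
allowVolumeExpansion: true
EOF

# Option 2: trim an already-mounted volume on demand
# (requires fstrim/util-linux inside the container image)
kubectl exec -n <namespace> <pod-name> -- fstrim -v /<mount-path>
```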