Trident operator doesn't clean up iscsi devices after pod moved
Describe the bug We regularly see iSCSI devices with 2 failed paths on our worker nodes. In some cases the device is active on another worker (possibly because the pod moved), but in other cases the device does not exist anymore. We therefore suspect that the Trident operator fails to clean up unnecessary iSCSI devices on the workers.
Output of the multipath -ll command:
3600a0980383147586e5d536f7839776d dm-12 NETAPP,LUN C-Mode
size=50G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=0 status=enabled
| `- 3:0:0:2 sdb 8:16 failed faulty running
`-+- policy='service-time 0' prio=0 status=enabled
  `- 4:0:0:2 sdc 8:32 failed faulty running
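For reference, the failed paths can be cross-checked against the live SCSI and iSCSI state on the worker with standard tools; the WWID and the sdb/sdc names below are simply taken from the output above:
# Show the stale map and its path devices
multipath -ll 3600a0980383147586e5d536f7839776d
# Check whether the underlying SCSI block devices still exist
ls -l /sys/block/sdb/device /sys/block/sdc/device
# List the iSCSI sessions and the disks attached to them
iscsiadm -m session -P 3 | grep -E 'Target:|Attached scsi disk'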
Environment
- Trident version: 23.01.0
- Trident installation flags used: -n trident
- Container runtime: Docker 23.0.1
- Kubernetes version: 1.23.10
- Kubernetes orchestrator: Rancher 2.6.9
- Kubernetes enabled feature gates: none
- OS: RHEL 8.7
- NetApp backend types: SVM
- Other:
To Reproduce We haven't found a way to reproduce the issue.
Expected behavior I expect the Trident operator to clean up all unused iSCSI devices on worker nodes.
Hi @nheinemans,
The Trident Operator doesn't have access to the K8S worker nodes where the Trident Node Plugin is running. Trident v23.01 added iSCSI self-healing logic to handle situations such as a data LIF being temporarily unreachable, healing that iSCSI path once the data LIF becomes available again.
The "failed faulty running" message is not a commonly reported issue by customers. We are unsure of how your environment reached that state. It is recommended that the next time that this error is observed that you open a support case with NetApp Support so that all of the needed Trident log files can be collected to help determine the cause of this multipath error.
time="2023-03-01T14:45:55Z" level=error msg="process killed after timeout" process=umount requestID=30854a38-a950-428b-bc22-e155b27afa51 requestSource=CSI
time="2023-03-01T14:45:55Z" level=error msg="Umount failed, attempting to force umount" error="process killed after timeout" requestID=30854a38-a950-428b-bc22-e155b27afa51 requestSource=CSI
time="2023-03-01T14:46:05Z" level=error msg="process killed after timeout" process=umount requestID=f87bd04a-7167-444b-8c05-9c1c7f74024e requestSource=CSI
time="2023-03-01T14:46:05Z" level=error msg="Umount failed, attempting to force umount" error="process killed after timeout" requestID=f87bd04a-7167-444b-8c05-9c1c7f74024e requestSource=CSI
time="2023-03-01T14:46:15Z" level=error msg="process killed after timeout" process=umount requestID=318bc1f7-d155-4aab-b85a-5ea9e5f6e5cc requestSource=CSI
time="2023-03-01T14:46:15Z" level=error msg="Umount failed, attempting to force umount" error="process killed after timeout" requestID=318bc1f7-d155-4aab-b85a-5ea9e5f6e5cc requestSource=CSI
time="2023-03-01T14:46:18Z" level=error msg="GRPC error: rpc error: code = NotFound desc = could not find volume mount at path: /var/lib/kubelet/pods/dbdbd48d-cbe5-4654-b02f-403cf7cf9809/volumes/kubernetes.io~csi/pvc-26c09799-1143-453c-aee6-1368308b6
f1a/mount; <nil>" requestID=b0772cb9-622b-45c8-aa58-f078014a7fdc requestSource=CSI
time="2023-03-01T14:46:18Z" level=error msg="GRPC error: rpc error: code = NotFound desc = could not find volume mount at path: /var/lib/kubelet/pods/dbdbd48d-cbe5-4654-b02f-403cf7cf9809/volumes/kubernetes.io~csi/pvc-a4818a53-4139-4a2f-86c1-94fe85e65
a71/mount; <nil>" requestID=0f7a28d3-7b16-4f80-b091-707742787930 requestSource=CSI
time="2023-03-01T14:46:25Z" level=error msg="process killed after timeout" process=umount requestID=5569fc1d-7546-4a10-a225-66c5bd0a9031 requestSource=CSI
time="2023-03-01T14:46:25Z" level=error msg="Umount failed, attempting to force umount" error="process killed after timeout" requestID=5569fc1d-7546-4a10-a225-66c5bd0a9031 requestSource=CSI
time="2023-03-01T14:46:25Z" level=error msg="Flush pre-check failed for the device." device=/dev/dm-15 error="exit status 1" output="Mar 01 15:46:25 | /dev/dm-15: no usable paths found\n" requestID=6b215e75-4c25-4f94-a61e-80de5501b955 requestSource=C
SI
time="2023-03-01T14:46:26Z" level=error msg="Flush pre-check failed for the device." device=/dev/dm-15 error="exit status 1" output="Mar 01 15:46:26 | /dev/dm-15: no usable paths found\n" requestID=6b215e75-4c25-4f94-a61e-80de5501b955 requestSource=C
SI
time="2023-03-01T14:46:28Z" level=error msg="Flush pre-check failed for the device." device=/dev/dm-15 error="exit status 1" output="Mar 01 15:46:28 | /dev/dm-15: no usable paths found\n" requestID=6b215e75-4c25-4f94-a61e-80de5501b955 requestSource=C
SI
time="2023-03-01T14:46:30Z" level=error msg="Flush pre-check failed for the device." device=/dev/dm-15 error="exit status 1" output="Mar 01 15:46:30 | /dev/dm-15: no usable paths found\n" requestID=6b215e75-4c25-4f94-a61e-80de5501b955 requestSource=C
SI
time="2023-03-01T14:46:32Z" level=error msg="Flush pre-check failed for the device." device=/dev/dm-15 error="exit status 1" output="Mar 01 15:46:32 | /dev/dm-15: no usable paths found\n" requestID=6b215e75-4c25-4f94-a61e-80de5501b955 requestSource=C
SI
time="2023-03-01T14:46:36Z" level=error msg="Flush pre-check failed for the device." device=/dev/dm-15 error="exit status 1" output="Mar 01 15:46:36 | /dev/dm-15: no usable paths found\n" requestID=6b215e75-4c25-4f94-a61e-80de5501b955 requestSource=C
SI
time="2023-03-01T14:46:36Z" level=error msg="failed to unstage volume" requestID=6b215e75-4c25-4f94-a61e-80de5501b955 requestSource=CSI
time="2023-03-01T14:46:36Z" level=error msg="GRPC error: rpc error: code = Internal desc = multipath device is unavailable" requestID=6b215e75-4c25-4f94-a61e-80de5501b955 requestSource=CSI
time="2023-03-01T14:46:36Z" level=error msg="Flush pre-check failed for the device." device=/dev/dm-12 error="exit status 1" output="Mar 01 15:46:36 | /dev/dm-12: no usable paths found\n" requestID=0bd2f987-ee50-4a85-b3d1-91748141ae54 requestSource=C
SI
time="2023-03-01T14:46:37Z" level=error msg="Flush pre-check failed for the device." device=/dev/dm-12 error="exit status 1" output="Mar 01 15:46:37 | /dev/dm-12: no usable paths found\n" requestID=0bd2f987-ee50-4a85-b3d1-91748141ae54 requestSource=C
SI
time="2023-03-01T14:46:39Z" level=error msg="Flush pre-check failed for the device." device=/dev/dm-12 error="exit status 1" output="Mar 01 15:46:39 | /dev/dm-12: no usable paths found\n" requestID=0bd2f987-ee50-4a85-b3d1-91748141ae54 requestSource=C
SI
time="2023-03-01T14:46:41Z" level=error msg="Flush pre-check failed for the device." device=/dev/dm-12 error="exit status 1" output="Mar 01 15:46:41 | /dev/dm-12: no usable paths found\n" requestID=0bd2f987-ee50-4a85-b3d1-91748141ae54 requestSource=C
SI
time="2023-03-01T14:46:44Z" level=error msg="Flush pre-check failed for the device." device=/dev/dm-12 error="exit status 1" output="Mar 01 15:46:44 | /dev/dm-12: no usable paths found\n" requestID=0bd2f987-ee50-4a85-b3d1-91748141ae54 requestSource=CSI
time="2023-03-01T14:46:47Z" level=error msg="Flush pre-check failed for the device." device=/dev/dm-12 error="exit status 1" output="Mar 01 15:46:47 | /dev/dm-12: no usable paths found\n" requestID=0bd2f987-ee50-4a85-b3d1-91748141ae54 requestSource=C
SI
This is the logging from the trident-node-linux pod on the faulty host.
I'm in the process of getting access to NetApp Support, but since Trident Operator is open-source, I was hoping to get some help here.
@nheinemans, thanks for the additional information. @ntap-arorar looked at the issue and thinks it could be one of the following possibilities:
- One of the iSCSI session paths was in a bad state (possibly a networking issue).
- The LUN was already deleted from the storage controller.
- The igroup settings were changed manually, cutting off the node's access to the volume.
- The volume path and/or mount was removed from the node manually, without Trident's knowledge.
If one of the above situations existed, then when the pod was deleted, the umount and forced umount failed in the Trident Node Plugin, and NodeUnstage kept retrying to flush the device until it hit the 6-minute timeout of the retry logic.
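For what it's worth, the "Flush pre-check failed" messages mean the map no longer had any usable paths, so the flush could never succeed. You can reproduce that check by hand with something along these lines (assuming your multipath-tools build supports the -C option):
# Check whether the device still has any usable paths; a failure here matches
# the "no usable paths found" output seen in the node pod log
multipath -C /dev/dm-12; echo "exit status: $?"
# Show the state of every path currently known to multipathd
multipathd show paths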
We suggest opening a NetApp support case to help determine how your environment ended up in this situation and to get advice on how to clean it up.
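In the meantime, if you have already confirmed on the storage controller that the LUN behind the stale map is gone, manual cleanup typically looks roughly like the following. This is generic device-mapper/iSCSI housekeeping rather than a Trident-specific procedure, so double-check the WWID and device names before removing anything:
# Flush the stale multipath map (this fails if the map is still in use)
multipath -f 3600a0980383147586e5d536f7839776d
# Delete the orphaned SCSI path devices
echo 1 > /sys/block/sdb/device/delete
echo 1 > /sys/block/sdc/device/delete
# Rescan the iSCSI sessions so the kernel's view matches the controller
iscsiadm -m session --rescan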