Multi-Attach Error in Kubernetes Cluster using Trident iSCSI Multipath as storage solution
Describe the Error We have a cluster that uses Trident as the storage solution for PersistentVolumes. Unfortunately, we often get an error that looks like one of the following:
Unable to attach or mount volumes: unmounted volumes=[xxx], unattached volumes=[xxx]: timed out waiting for the condition
or
Multi-Attach error for volume "pvc-xxx" Volume is already exclusively attached to one node and can't be attached to another
Usually it happens after we drain nodes for host or Kubernetes updates, and sometimes when expanding a volume. We have also observed it when simply deleting a pod with a PV attached, which then moves to another node, at the "wrong" time. We still can't figure out what that wrong time is, but the load average in the latter case was acceptable, at around 1 per CPU at most.
Our current solution is to kill the pod a bunch of times. We tried kubectl rollout restart, kubectl scale sts/deploy --replicas=0 and then scaling back up to the desired number, waiting for around half an hour, deleting the respective trident-csi pod (and also all of them), ...
Nothing works except killing the pod (not forcefully), waiting 5 minutes, and, if it is still not running, killing it again.
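For reference, this is roughly that workaround written out as a sketch (the namespace and label selector are placeholders, not our real names):

```bash
#!/bin/sh
# Sketch of our manual workaround: delete the stuck pod (not forcefully), give the
# attach/detach controller ~5 minutes, and delete again if the pod is still not Running.
NS="my-namespace"          # placeholder
SELECTOR="app=my-app"      # placeholder

kubectl delete pod -n "$NS" -l "$SELECTOR" --wait=false
sleep 300
PHASE=$(kubectl get pod -n "$NS" -l "$SELECTOR" -o jsonpath='{.items[0].status.phase}' 2>/dev/null)
if [ "$PHASE" != "Running" ]; then
  kubectl delete pod -n "$NS" -l "$SELECTOR" --wait=false
fi
```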
Environment
- Trident version: [22.07] with values as in trident.txt
- Protocol: iscsi multipath
- Worker node preparation done as suggested in: https://docs.netapp.com/us-en/trident-2207/trident-use/worker-node-prep.html
- Relevant Trident logs: tridentlogs.txt
- kubectl get tbc ontap-san is in PHASE bound and STATUS Success
- kubectl get tbe has a backend-uid
- On-premises Kubernetes: [v1.22.9]
- Ubuntu [20.04.5 LTS]
- 4 Workers
- 200 Pods in total
- 42 PVs from 1Gi up to 300Gi
- Each Worker has 32GB Memory and 8 CPU
- Docker: [v20.10.12+]
- iscsiadm: [v2.0-874]
- iscsid.conf: see attached file iscsid.txt
- multipath-tools [v0.8.3]
- multipath.conf: see attached file multipath.txt
- journalctl shows no problems (the checks we run when the error appears are sketched below)
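These are roughly the checks behind the points above; the backend name and the trident namespace are from our setup, and the pod/PVC names are placeholders:

```bash
# Trident backend health (as listed above)
kubectl get tbc ontap-san -n trident                 # expect PHASE Bound, STATUS Success
kubectl get tbe -n trident                           # expect a backend UUID

# Where Kubernetes thinks the volume is still attached
kubectl get volumeattachment | grep <pvc-name>       # shows the node still holding the attachment
kubectl describe pod <pod-name> -n <namespace>       # events show the Multi-Attach / mount timeout

# Node-side iSCSI/multipath logs
journalctl -u iscsid -u multipathd --since "1 hour ago"
```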
To Reproduce As stated, we observed this when draining a node for updates. It can also happen when expanding a volume, or seemingly at random when deleting a pod with a PV attached that then moves to another node.
Expected behavior When a pod with a PV moves to another node, for example after draining the node or deleting the pod, the volume should attach without the errors mentioned above.
@Ujkugri, have you figured out the root cause of your issue? I just faced a similar issue with Trident 24.02.0. It occurred at the same time as an update to a new ONTAP release.
Hi @phhutter,
we actually solved it. Apparently our initial script had the problem that it did not generate a unique IQN per node. We fixed this by creating a one-shot systemd service that enforces this.
I don't think this helps you specifically, but we researched quite a lot. Maybe if you give more details, I can help, but I cannot guarantee it.
@Ujkugri Can you explain what this initial script is and what you did to solve it? I think we might be facing this issue as well.
Much appreciated!
@tijmenvandenbrink What I meant by "initial script" was an Ansible script that prepared the nodes; it just installed the necessary services and enabled them. We added a one-shot systemd service that gives each node a unique IQN. I am not sure it was exactly this command, since I think there were parameters, but I believe it was just "iscsi-iname". Depending on how you build your systemd service, you might need to restart the multipath and iscsi services afterwards.
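In case it helps, here is a minimal sketch of what such a unit could look like. The unit name, the marker-file path and the exact list of services to restart are assumptions from memory, not a copy of our playbook:

```bash
# Sketch: one-shot unit that writes a freshly generated IQN once per node.
# Unit name and marker file (/etc/iscsi/.iqn-generated) are made up for this example.
cat <<'EOF' >/etc/systemd/system/unique-iqn.service
[Unit]
Description=Generate a unique iSCSI initiator IQN for this node
Before=iscsid.service open-iscsi.service
ConditionPathExists=!/etc/iscsi/.iqn-generated

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo "InitiatorName=$(/sbin/iscsi-iname)" > /etc/iscsi/initiatorname.iscsi && touch /etc/iscsi/.iqn-generated'

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now unique-iqn.service
# The new IQN is only picked up after the iSCSI/multipath services restart (or after a reboot):
systemctl restart iscsid multipathd
```

Keep in mind that changing the IQN on a node with active iSCSI sessions breaks those sessions, so this should only run before the node starts serving volumes (hence the one-shot with a marker file).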
@Ujkugri Please let us know if this has been resolved. If so, please close this issue.