Multi-Attach Error in Kubernetes Cluster using Trident iSCSI Multipath as storage solution
Describe the Error We have a cluster that uses Trident as the storage solution for PersistentVolumes. Unfortunately, we often get an error that looks like one of the following:
Unable to attach or mount volumes: unmounted volumes=[xxx], unattached volumes=[xxx]: timed out waiting for the condition
or
Multi-Attach error for volume "pvc-xxx" Volume is already exclusively attached to one node and can't be attached to another
Usually it happens after we drain nodes for host or Kubernetes updates, and sometimes when expanding a volume. We have also observed it when simply deleting a pod with a PV attached, which then moves to another node, at the "wrong" time. We still can't figure out what that wrong time is, but the load average in the latter case was acceptable, at around 1 per CPU at most.
Our current solution is to kill the pod a bunch of times. We tried kubectl rollout restart, kubectl scale sts/deploy --replicas=0 and then scaling back up to the desired number, waiting for around half an hour, deleting the respective trident-csi pod (and also all of them), ...
Nothing works except killing the pod (not forcefully), waiting 5 minutes, and, if it is still not running, killing it again.
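For reference, this is roughly that workaround written out as a sketch (the namespace and label selector are placeholders, not our real names):

```bash
#!/bin/sh
# Sketch of our manual workaround: delete the stuck pod (not forcefully), give the
# attach/detach controller ~5 minutes, and delete again if the pod is still not Running.
NS="my-namespace"          # placeholder
SELECTOR="app=my-app"      # placeholder

kubectl delete pod -n "$NS" -l "$SELECTOR" --wait=false
sleep 300
PHASE=$(kubectl get pod -n "$NS" -l "$SELECTOR" -o jsonpath='{.items[0].status.phase}' 2>/dev/null)
if [ "$PHASE" != "Running" ]; then
  kubectl delete pod -n "$NS" -l "$SELECTOR" --wait=false
fi
```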
Environment
- Trident version: [22.07] with values as in trident.txt
- Protocol: iscsi multipath
- Worker node preparation done as suggested in: https://docs.netapp.com/us-en/trident-2207/trident-use/worker-node-prep.html
- Relevant Trident logs: tridentlogs.txt
- kubectl get tbc ontap-san is in PHASE bound and STATUS Success
- kubectl get tbe has a backend-uid
- On-premises Kubernetes: [v1.22.9]
- Ubuntu [20.04.5 LTS]
- 4 Workers
- 200 Pods in total
- 42 PVs from 1Gi up to 300Gi
- Each Worker has 32GB Memory and 8 CPU
- Docker: [v20.10.12+]
- iscsiadm: [v2.0-874]
- iscsid.conf: see attached file iscsid.txt
- multipath-tools [v0.8.3]
- multipath.conf: see attached file multipath.txt
- journalctl shows no problems (the checks we run when the error appears are sketched below)
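These are roughly the checks behind the points above; the backend name and the trident namespace are from our setup, and the pod/PVC names are placeholders:

```bash
# Trident backend health (as listed above)
kubectl get tbc ontap-san -n trident                 # expect PHASE Bound, STATUS Success
kubectl get tbe -n trident                           # expect a backend UUID

# Where Kubernetes thinks the volume is still attached
kubectl get volumeattachment | grep <pvc-name>       # shows the node still holding the attachment
kubectl describe pod <pod-name> -n <namespace>       # events show the Multi-Attach / mount timeout

# Node-side iSCSI/multipath logs
journalctl -u iscsid -u multipathd --since "1 hour ago"
```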
To Reproduce As stated, we observed this when draining a node for updates. It can also happen when expanding a volume, or seemingly at random when deleting a pod with a PV attached that then moves to another node.
Expected behavior When a pod with a PV moves to another node, for example after draining the node or deleting the pod, the volume should attach without the errors mentioned above.
@Ujkugri, have you figured out the root cause of your issue? I just faced a similar issue with Trident 24.02.0. It occurred at the same time as an update to a new ONTAP release.
Hi @phhutter,
we actually solved it. Apparently our initial script had the problem that it did not generate a unique IQN per node. We fixed this by creating a one-shot systemd service that enforces this.
I don't think this helps you specifically, but we researched quite a lot. Maybe if you give more details, I can help, but I cannot guarantee it.
@Ujkugri Can you explain what this initial script is and what you did to solve it? I think we might be facing this issue as well.
Much appreciated!
@tijmenvandenbrink What I meant by "initial script" was an Ansible script that prepared the nodes; it just installed the necessary services and enabled them. We added a one-shot systemd service that gives each node a unique IQN. I am not sure it was exactly this command, since I think there were parameters, but I believe it was just "iscsi-iname". Depending on how you build your systemd service, you might need to restart the multipath and iscsi services afterwards.
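In case it helps, here is a minimal sketch of what such a unit could look like. The unit name, the marker-file path and the exact list of services to restart are assumptions from memory, not a copy of our playbook:

```bash
# Sketch: one-shot unit that writes a freshly generated IQN once per node.
# Unit name and marker file (/etc/iscsi/.iqn-generated) are made up for this example.
cat <<'EOF' >/etc/systemd/system/unique-iqn.service
[Unit]
Description=Generate a unique iSCSI initiator IQN for this node
Before=iscsid.service open-iscsi.service
ConditionPathExists=!/etc/iscsi/.iqn-generated

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo "InitiatorName=$(/sbin/iscsi-iname)" > /etc/iscsi/initiatorname.iscsi && touch /etc/iscsi/.iqn-generated'

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now unique-iqn.service
# The new IQN is only picked up after the iSCSI/multipath services restart (or after a reboot):
systemctl restart iscsid multipathd
```

Keep in mind that changing the IQN on a node with active iSCSI sessions breaks those sessions, so this should only run before the node starts serving volumes (hence the one-shot with a marker file).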
@Ujkugri Please let us know if this has been resolved. If so, please close this issue.