trident-csi pods stuck in ContainerCreating after node reboots
Describe the bug
Multiple trident-csi pods are stuck in ContainerCreating after node reboots with the error:
Generated from kubelet on node: 2 times in the last 3 minutes (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[trident-csi-token-z59nn], unattached volumes=[pods-mount-dir dev-dir host-dir trident-tracking-dir plugin-dir sys-dir certs trident-csi-token-z59nn plugins-mount-dir registration-dir]: timed out waiting for the condition
Generated from kubelet on node 19 times in the last 23 minutes MountVolume.SetUp failed for volume "trident-csi-token-z59nn" : secret "trident-csi-token-z59nn" not found
If I delete the pod, a new trident-csi pod is created and starts OK, but without manual intervention the original pod hangs forever and other pods on that node that use Trident persistent storage fail to start.
I also noted that while it hangs it references a secret trident-csi-token-z59nn that doesn't exist; after I manually delete the pod and it starts up, the new pod references another secret that actually exists.
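Something like the following shows the mismatch and the workaround (commands are approximate and assume Trident is installed in the trident namespace; the pod name is a placeholder):

# Which token secret is the stuck pod trying to mount?
kubectl -n trident get pod <stuck-trident-csi-pod> -o yaml | grep secretName

# Which trident-csi token secrets actually exist?
kubectl -n trident get secrets | grep trident-csi-token

# Workaround: delete the stuck pod so the daemonset re-creates it with the current token
kubectl -n trident delete pod <stuck-trident-csi-pod>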
Environment: OpenShift 4.7.13
- Trident version: 21.04.0
- Trident installation flags used: default install using helm
- Kubernetes version: v1.20.0+df9c838
- Kubernetes orchestrator: OpenShift 4.7.13
- NetApp backend types: ONTAP-NAS
To Reproduce
Steps to reproduce the behavior: reboot a node.
Expected behavior
The trident-csi pod starts up using the correct secret.
Hello @gorantornqvist
Thanks for reporting this issue. To give you some background, the secret token trident-csi-token-z59nn is created when Trident creates a service account named trident-csi. The Trident deployment and daemonset pods use the service account token for API authentication.
The behaviour in Kubernetes is that if a service account is re-created, the corresponding token is refreshed, but pods using the old token are not automatically updated. So Trident automatically re-creates the Trident deployment and daemonset pods when the service account is re-created.
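As a concrete illustration (assuming the default trident installation namespace, and a Kubernetes version such as 1.20 that still auto-generates service account token secrets), you can see which token secret the service account currently points to:

# Token secret currently associated with the trident-csi service account
kubectl -n trident get serviceaccount trident-csi -o jsonpath='{.secrets[*].name}'

A stuck pod that mounts a differently named trident-csi-token-* secret is still holding a token from an earlier incarnation of the service account.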
In your case, I am trying to understand:
- Was the service account trident-csi re-created? (One way to check this is shown after this list.)
  a. If yes, was it before the node reboot, during the node reboot, or after the node reboot?
  b. If not, can you consistently reproduce the behaviour, and does it involve just rebooting the Kubernetes node?
- The Trident operator logs may also be useful in getting some insights; if you can share them here, on Slack, or via a support case, that would help as well. Use kubectl -n <trident installation namespace> logs <trident_operator_pod>.
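For example (again assuming the trident namespace; the operator pod name is a placeholder):

# When was the current trident-csi service account created?
kubectl -n trident get serviceaccount trident-csi -o jsonpath='{.metadata.creationTimestamp}'

# Save the operator logs to a file for attaching here
kubectl -n trident logs <trident_operator_pod> > trident-operator.log

A creation timestamp close to or after the node reboot would suggest the service account was re-created around that time.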
Please let us know.
Thank you!
Reincarnation of #444?
@gorantornqvist, can you provide more information based on @ntap-arorar's comments?
Hi, nothing was really done with the Trident configuration before this. I actually encountered the same issue on 2 different clusters, but after the restart of the pods the issue was resolved. I tried restarting each node in these 2 clusters and the issue didn't occur again - it can't be reproduced.
So I guess this could be hard to troubleshoot.
I am OK with closing this, and if it occurs again I will gather all logs from the Trident operator ...
@gorantornqvist, thanks for the feedback. We will reopen this issue if you encounter the problem again.
Hi, we encountered this issue again today when updating 2 different OpenShift clusters. If I deleted a trident-csi pod, it started working (no need for an operator pod restart).
Attaching operator logs from one of the clusters
@gorantornqvist, we looked at the provided logs and it seems the cluster was already in a bad state. Our team hasn't been able to reproduce this issue yet. Please let us know if you are still concerned about this issue.
@gorantornqvist, were you able to resolve your issue?
We haven't encountered this problem again, so this issue can be closed :)
Thanks for the update!