
trident-csi pods stuck in ContainerCreating after node reboots

Open • gorantornqvist opened this issue 3 years ago • 7 comments

Describe the bug
Multiple trident-csi pods are stuck in ContainerCreating after node reboots with the error:

Generated from kubelet on node: 2 times in the last 3 minutes (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[trident-csi-token-z59nn], unattached volumes=[pods-mount-dir dev-dir host-dir trident-tracking-dir plugin-dir sys-dir certs trident-csi-token-z59nn plugins-mount-dir registration-dir]: timed out waiting for the condition

Generated from kubelet on node 19 times in the last 23 minutes MountVolume.SetUp failed for volume "trident-csi-token-z59nn" : secret "trident-csi-token-z59nn" not found

If I delete the pod, a new trident-csi pod is created and starts OK, but without manual intervention the original pod hangs forever and other pods on that node that use Trident persistent storage fail to start.

I also noted that while it hangs, the pod references a secret, trident-csi-token-z59nn, that doesn't exist; after I manually delete the pod and it starts up, the new pod references another secret that actually exists.
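
To illustrate the mismatch, a quick check I used (a sketch, assuming the default helm install into the trident namespace; the pod name is a placeholder):

```
# Token secret referenced by the stuck pod (namespace and pod name are placeholders)
kubectl -n trident get pod <stuck-trident-csi-pod> -o yaml | grep secretName

# Token secrets that actually exist for the trident-csi service account
kubectl -n trident get secrets | grep trident-csi-token
```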

Environment: OpenShift 4.7.13

  • Trident version: 21.04.0
  • Trident installation flags used: default install using helm
  • Kubernetes version: v1.20.0+df9c838
  • Kubernetes orchestrator: Openshift 4.7.13
  • NetApp backend types: ONTAP-NAS

To Reproduce
Steps to reproduce the behavior: reboot a node.

Expected behavior
The trident-csi pod starts using the correct secret.


gorantornqvist avatar Jun 11 '21 13:06 gorantornqvist

Hello @gorantornqvist

Thanks for reporting this issue. To give you some background, the token secret trident-csi-token-z59nn is created when Trident creates a service account named trident-csi. The Trident deployment and daemonset pods use the service account token for API authentication. The behaviour in Kubernetes is that if a service account is re-created, the corresponding token is refreshed, but pods using the old token are not automatically updated. So Trident automatically re-creates the Trident deployment and daemonset pods when the service account is re-created.
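
As a rough check (a sketch, assuming the default trident namespace; on Kubernetes 1.20 the token secret is still listed on the service account object), you can compare the service account's current token with what the pods mount:

```
# Token secret currently attached to the trident-csi service account
kubectl -n trident get serviceaccount trident-csi -o jsonpath='{.secrets[*].name}{"\n"}'

# Token secrets mounted by the running trident-csi pods
kubectl -n trident get pods -o yaml | grep 'secretName: trident-csi-token'
```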

In your case, I am trying to understand:

  1. Was the service account trident-csi re-created? a. If yes, was it before, during, or after the node reboot? b. If not, can you consistently reproduce the behaviour, and does it involve just rebooting the Kubernetes node?
  2. The Trident operator logs may also be useful in getting some insights; if you can share them here, on Slack, or via a support case, that would help as well. You can collect them using kubectl -n <trident installation namespace> logs <trident_operator_pod> (see the sketch after this list).
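
For example (a sketch, assuming the default trident installation namespace; adjust -n and the pod name to your environment):

```
# Find the operator pod and capture its logs
kubectl -n trident get pods | grep trident-operator
kubectl -n trident logs <trident_operator_pod> > trident-operator.log

# Recent events can also show the secret/mount failures around the reboot
kubectl -n trident get events --sort-by=.lastTimestamp
```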

Please let us know.

Thank you!

rohit-arora-dev avatar Jun 11 '21 15:06 rohit-arora-dev

Reincarnation of #444 ?

megabreit avatar Jun 11 '21 23:06 megabreit

@gorantornqvist, can you provide more information based on @ntap-arorar's comments?

gnarl avatar Jun 14 '21 14:06 gnarl

Hi, nothing was really done with the Trident configuration before this. I actually encountered the same issue on 2 different clusters, but after restarting the pods the issue was resolved. I tried restarting each node in these 2 clusters and the issue didn't occur again; it can't be reproduced.

So I guess this could be hard to troubleshoot.

I am OK with closing this, and if it occurs again I will gather all logs from the Trident operator ...

gorantornqvist avatar Jun 15 '21 12:06 gorantornqvist

@gorantornqvist, thanks for the feedback. We will reopen this issue if you encounter the problem again.

gnarl avatar Jun 21 '21 14:06 gnarl

Hi, we encountered this issue again today when updating 2 different OpenShift clusters. If I deleted a trident-csi pod, it started working (no need for an operator pod restart).

Attaching operator logs from one of the clusters

trident-operator-86c5b968cb-gz6p9.log

gorantornqvist avatar Jul 20 '21 11:07 gorantornqvist

@gorantornqvist, we looked at the provided logs and it seems that the cluster was already in a bad state. Our team hasn't been able to reproduce this issue yet. Please let us know if you are still concerned about this issue.

gnarl avatar Oct 22 '21 19:10 gnarl

@gorantornqvist, were you able to resolve your issue?

gnarl avatar Feb 21 '23 00:02 gnarl

We haven't encountered this problem again, so this issue can be closed :)

gorantornqvist avatar Feb 22 '23 07:02 gorantornqvist

Thanks for the update!

gnarl avatar Feb 22 '23 19:02 gnarl