aws-efs-csi-driver Unable to attach or mount volumes: unmounted volumes=[...], unattached volumes=[...]: timed out waiting for the condition

/kind bug

What happened?

We use EKS to run a distributed task queue that uses the HPA to scale deployments based on the number of tasks in a Redis queue. The pods in these deployments run on an EC2 managed node group. Every pod in the deployment mounts the same EFS file system to access necessary files. We use the efs-csi-node DaemonSet, which is managed by the Helm chart. Sometimes we scale up to a large number of pods at once to accommodate a large number of jobs added to the queue. We have started to see this error appear on some of these pods:

Unable to attach or mount volumes: unmounted volumes=[migrant], unattached volumes=[hdf5-cache hydra-log shared-pod-storage kube-api-access-g8kfd migrant archive hobo model-cache]: timed out waiting for the condition

Most of the pods start successfully, but the ones that show this event stay stuck in a “ContainerCreating” status. We have tried increasing resource requests for the DaemonSet, but that has not helped, and the efs-csi-driver container logs do not provide any helpful information. This has become a problem for us because our deployments never scale to the level we need.
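For anyone triaging many of these events at once, here is a minimal sketch that splits the event message into its two volume lists. The function name `parse_failed_mount` is hypothetical, and the regular expression assumes the message keeps the exact wording shown above. In the event quoted here, only `migrant` appears in the unmounted list, so that is the volume to look at first.

```python
import re

# Pattern assumes the event wording quoted in this issue:
# "unmounted volumes=[...], unattached volumes=[...]"
_FAILED_MOUNT = re.compile(
    r"unmounted volumes=\[(?P<unmounted>[^\]]*)\], "
    r"unattached volumes=\[(?P<unattached>[^\]]*)\]"
)


def parse_failed_mount(message):
    """Return the two volume lists from a FailedMount-style message.

    Returns None when the message does not match the expected wording.
    """
    match = _FAILED_MOUNT.search(message)
    if match is None:
        return None
    # Volume names inside the brackets are separated by whitespace.
    return {
        "unmounted": match.group("unmounted").split(),
        "unattached": match.group("unattached").split(),
    }


if __name__ == "__main__":
    event = (
        "Unable to attach or mount volumes: unmounted volumes=[migrant], "
        "unattached volumes=[hdf5-cache hydra-log shared-pod-storage "
        "kube-api-access-g8kfd migrant archive hobo model-cache]: "
        "timed out waiting for the condition"
    )
    parsed = parse_failed_mount(event)
    print(parsed["unmounted"])  # prints ['migrant']
```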

What you expected to happen?

All pods to start with the EFS-backed volume mounted

How to reproduce it (as minimally and precisely as possible)?

Anything else we need to know?:

Environment

Kubernetes version (use kubectl version): 1.24
Driver version: 1.7.0

Please also attach debug logs to help us better diagnose

Instructions to gather debug logs can be found here.

Attachment: results.tgz

Sep 26 '23 18:09 eswolinsky3241

@eswolinsky3241 Could you please provide DEBUG level logs? https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/troubleshooting/README.md

Also, how many pods do you add when this issue starts occurring? If there is any other additional information about your cluster that may help us recreate the issue, please let me know.
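One way to put a number on this is to count the pods stuck in ContainerCreating, grouped by node, since efs-csi-node runs one pod per node. The sketch below is only an illustration: `stuck_pods_by_node` is a hypothetical helper, and it assumes its input is the parsed JSON of a pod list (the shape produced by `kubectl get pods -o json`). The sample data in it is made up.

```python
from collections import Counter


def stuck_pods_by_node(pod_list):
    """Count pods with a container waiting on ContainerCreating, per node."""
    counts = Counter()
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        # Collect the waiting reason (if any) of every container in the pod.
        reasons = {
            status.get("state", {}).get("waiting", {}).get("reason")
            for status in statuses
        }
        if "ContainerCreating" in reasons:
            node = pod.get("spec", {}).get("nodeName", "<unscheduled>")
            counts[node] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample in the shape of a pod list; not data from this issue.
    waiting = {"state": {"waiting": {"reason": "ContainerCreating"}}}
    running = {"state": {"running": {}}}
    sample = {
        "items": [
            {"spec": {"nodeName": "node-a"}, "status": {"containerStatuses": [waiting]}},
            {"spec": {"nodeName": "node-a"}, "status": {"containerStatuses": [waiting]}},
            {"spec": {"nodeName": "node-b"}, "status": {"containerStatuses": [running]}},
        ]
    }
    print(dict(stuck_pods_by_node(sample)))
```

To run it against a real cluster, the same function could be fed `json.load(sys.stdin)` with the pod list piped in. If the stuck pods concentrate on a few nodes, that would point at the efs-csi-node pods on those nodes rather than at the file system itself.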

Sep 27 '23 14:09 seanzatzdev-amazon

@eswolinsky3241 Have you found the root cause or any solution to this issue?
@seanzatzdev-amazon We are facing the same issue on our EKS cluster v1.27.7-eks-4f4795d. We have seen it with driver versions v1.6.0 and v1.7.2. It affects 2 deployments (~6 pods in total) that use the same SC/PV/PVC to mount the EFS volume. Let me know what other information would be helpful. I'm working on getting some debug logs from efs-csi-driver. Thank you.

Dec 15 '23 19:12 sorind-broadsign

@sorind-broadsign Was never able to root cause it but at some point it just stopped happening without any change on my part. Haven’t seen the error in months.

Dec 15 '23 21:12 eswolinsky3241

Hey @sorind-broadsign, did you have the opportunity to resolve it? I'm in the same situation.

Jan 12 '24 15:01 rodrilp

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Apr 11 '24 15:04 k8s-triage-robot
