Unable to attach or mount volumes: unmounted volumes=[...], unattached volumes=[...]: timed out waiting for the condition

Open eswolinsky3241 opened this issue 2 years ago • 6 comments

/kind bug

What happened?

We use EKS to run a distributed task queue that uses the HPA to scale deployments based on the number of tasks in a Redis queue. The pods in these deployments run on an EC2 managed node group. Every pod in the deployment has the same EFS drive attached to access necessary files. We use the efs-csi-node Daemonset, which is managed by the Helm chart. Sometimes we scale up to a lot of pods at once to accommodate a large number of jobs added to the queue. We have started to see this error appear on some of these pods:

Unable to attach or mount volumes: unmounted volumes=[migrant], unattached volumes=[hdf5-cache hydra-log shared-pod-storage kube-api-access-g8kfd migrant archive hobo model-cache]: timed out waiting for the condition

Most of the pods start successfully, but the ones that do show this event are just stuck in a “ContainerCreating” status. We have tried increasing resource requests for the Daemonset, but that has not helped, and the efs-csi-driver container logs do not provide any helpful information. This has become a problem for us, because our deployments never scale to the level we need them to.
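For anyone triaging the same event across many pods, here is a minimal sketch that pulls the two volume lists out of the message quoted above. It assumes the message keeps exactly this shape, with volume names separated by whitespace inside the brackets; the function name `parse_mount_timeout` is hypothetical and not part of the driver or kubectl.

```python
import re

# Matches the two bracketed lists in the event message.
# Assumption: names inside the brackets are whitespace-separated,
# as they are in the event quoted in this issue.
EVENT_PATTERN = re.compile(
    r"unmounted volumes=\[(?P<unmounted>[^\]]*)\], "
    r"unattached volumes=\[(?P<unattached>[^\]]*)\]"
)


def parse_mount_timeout(message):
    """Return the unmounted and unattached volume names, or None if no match."""
    match = EVENT_PATTERN.search(message)
    if match is None:
        return None
    return {
        "unmounted": match.group("unmounted").split(),
        "unattached": match.group("unattached").split(),
    }


message = (
    "Unable to attach or mount volumes: unmounted volumes=[migrant], "
    "unattached volumes=[hdf5-cache hydra-log shared-pod-storage "
    "kube-api-access-g8kfd migrant archive hobo model-cache]: "
    "timed out waiting for the condition"
)
result = parse_mount_timeout(message)
```

In the event above, only `migrant` is in the unmounted list, so that is the volume that failed to mount; the other names are just the rest of the pod's volumes.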

What you expected to happen?

All pods to start with the EFS-backed volume mounted

How to reproduce it (as minimally and precisely as possible)?

Anything else we need to know?:

Environment

  • Kubernetes version (use kubectl version): 1.24
  • Driver version: 1.7.0

Please also attach debug logs to help us better diagnose

eswolinsky3241, Sep 26 '23 18:09

@eswolinsky3241 Could you please provide DEBUG level logs? https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/troubleshooting/README.md

Also, how many pods do you add when this issue starts occurring? If there is any other additional information about your cluster that may help us recreate the issue, please let me know.

seanzatzdev-amazon, Sep 27 '23 14:09

@eswolinsky3241 Have you found the root cause or any solution to this issue?
@seanzatzdev-amazon We are facing the same issue on our EKS cluster v1.27.7-eks-4f4795d, and we have seen it with v1.6.0 and v1.7.2. We see this issue on 2 deployments (~6 pods in total) that use the same SC/PV/PVC to mount an EFS volume. Let me know what other information would be helpful. I'm working on getting some debug logs from efs-csi-driver. Thank you.

sorind-broadsign, Dec 15 '23 19:12

@sorind-broadsign Was never able to root cause it but at some point it just stopped happening without any change on my part. Haven’t seen the error in months.

eswolinsky3241, Dec 15 '23 21:12

@eswolinsky3241 Have you found the root cause or any solution to this issue? @seanzatzdev-amazon We are facing the same issue on our EKS cluster v1.27.7-eks-4f4795d, and we have seen it with v1.6.0 and v1.7.2. We see this issue on 2 deployments (~6 pods in total) that use the same SC/PV/PVC to mount an EFS volume. Let me know what other information would be helpful. I'm working on getting some debug logs from efs-csi-driver. Thank you.

Hey @sorind-broadsign, were you able to resolve it? I'm in the same situation.

rodrilp, Jan 12 '24 15:01

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
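As a rough illustration of the timeline these rules imply, the sketch below adds the three intervals to a last-activity date. It assumes no further activity on the issue after that date; the function name `lifecycle_dates` is hypothetical and is not how the bot itself is implemented.

```python
from datetime import date, timedelta

# Intervals taken from the triage rules listed above.
STALE_AFTER = timedelta(days=90)
ROTTEN_AFTER = timedelta(days=30)
CLOSE_AFTER = timedelta(days=30)


def lifecycle_dates(last_activity):
    """Project the stale, rotten, and close dates, assuming no new activity."""
    stale = last_activity + STALE_AFTER
    rotten = stale + ROTTEN_AFTER
    closed = rotten + CLOSE_AFTER
    return {"stale": stale, "rotten": rotten, "closed": closed}


# Last human comment in this thread was on Jan 12 '24.
projected = lifecycle_dates(date(2024, 1, 12))
```

With the Jan 12 '24 comment as the last activity, the projected stale date is Apr 11 '24, which matches the date of this bot comment.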

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot, Apr 11 '24 15:04