aws-efs-csi-driver
Missing efs.csi.aws.com-reg.sock file on EKS Node.
/kind bug
What happened? When deploying the aws-efs-csi-driver helm chart, some pods of the efs-csi-node daemonset get stuck in a CrashLoopBackOff state as they spin up. The logs for the efs-plugin container look normal:
I1201 20:11:53.735781 1 config_dir.go:88] Creating symlink from '/etc/amazon/efs' to '/var/amazon/efs'
I1201 20:11:53.736836 1 metadata.go:63] getting MetadataService...
I1201 20:11:53.738274 1 metadata.go:68] retrieving metadata from EC2 metadata service
I1201 20:11:53.831426 1 driver.go:140] Did not find any input tags.
I1201 20:11:53.831724 1 driver.go:113] Registering Node Server
I1201 20:11:53.831742 1 driver.go:115] Registering Controller Server
I1201 20:11:53.831752 1 driver.go:118] Starting efs-utils watchdog
I1201 20:11:53.831846 1 efs_watch_dog.go:221] Skip copying /etc/amazon/efs/efs-utils.conf since it exists already
I1201 20:11:53.831860 1 efs_watch_dog.go:221] Skip copying /etc/amazon/efs/efs-utils.crt since it exists already
I1201 20:11:53.832163 1 driver.go:124] Starting reaper
I1201 20:11:53.832182 1 driver.go:127] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
But the logs for the csi-driver-registrar container show only:
/usr/bin/csi-node-driver-registrar: error while loading shared libraries: libdl.so.2: cannot open shared object file: No such file or directory
Likewise, the logs for the liveness-probe container show only:
/usr/bin/livenessprobe: error while loading shared libraries: libdl.so.2: cannot open shared object file: No such file or directory
Looking at the nodes that the failing pods are running on, I've discovered that they do not have the /var/lib/kubelet/plugins_registry/efs.csi.aws.com-reg.sock file.
The nodes whose daemonset pods do spin up properly do have this file. I'm unsure why it is missing on some nodes, and I don't know how to configure the helm chart to ensure that it gets created.
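As far as I understand, that -reg.sock file is created by the csi-driver-registrar sidecar itself: on startup it opens a unix socket under /var/lib/kubelet/plugins_registry/ and serves the kubelet plugin-registration API over it, so if the registrar binary dies immediately (as the libdl.so.2 error above suggests), the socket never appears on the node. The sketch below is only meant to illustrate that mechanism, assuming the standard kubelet plugin registration layout; the paths, constants, and wiring here are illustrative and not the actual csi-node-driver-registrar source.

```go
// Minimal sketch of what a node-driver-registrar-style sidecar does on startup.
// Illustrative only; not the actual csi-node-driver-registrar code.
package main

import (
	"context"
	"log"
	"net"
	"os"

	"google.golang.org/grpc"
	registerapi "k8s.io/kubelet/pkg/apis/pluginregistration/v1"
)

const (
	// Kubelet watches this directory and dials any socket that appears in it.
	regSock    = "/var/lib/kubelet/plugins_registry/efs.csi.aws.com-reg.sock"
	driverName = "efs.csi.aws.com"
	// Host path of the driver's own CSI endpoint (the /csi/csi.sock seen in the efs-plugin logs).
	csiEndpoint = "/var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock"
)

// registrationServer answers kubelet's plugin-registration calls.
type registrationServer struct{}

func (registrationServer) GetInfo(ctx context.Context, _ *registerapi.InfoRequest) (*registerapi.PluginInfo, error) {
	return &registerapi.PluginInfo{
		Type:              registerapi.CSIPlugin,
		Name:              driverName,
		Endpoint:          csiEndpoint,
		SupportedVersions: []string{"1.0.0"},
	}, nil
}

func (registrationServer) NotifyRegistrationStatus(ctx context.Context, status *registerapi.RegistrationStatus) (*registerapi.RegistrationStatusResponse, error) {
	if !status.PluginRegistered {
		log.Printf("kubelet rejected registration: %s", status.Error)
	}
	return &registerapi.RegistrationStatusResponse{}, nil
}

func main() {
	// Remove any stale socket from a previous run, then listen.
	// net.Listen is the call that actually creates the -reg.sock file on the node,
	// so the file only exists while a healthy registrar process is running.
	_ = os.Remove(regSock)
	lis, err := net.Listen("unix", regSock)
	if err != nil {
		log.Fatalf("failed to listen on %s: %v", regSock, err)
	}
	srv := grpc.NewServer()
	registerapi.RegisterRegistrationServer(srv, registrationServer{})
	log.Printf("serving plugin registration on %s", regSock)
	if err := srv.Serve(lis); err != nil {
		log.Fatal(err)
	}
}
```

If that reading is right, the missing socket is a symptom of the registrar container failing rather than a separate configuration problem in the chart.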
What you expected to happen? I expect all of the pods in the efs-csi-node daemonset to spin up properly.
How to reproduce it (as minimally and precisely as possible)?
This is unpredictable. I can fix the issue by destroying a node, and when a new node spins up, the /var/lib/kubelet/plugins_registry/efs.csi.aws.com-reg.sock file exists and the pods work as expected.
Anything else we need to know?:
Environment
- Kubernetes version (use kubectl version): v1.27.7-eks-4f4795d
- Driver version: 2.5.0
Bumping this. Any info would be very helpful. Seeing this a lot.
In my case, pods are not initializing due to a FailedMount event. When I connected to the node and checked /var/lib/kubelet/plugins_registry/, it did not have the "efs.csi.aws.com-reg.sock" file, even though the logs of the csi-driver-registrar container look normal. For some context on the EKS cluster: I have one static worker node, a second node is created dynamically, and the efs-csi-node daemonset does the required setup on it.
Also, if all the workloads on the static worker node are removed and a new node is then created dynamically, the "efs.csi.aws.com-reg.sock" file is created properly and the volume mounts successfully.
I have the same setup in a different cluster, and it works fine there.