
Missing efs.csi.aws.com-reg.sock file on EKS Node.

Open ryanhockstad opened this issue 1 year ago • 4 comments

/kind bug

What happened? When deploying the aws-efs-csi-driver helm chart, as the efs-csi-node daemonset spins up, certain pods get stuck in a CrashLoopBackOff state. The logs for the efs-plugin container look normal:

I1201 20:11:53.735781       1 config_dir.go:88] Creating symlink from '/etc/amazon/efs' to '/var/amazon/efs'
I1201 20:11:53.736836       1 metadata.go:63] getting MetadataService...
I1201 20:11:53.738274       1 metadata.go:68] retrieving metadata from EC2 metadata service
I1201 20:11:53.831426       1 driver.go:140] Did not find any input tags.
I1201 20:11:53.831724       1 driver.go:113] Registering Node Server
I1201 20:11:53.831742       1 driver.go:115] Registering Controller Server
I1201 20:11:53.831752       1 driver.go:118] Starting efs-utils watchdog
I1201 20:11:53.831846       1 efs_watch_dog.go:221] Skip copying /etc/amazon/efs/efs-utils.conf since it exists already
I1201 20:11:53.831860       1 efs_watch_dog.go:221] Skip copying /etc/amazon/efs/efs-utils.crt since it exists already
I1201 20:11:53.832163       1 driver.go:124] Starting reaper
I1201 20:11:53.832182       1 driver.go:127] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}

But the logs for the csi-driver-registrar container just show /usr/bin/csi-node-driver-registrar: error while loading shared libraries: libdl.so.2: cannot open shared object file: No such file or directory

Likewise, the logs for the liveness-probe are just: /usr/bin/livenessprobe: error while loading shared libraries: libdl.so.2: cannot open shared object file: No such file or directory
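An "error while loading shared libraries" message means the sidecar binary could not resolve one of its dynamic dependencies (here libdl.so.2) inside its container image. As a diagnostic sketch, ldd lists which shared objects a binary resolves; in the failing pod you would point it at /usr/bin/csi-node-driver-registrar (the path /bin/ls below is just a stand-in so the snippet runs anywhere):

```shell
#!/bin/sh
# Diagnostic sketch: count unresolved shared-library dependencies of a
# binary. Inside the failing container you would call this with
# /usr/bin/csi-node-driver-registrar; /bin/ls is a stand-in here.
count_missing_libs() {
  # grep -c prints the number of "not found" lines (0 if none)
  ldd "$1" 2>/dev/null | grep -c "not found"
}

# "|| true" keeps the pipeline from failing when the count is 0,
# since grep exits non-zero when it matches nothing.
missing=$(count_missing_libs /bin/ls || true)
echo "missing shared libraries: $missing"
```

A non-zero count for the registrar binary would confirm an image/base-image mismatch rather than a problem with the driver's configuration.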

Looking at the nodes the failing pods are running on, I've discovered that they do not have the /var/lib/kubelet/plugins_registry/efs.csi.aws.com-reg.sock file.

The pods in the daemonset that do spin up properly do have this file. I'm unsure why this file is missing on some nodes, and I don't know how to configure the helm chart to ensure that this file gets created.
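For background (a hedged sketch of the typical csi-node-driver-registrar wiring, not necessarily the exact output of this chart version): the efs.csi.aws.com-reg.sock file is created by the registrar sidecar inside its /registration mount, which the daemonset hostPath-mounts from /var/lib/kubelet/plugins_registry on the node. If the registrar crashes on startup, as in the logs above, the socket is never created:

```yaml
# Hedged sketch of the registrar sidecar wiring that produces the
# *-reg.sock file; image tag and paths follow the common upstream
# pattern and may differ from your rendered manifest.
containers:
  - name: csi-driver-registrar
    image: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.9.0
    args:
      - --csi-address=$(ADDRESS)
      - --kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)
    env:
      - name: ADDRESS
        value: /csi/csi.sock
      - name: DRIVER_REG_SOCK_PATH
        value: /var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock
    volumeMounts:
      - name: registration-dir
        mountPath: /registration
volumes:
  - name: registration-dir
    hostPath:
      path: /var/lib/kubelet/plugins_registry/
      type: Directory
```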

What you expected to happen? I expect all of the pods in the efs-csi-node daemonset to spin up properly.

How to reproduce it (as minimally and precisely as possible)? This is unpredictable. I can fix the issue by destroying a node, and when a new node spins up, the /var/lib/kubelet/plugins_registry/efs.csi.aws.com-reg.sock file exists and the pods work as expected.
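Since the problem is intermittent, a quick check for affected nodes is to test for the registration socket directly (a sketch, assuming shell access to the node, e.g. via SSM or kubectl debug; the path is the one from this report):

```shell
#!/bin/sh
# Sketch: report whether the kubelet registration socket for the EFS
# CSI driver exists at the given path. "-S" is true only for sockets,
# so a stale regular file at the same path still reports "missing".
check_reg_sock() {
  if [ -S "$1" ]; then
    echo "present"
  else
    echo "missing"
  fi
}

check_reg_sock /var/lib/kubelet/plugins_registry/efs.csi.aws.com-reg.sock
```

Nodes that print "missing" are the ones whose efs-csi-node pods end up in CrashLoopBackOff here.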

Anything else we need to know?:

Environment

  • Kubernetes version (use kubectl version): v1.27.7-eks-4f4795d
  • Driver version: 2.5.0

ryanhockstad avatar Dec 01 '23 20:12 ryanhockstad

Bumping this. Any info would be very helpful. Seeing this a lot.

michaelajr avatar Jan 31 '24 19:01 michaelajr

In my case, pods are not initializing due to a FailedMount event. When I connected to the node and checked /var/lib/kubelet/plugins_registry/, it did not have the efs.csi.aws.com-reg.sock file, even though the logs of the csi-driver-registrar container look normal. For context on the EKS cluster: I have one static worker node, a second node is created dynamically, and the efs-csi-node daemonset does the required setup.

Also, if all the workloads are removed from the static worker node and a new node is then created dynamically, the efs.csi.aws.com-reg.sock file is created properly and the volume mounts successfully.

The same setup works fine in a different cluster.

sasanknvs avatar Mar 04 '24 07:03 sasanknvs