
Failed to create registration probe file error after updating to v2.4.4

Open headyj opened this issue 2 years ago • 14 comments

/kind bug

What happened?

After updating from v2.4.3 to v2.4.4, memory usage seems to have more than doubled on the efs-csi-node pods, and some of them also seem to have memory leaks.

Also seeing this error on the csi-driver-registrar container: "Failed to create registration probe file" err="mkdir /var/lib/kubelet: read-only file system" registrationProbePath="/var/lib/kubelet/plugins/efs.csi.aws.com/registration"

The main consequence seems to be that some pods fail to mount their volumes, with a timed out waiting for the condition error.

How to reproduce it (as minimally and precisely as possible)?

Just update from v2.4.3 to v2.4.4

Environment

  • Kubernetes version (use kubectl version): v1.26 (EKS)
  • Driver version: v2.4.4

headyj avatar Jun 08 '23 12:06 headyj

As with this issue, it seems on my side that rolling back to v2.4.3 is not sufficient: some efs-csi-node pods still had their memory usage growing endlessly (even after a restart). Only draining and replacing the nodes on which these efs-csi-node pods were running seems to solve the issue.

headyj avatar Jun 08 '23 13:06 headyj

Hi @headyj, can you install the latest Helm chart version, 2.4.5, and see whether the issue still persists? If yes, can you share the debugging logs? https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/troubleshooting

mskanth972 avatar Jun 09 '23 13:06 mskanth972

I will not be able to test the memory leak issue, as it's basically breaking our environments. What I can see is that despite the Failed to create registration probe file error, draining and replacing the nodes seems to fix the memory leak issue.

What I can tell you, though, is that I still have the Failed to create registration probe file error with 2.4.5. I tried to set logging_level to DEBUG as explained in the troubleshooting guide, but it doesn't seem to work without restarting the pod (and obviously losing the changes). I also tried to set v=5 on the efs-csi-node daemonset as well as the efs-csi-controller deployment (see the args sketch after the logs below), but there is still not much to see in the logs on either side:

  • efs-csi-controller (csi-provisioner)
W0612 08:58:02.113393       1 feature_gate.go:241] Setting GA feature gate Topology=true. It will be removed in a future release.
I0612 08:58:02.113459       1 feature_gate.go:249] feature gates: &{map[Topology:true]}
I0612 08:58:02.113507       1 csi-provisioner.go:154] Version: v3.5.0
I0612 08:58:02.113533       1 csi-provisioner.go:177] Building kube configs for running in cluster...
I0612 08:58:03.155569       1 common.go:111] Probing CSI driver for readiness
I0612 08:58:03.159033       1 csi-provisioner.go:230] Detected CSI driver efs.csi.aws.com
I0612 08:58:03.161060       1 csi-provisioner.go:302] CSI driver does not support PUBLISH_UNPUBLISH_VOLUME, not watching VolumeAttachments
I0612 08:58:03.161732       1 controller.go:732] Using saving PVs to API server in background
I0612 08:58:03.162378       1 leaderelection.go:245] attempting to acquire leader lease kube-system/efs-csi-aws-com...
  • efs-csi-controller (liveness-probe)
I0612 08:58:02.263863       1 main.go:149] calling CSI driver to discover driver name
I0612 08:58:02.266344       1 main.go:155] CSI driver name: "efs.csi.aws.com"
I0612 08:58:02.266388       1 main.go:183] ServeMux listening at "0.0.0.0:9909"
  • efs-csi-controller (efs-plugin)
I0612 08:58:02.158386       1 config_dir.go:63] Mounted directories do not exist, creating directory at '/etc/amazon/efs'
I0612 08:58:02.160552       1 metadata.go:63] getting MetadataService...
I0612 08:58:02.162319       1 metadata.go:68] retrieving metadata from EC2 metadata service
I0612 08:58:02.163244       1 cloud.go:137] EFS Client created using the following endpoint: https://elasticfilesystem.eu-west-1.amazonaws.com
I0612 08:58:02.163262       1 driver.go:84] Node Service capability for Get Volume Stats Not enabled
I0612 08:58:02.163367       1 driver.go:140] Did not find any input tags.
I0612 08:58:02.163544       1 driver.go:113] Registering Node Server
I0612 08:58:02.163553       1 driver.go:115] Registering Controller Server
I0612 08:58:02.163562       1 driver.go:118] Starting efs-utils watchdog
I0612 08:58:02.163706       1 efs_watch_dog.go:216] Copying /etc/amazon/efs/efs-utils.conf since it doesn't exist
I0612 08:58:02.163829       1 efs_watch_dog.go:216] Copying /etc/amazon/efs/efs-utils.crt since it doesn't exist
I0612 08:58:02.164879       1 driver.go:124] Starting reaper
I0612 08:58:02.164894       1 driver.go:127] Listening for connections on address: &net.UnixAddr{Name:"/var/lib/csi/sockets/pluginproxy/csi.sock", Net:"unix"}
I0612 08:58:03.159309       1 identity.go:37] GetPluginCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0612 08:58:03.160257       1 controller.go:417] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
  • efs-csi-node (liveness-probe)
I0612 09:00:09.421248       1 main.go:149] calling CSI driver to discover driver name
I0612 09:00:09.422425       1 main.go:155] CSI driver name: "efs.csi.aws.com"
I0612 09:00:09.422449       1 main.go:183] ServeMux listening at "0.0.0.0:9809"
  • efs-csi-node (csi-driver-registrar)
I0612 09:00:09.283756       1 main.go:167] Version: v2.8.0
I0612 09:00:09.283823       1 main.go:168] Running node-driver-registrar in mode=registration
I0612 09:00:09.284438       1 main.go:192] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0612 09:00:09.289374       1 main.go:199] Calling CSI driver to discover driver name
I0612 09:00:09.293048       1 node_register.go:53] Starting Registration Server at: /registration/efs.csi.aws.com-reg.sock
I0612 09:00:09.293221       1 node_register.go:62] Registration Server started at: /registration/efs.csi.aws.com-reg.sock
I0612 09:00:09.293453       1 node_register.go:92] Skipping HTTP server because endpoint is set to: ""
I0612 09:00:09.533955       1 main.go:102] Received GetInfo call: &InfoRequest{}
E0612 09:00:09.534095       1 main.go:107] "Failed to create registration probe file" err="mkdir /var/lib/kubelet: read-only file system" registrationProbePath="/var/lib/kubelet/plugins/efs.csi.aws.com/registration"
I0612 09:00:09.560524       1 main.go:121] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}

  • efs-csi-node (efs-plugin)
I0612 09:00:09.180667       1 config_dir.go:88] Creating symlink from '/etc/amazon/efs' to '/var/amazon/efs'
I0612 09:00:09.182325       1 metadata.go:63] getting MetadataService...
I0612 09:00:09.184196       1 metadata.go:68] retrieving metadata from EC2 metadata service
I0612 09:00:09.185214       1 cloud.go:137] EFS Client created using the following endpoint: https://elasticfilesystem.eu-west-1.amazonaws.com
I0612 09:00:09.185253       1 driver.go:84] Node Service capability for Get Volume Stats Not enabled
I0612 09:00:09.185345       1 driver.go:140] Did not find any input tags.
I0612 09:00:09.185607       1 driver.go:113] Registering Node Server
I0612 09:00:09.185637       1 driver.go:115] Registering Controller Server
I0612 09:00:09.185650       1 driver.go:118] Starting efs-utils watchdog
I0612 09:00:09.185743       1 efs_watch_dog.go:221] Skip copying /etc/amazon/efs/efs-utils.conf since it exists already
I0612 09:00:09.185761       1 efs_watch_dog.go:221] Skip copying /etc/amazon/efs/efs-utils.crt since it exists already
I0612 09:00:09.186166       1 driver.go:124] Starting reaper
I0612 09:00:09.186179       1 driver.go:127] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0612 09:00:09.535518       1 node.go:306] NodeGetInfo: called with args
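
For reference, this is roughly how I set v=5, by editing the container args in the daemonset (a sketch; the surrounding args follow the upstream manifest and may differ between chart versions):

containers:
  - name: csi-driver-registrar
    args:
      - --csi-address=$(ADDRESS)                            # existing arg from the manifest
      - --kubelet-registration-path=$(DRIVER_REG_SOCK_PATH) # existing arg from the manifest
      - --v=5                                               # added: klog verbosity level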

headyj avatar Jun 12 '23 09:06 headyj

Hi @headyj, the error is saying that the file system is read-only. Have you checked the security group for the EFS file system you are using and the inbound rules within that security group? The security group needs an inbound rule that accepts NFS traffic. More info on how the file system should be configured is here.

If the inbound rule is configured properly, you can follow this document to change the read-only setting within your EFS file system. More info on the node-driver-registrar is here.
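
For reference, the inbound rule would look something like this (a CloudFormation sketch; both security group IDs are placeholders):

# Allow NFS (TCP 2049) from the cluster's node security group to the
# security group attached to the EFS mount targets.
EfsAllowNfsFromNodes:
  Type: AWS::EC2::SecurityGroupIngress
  Properties:
    GroupId: sg-0123456789abcdef0                 # placeholder: EFS mount target SG
    IpProtocol: tcp
    FromPort: 2049
    ToPort: 2049
    SourceSecurityGroupId: sg-0fedcba9876543210   # placeholder: worker node SG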

arnavgup1 avatar Jun 14 '23 21:06 arnavgup1

Actually, we have been using this config for almost 3 years now, so I can assure you that the EFS file system is not read-only and that writes are working on the EFS drives.

Also, it seems that only some nodes are affected by these memory leaks, even though all of them show this error message. That's why it's a bit hard to identify the problem: all the containers of the efs-csi pods (daemonset) have the exact same logs, but only some of them are leaking. For some reason, draining and replacing the node solves the issue, but this is definitely not something we want to do each time we update the EFS plugin.

IMO, pods are never stopped on the node. After we update the plugin, all of the pods running on each node are stuck in Terminating, so I have to kill them using --force. But they probably continue to run endlessly on the node.

headyj avatar Jul 28 '23 14:07 headyj

I've noticed this new error too. The error does say err="mkdir /var/lib/kubelet: read-only file system", so I assume it is the container filesystem that is being written to?

I notice the newer 2.4.4 version of the Helm chart has added readOnlyRootFilesystem: true to most of the containers. This was not present in the earlier 2.4.3 version of the chart. @headyj, try patching the chart to set readOnlyRootFilesystem: false and see if that fixes it for you?

        - name: csi-driver-registrar
 ...
+           securityContext:
+             allowPrivilegeEscalation: false
+             readOnlyRootFilesystem: true

Looks like this commit to the 2.4.4 helm chart may have broken things: https://github.com/kubernetes-sigs/aws-efs-csi-driver/commit/eb6e3eadabafc3621ad892ef7ecbf7577e24705f

Good news is you can override this in the chart deployment values: https://github.com/kubernetes-sigs/aws-efs-csi-driver/commit/eb6e3eadabafc3621ad892ef7ecbf7577e24705f#diff-56338152bc066c1274cc12e455c5d0585a0ce0cb30831547f47a758d2a750862R36-R47

whereisaaron avatar Jul 29 '23 17:07 whereisaaron

I have the same issue after updating to the 2.4.9 Helm chart. Fixed by setting:

sidecars:
  nodeDriverRegistrar:
    securityContext:
      readOnlyRootFilesystem: false

evheniyt avatar Aug 25 '23 08:08 evheniyt

The registration probe error has been around forever... https://github.com/kubernetes-csi/node-driver-registrar/issues/213. It's also showing for the EKS addon.

mkim37 avatar Sep 08 '23 01:09 mkim37

With the latest Helm chart I'm getting this:

E0918 16:02:01.703074       1 main.go:107] "Failed to create registration probe file" err="mkdir /var/lib/kubelet: read-only file system" registrationProbePath="/var/lib/kubelet/plugins/efs.csi.aws.com/registration"

It looks like it expects this folder to be mounted from the host, but looking at the volume mounts of this container: https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/cb9d97d67ca1fb152ad860f8656bfcedb1f7cfc3/charts/aws-efs-csi-driver/templates/node-daemonset.yaml#L131-L135

This is not mounted from the host, hence it ends up on the container root filesystem, which is configured as read-only.
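
Something like the following would avoid the error (a sketch, not the chart's actual layout; "probe-dir" is a hypothetical volume name):

# Give the registrar a writable hostPath mount at the probe path so the
# probe file lands on the host instead of the read-only container root.
containers:
  - name: csi-driver-registrar
    volumeMounts:
      - name: probe-dir                 # hypothetical volume, not in the chart today
        mountPath: /var/lib/kubelet/plugins/efs.csi.aws.com
volumes:
  - name: probe-dir
    hostPath:
      path: /var/lib/kubelet/plugins/efs.csi.aws.com
      type: DirectoryOrCreate           # create the directory on the host if missing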

alfredkrohmer avatar Sep 26 '23 06:09 alfredkrohmer

The problem is noted in the Kubernetes docs -> https://kubernetes.io/docs/concepts/storage/volumes/#mount-propagation; there is a warning in that document which describes the actual problem.

Only privileged containers are able to use mountPropagation: "Bidirectional": https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/deploy/kubernetes/base/node-daemonset.yaml#L68 (the same applies to the Helm chart).

So the problem here is that it is trying to propagate that volume from the efs-plugin container to csi-driver-registrar, but the latter is not privileged.

The quicker fix is to make csi-driver-registrar privileged, so for example if using Helm you will need:

sidecars:
  nodeDriverRegistrar:
    securityContext:
      privileged: true
      allowPrivilegeEscalation: true

Of course this is still a bug and needs a fix: either csi-driver-registrar should be privileged by default, or the volumeMount should be set explicitly in that container too.

gcaracuel avatar Oct 24 '23 13:10 gcaracuel

If I understand the comment on the linked issue right, v2.9.0 of the csi-driver-registrar would be able to deal with readOnlyRootFilesystem?

See https://github.com/kubernetes-csi/node-driver-registrar/issues/213#issuecomment-1780286757

the-technat avatar Dec 07 '23 10:12 the-technat


We are using version v1.5.4 of the efs-csi-driver and it also has the memory leak issue: no matter how much memory I give it, some of the efs-csi-node pods get OOMKilled, even after I changed readOnlyRootFilesystem to false.

jiangfwa avatar Feb 08 '24 01:02 jiangfwa