aws-efs-csi-driver
Failed to create registration probe file error after updating to v2.4.4
/kind bug
What happened?
After updating from v2.4.3 to v2.4.4, memory usage seems to have more than doubled on the efs-csi-node pods, and some of them also seem to have memory leaks.
Also seeing this error in the csi-driver-registrar container:
Failed to create registration probe file" err="mkdir /var/lib/kubelet: read-only file system" registrationProbePath="/var/lib/kubelet/plugins/efs.csi.aws.com/registration"
The main consequence seems to be that some pods fail to mount their EFS volumes, with a "timed out waiting for the condition" error.
How to reproduce it (as minimally and precisely as possible)?
Just update from v2.4.3 to v2.4.4
Environment
- Kubernetes version (use kubectl version): v1.26 (EKS)
- Driver version: v2.4.4
As with this issue, it seems on my side that rolling back to v2.4.3 is not sufficient: some efs-csi-node pods still had their memory usage growing endlessly (even after a restart). Only draining and replacing the nodes on which these efs-csi-node pods were running seems to solve the issue.
Hi @headyj, can you install the latest Helm chart version, 2.4.5, and see whether the issue still persists? If yes, can you share the debug logs? https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/troubleshooting
I will not be able to test the memory leak issue, as it's basically breaking our environments. What I can see is that, despite the Failed to create registration probe file error, draining and replacing the nodes seems to fix the memory leak issue.
What I can tell you, though, is that I still have the Failed to create registration probe file error with 2.4.5. I tried to set logging_level to DEBUG as explained in the troubleshooting guide, but it doesn't seem to work without restarting the pod (and obviously losing the changes). I also tried setting v=5 on the efs-csi-node DaemonSet as well as the efs-csi-controller Deployment (see the values sketch after the logs below), but there is still not much to see in the logs on either side:
- efs-csi-controller (csi-provisioner)
W0612 08:58:02.113393 1 feature_gate.go:241] Setting GA feature gate Topology=true. It will be removed in a future release.
I0612 08:58:02.113459 1 feature_gate.go:249] feature gates: &{map[Topology:true]}
I0612 08:58:02.113507 1 csi-provisioner.go:154] Version: v3.5.0
I0612 08:58:02.113533 1 csi-provisioner.go:177] Building kube configs for running in cluster...
I0612 08:58:03.155569 1 common.go:111] Probing CSI driver for readiness
I0612 08:58:03.159033 1 csi-provisioner.go:230] Detected CSI driver efs.csi.aws.com
I0612 08:58:03.161060 1 csi-provisioner.go:302] CSI driver does not support PUBLISH_UNPUBLISH_VOLUME, not watching VolumeAttachments
I0612 08:58:03.161732 1 controller.go:732] Using saving PVs to API server in background
I0612 08:58:03.162378 1 leaderelection.go:245] attempting to acquire leader lease kube-system/efs-csi-aws-com...
- efs-csi-controller (liveness-probe)
I0612 08:58:02.263863 1 main.go:149] calling CSI driver to discover driver name
I0612 08:58:02.266344 1 main.go:155] CSI driver name: "efs.csi.aws.com"
I0612 08:58:02.266388 1 main.go:183] ServeMux listening at "0.0.0.0:9909"
- efs-csi-controller (efs-plugin)
I0612 08:58:02.158386 1 config_dir.go:63] Mounted directories do not exist, creating directory at '/etc/amazon/efs'
I0612 08:58:02.160552 1 metadata.go:63] getting MetadataService...
I0612 08:58:02.162319 1 metadata.go:68] retrieving metadata from EC2 metadata service
I0612 08:58:02.163244 1 cloud.go:137] EFS Client created using the following endpoint: https://elasticfilesystem.eu-west-1.amazonaws.com
I0612 08:58:02.163262 1 driver.go:84] Node Service capability for Get Volume Stats Not enabled
I0612 08:58:02.163367 1 driver.go:140] Did not find any input tags.
I0612 08:58:02.163544 1 driver.go:113] Registering Node Server
I0612 08:58:02.163553 1 driver.go:115] Registering Controller Server
I0612 08:58:02.163562 1 driver.go:118] Starting efs-utils watchdog
I0612 08:58:02.163706 1 efs_watch_dog.go:216] Copying /etc/amazon/efs/efs-utils.conf since it doesn't exist
I0612 08:58:02.163829 1 efs_watch_dog.go:216] Copying /etc/amazon/efs/efs-utils.crt since it doesn't exist
I0612 08:58:02.164879 1 driver.go:124] Starting reaper
I0612 08:58:02.164894 1 driver.go:127] Listening for connections on address: &net.UnixAddr{Name:"/var/lib/csi/sockets/pluginproxy/csi.sock", Net:"unix"}
I0612 08:58:03.159309 1 identity.go:37] GetPluginCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0612 08:58:03.160257 1 controller.go:417] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
- efs-csi-node (liveness-probe)
I0612 09:00:09.421248 1 main.go:149] calling CSI driver to discover driver name
I0612 09:00:09.422425 1 main.go:155] CSI driver name: "efs.csi.aws.com"
I0612 09:00:09.422449 1 main.go:183] ServeMux listening at "0.0.0.0:9809"
- efs-csi-node (csi-driver-registrar)
I0612 09:00:09.283756 1 main.go:167] Version: v2.8.0
I0612 09:00:09.283823 1 main.go:168] Running node-driver-registrar in mode=registration
I0612 09:00:09.284438 1 main.go:192] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0612 09:00:09.289374 1 main.go:199] Calling CSI driver to discover driver name
I0612 09:00:09.293048 1 node_register.go:53] Starting Registration Server at: /registration/efs.csi.aws.com-reg.sock
I0612 09:00:09.293221 1 node_register.go:62] Registration Server started at: /registration/efs.csi.aws.com-reg.sock
I0612 09:00:09.293453 1 node_register.go:92] Skipping HTTP server because endpoint is set to: ""
I0612 09:00:09.533955 1 main.go:102] Received GetInfo call: &InfoRequest{}
E0612 09:00:09.534095 1 main.go:107] "Failed to create registration probe file" err="mkdir /var/lib/kubelet: read-only file system" registrationProbePath="/var/lib/kubelet/plugins/efs.csi.aws.com/registration"
I0612 09:00:09.560524 1 main.go:121] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
- efs-csi-node (efs-plugin)
I0612 09:00:09.180667 1 config_dir.go:88] Creating symlink from '/etc/amazon/efs' to '/var/amazon/efs'
I0612 09:00:09.182325 1 metadata.go:63] getting MetadataService...
I0612 09:00:09.184196 1 metadata.go:68] retrieving metadata from EC2 metadata service
I0612 09:00:09.185214 1 cloud.go:137] EFS Client created using the following endpoint: https://elasticfilesystem.eu-west-1.amazonaws.com
I0612 09:00:09.185253 1 driver.go:84] Node Service capability for Get Volume Stats Not enabled
I0612 09:00:09.185345 1 driver.go:140] Did not find any input tags.
I0612 09:00:09.185607 1 driver.go:113] Registering Node Server
I0612 09:00:09.185637 1 driver.go:115] Registering Controller Server
I0612 09:00:09.185650 1 driver.go:118] Starting efs-utils watchdog
I0612 09:00:09.185743 1 efs_watch_dog.go:221] Skip copying /etc/amazon/efs/efs-utils.conf since it exists already
I0612 09:00:09.185761 1 efs_watch_dog.go:221] Skip copying /etc/amazon/efs/efs-utils.crt since it exists already
I0612 09:00:09.186166 1 driver.go:124] Starting reaper
I0612 09:00:09.186179 1 driver.go:127] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0612 09:00:09.535518 1 node.go:306] NodeGetInfo: called with args
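For reference, the same verbosity bump can also be expressed as chart values instead of editing the workloads directly; this is only a sketch and assumes the chart version you use exposes controller.logLevel / node.logLevel (check its values.yaml):
controller:
  logLevel: 5   # assumed chart value; maps to the containers' --v flag
node:
  logLevel: 5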
Hi @headyj, the error is saying that the file system is read-only. Have you checked the security group for the EFS file system you are using and the inbound rules within that security group? The security group needs an inbound rule that accepts NFS traffic. More info on how the file system should be configured is here.
If the inbound rule is configured properly, you can follow this document to change the read-only setting on your EFS file system. More info on the node-driver-registrar is here.
Actually, we have been using this config for almost 3 years now, so I can assure you that the EFS file system is not read-only and that writes to the EFS volumes are working.
Also, it seems that only some nodes are affected by these memory leaks, even though all of them show this error message. That's why the problem is hard to pin down: all the containers of the efs-csi pods (DaemonSet) have exactly the same logs, but only some of them are leaking. For some reason, draining the node and replacing it seems to solve the issue, but this is definitely not something we want to do each time we update the EFS plugin.
IMO, pods are never stopped on the node. After we update the plugin, all of the pods running on each node are stuck in Terminating, so I have to kill them using --force. But they probably continue to run endlessly on the node.
I've noticed this new error too. The error does say err="mkdir /var/lib/kubelet: read-only file system", so I assume it is the container's root filesystem that is being written to?
I noticed that the newer 2.4.4 version of the Helm chart has added readOnlyRootFilesystem: true to most of the containers. This was not present in the earlier 2.4.3 version of the chart. @headyj, try patching the chart to set readOnlyRootFilesystem: false and see if that fixes it for you?
  - name: csi-driver-registrar
    ...
+   securityContext:
+     allowPrivilegeEscalation: false
+     readOnlyRootFilesystem: true
Looks like this commit to the 2.4.4 helm chart may have broken things: https://github.com/kubernetes-sigs/aws-efs-csi-driver/commit/eb6e3eadabafc3621ad892ef7ecbf7577e24705f
Good news is you can override this in the chart deployment values: https://github.com/kubernetes-sigs/aws-efs-csi-driver/commit/eb6e3eadabafc3621ad892ef7ecbf7577e24705f#diff-56338152bc066c1274cc12e455c5d0585a0ce0cb30831547f47a758d2a750862R36-R47
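In a values file that override would look roughly like this (the sidecars.nodeDriverRegistrar.securityContext path is taken from the linked values.yaml; verify it against the chart version you deploy):
sidecars:
  nodeDriverRegistrar:
    securityContext:
      # readOnlyRootFilesystem is the field the 2.4.4 chart flipped to true
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: false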
I have the same issue after updating to the 2.4.9 Helm chart. Fixed by setting:
sidecars:
  nodeDriverRegistrar:
    securityContext:
      readOnlyRootFilesystem: false
The registration probe error has been around forever... https://github.com/kubernetes-csi/node-driver-registrar/issues/213. It also shows up for the EKS add-on.
With the latest Helm chart I'm getting this:
E0918 16:02:01.703074 1 main.go:107] "Failed to create registration probe file" err="mkdir /var/lib/kubelet: read-only file system" registrationProbePath="/var/lib/kubelet/plugins/efs.csi.aws.com/registration"
It looks like it expects this folder to be mounted from the host, but looking at the volume mounts of this container: https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/cb9d97d67ca1fb152ad860f8656bfcedb1f7cfc3/charts/aws-efs-csi-driver/templates/node-daemonset.yaml#L131-L135
This is not mounted from the host, hence it ends up on the container root filesystem, which is configured as read-only.
The problem is noted in the Kubernetes docs -> https://kubernetes.io/docs/concepts/storage/volumes/#mount-propagation; the warning in that document describes the actual problem.
Only privileged containers are able to use mountPropagation: "Bidirectional" https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/deploy/kubernetes/base/node-daemonset.yaml#L68 (the same applies to the Helm chart).
So the problem here is that it is trying to propagate that volume from the efs-plugin container to csi-driver-registrar, but the latter is not privileged.
The quicker fix is to make csi-driver-registrar privileged as well, so for example if using Helm you will need:
sidecars:
  nodeDriverRegistrar:
    securityContext:
      privileged: true
      allowPrivilegeEscalation: true
Of course this is still a bug and needs a fix: either csi-driver-registrar should be privileged by default, or the volumeMount should be set explicitly in that container too.
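For the second option, a rough sketch of what the explicit mount could look like on the registrar container in the node DaemonSet (untested; "plugin-dir" is assumed to be the existing hostPath volume for the plugin directory, so verify against your rendered manifest):
# Fragment of the csi-driver-registrar container spec (sketch only).
# Assumes a hostPath volume named "plugin-dir" pointing at
# /var/lib/kubelet/plugins/efs.csi.aws.com/ already exists in the pod spec.
- name: csi-driver-registrar
  volumeMounts:
    - name: plugin-dir
      mountPath: /var/lib/kubelet/plugins/efs.csi.aws.com/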
If I understand the comment on the linked issue right, v2.9.0 of the csi-driver-registrar would be able to deal with readOnlyRootFilesystem: true?
See https://github.com/kubernetes-csi/node-driver-registrar/issues/213#issuecomment-1780286757
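If so, one way to try it without waiting for a chart release might be to pin the sidecar image via the chart values (untested; assumes the chart exposes sidecars.nodeDriverRegistrar.image — check values.yaml for your version):
sidecars:
  nodeDriverRegistrar:
    image:
      # assumed value path; the default registry/repository depends on the chart version
      tag: v2.9.0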
We are using version v1.5.4 of the efs-csi-driver and it also has the memory leak issue: no matter how much memory I give it, some of the efs-csi-node pods get OOMKilled, even after I changed readOnlyRootFilesystem to false.