aws-efs-csi-driver
Mount hanging indefinitely
/kind bug
What happened?
Processes accessing the NFS mount enter an uninterruptible sleep while trying to access the disk. This happens weekly, on different machines, in an EKS cluster. The only error dmesg reports is the NFS server not responding. There are no errors in the efs-csi-node pod logs.
What you expected to happen?
The mounts stay responsive.
How to reproduce it (as minimally and precisely as possible)?
This is an EKS cluster running 1.21 with 13 nodes. Everything works fine for an undetermined amount of time until the mount becomes unresponsive.
Anything else we need to know?:
Possibly related to https://github.com/aws/efs-utils/issues/61
sh-4.2$ nc -v localhost 20261
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 127.0.0.1:20261.
sh-4.2$ nc -v fs-0fe7dd1e29e866c20.efs.eu-west-2.amazonaws.com 2049
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 192.168.73.78:2049.
dmesg
[2335453.756731] INFO: task redis-server:17623 blocked for more than 1208 seconds.
[2335453.763204] Not tainted 5.4.141-67.229.amzn2.x86_64 #1
[2335453.767340] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2335453.774015] redis-server D 0 17623 17602 0x00000084
[2335453.778080] Call Trace:
[2335453.781004] __schedule+0x267/0x6c0
[2335453.838923] ? bit_wait+0x50/0x50
[2335453.842265] schedule+0x2f/0xa0
[2335453.845475] io_schedule+0x12/0x40
[2335453.848708] bit_wait_io+0xd/0x50
[2335453.851921] __wait_on_bit+0x2c/0x90
[2335453.855186] out_of_line_wait_on_bit+0x91/0xb0
[2335453.858814] ? init_wait_var_entry+0x40/0x40
[2335453.862425] nfs_lock_and_join_requests+0x8f/0x520 [nfs]
[2335453.866441] ? xas_load+0x8/0x80
[2335453.869618] ? find_get_entry+0xb4/0x150
[2335453.873039] nfs_updatepage+0x169/0x940 [nfs]
[2335453.876585] nfs_write_end+0x104/0x460 [nfs]
[2335453.880156] ? copyin+0x22/0x30
[2335453.883270] generic_perform_write+0xfc/0x1b0
[2335453.886842] nfs_file_write+0x107/0x2c0 [nfs]
[2335453.890479] new_sync_write+0x11b/0x1b0
[2335453.893804] vfs_write+0xad/0x1a0
[2335453.896988] ksys_write+0xa1/0xe0
[2335453.900191] do_syscall_64+0x48/0xf0
[2335453.903436] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[2335453.907280] RIP: 0033:0x7fd54efbf4a7
[2335453.910625] Code: Bad RIP value.
[2335453.913824] RSP: 002b:00007ffed8afcff0 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[2335453.920318] RAX: ffffffffffffffda RBX: 0000000000000021 RCX: 00007fd54efbf4a7
[2335453.926603] RDX: 0000000000000053 RSI: 00007fd544e43005 RDI: 0000000000000021
[2335453.932929] RBP: 00007fd544e43005 R08: 0000000000000002 R09: 000000003aadeee6
[2335453.939238] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000053
[2335453.945607] R13: 0000000000000021 R14: 0000000000000000 R15: 000055d493f5e520
[2383821.915870] nfs: server 127.0.0.1 not responding, still trying
Environment
- Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-eks-0389ca3", GitCommit:"8a4e27b9d88142bbdd21b997b532eb6d493df6d2", GitTreeState:"clean", BuildDate:"2021-07-31T01:34:46Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
- Driver version: 1.3.5
Same issue here. I noticed it because of:
- High CPU load on the node, but with low CPU usage
- No response from df -h
- strace df showing a hang at the EFS mount
- dmesg reporting nfs: server 127.0.0.1 not responding, still trying
EKS version: 1.21.5, driver version: 1.3.5
Edit: I upgraded the driver to 1.3.6 and still see the same issue. I think this is a critical P1 bug.
We had this again. I also have an open P1 support issue with AWS.
Interestingly a new pod started on the same machine and was able to connect to EFS. So it looks like a disconnect but it never recovers when we have this issue.
Mount options:
(rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,port=20073,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1)
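The port=20073 option above is the local stunnel listener that efs-utils inserts for TLS mounts. As a quick first-level check, you can extract that port from the mount options and probe it. A sketch (the options string is hard-coded here from this comment; on a live node you would read it from `mount` output for the hung mount):

```shell
#!/bin/bash
# Sketch: pull the local stunnel port out of a mount-options string and
# probe it. OPTS is the options line quoted above.
OPTS='(rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,port=20073,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1)'

# Extract the port= value (20073 here)
PORT=$(echo "$OPTS" | grep -o 'port=[0-9]\+' | cut -d= -f2)
echo "stunnel listens on 127.0.0.1:$PORT"

# Probe the listener with a short timeout. Note: a hung tunnel may still
# accept TCP connections (as the nc output earlier in this thread shows),
# so treat this only as a first-level check.
if timeout 2 bash -c "exec 3<>/dev/tcp/127.0.0.1/$PORT" 2>/dev/null; then
    echo "port $PORT is accepting connections"
else
    echo "port $PORT is not accepting connections"
fi
```

Since a stuck stunnel can still accept TCP connections, a timeout on an actual filesystem operation (stat/df against the mount point) is a stronger signal of a hang.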
@markmcdowell did you get any useful response from AWS?
Actually I have 2 EKS clusters: dev
and prod
,
The only difference between the 2 is the instance size, I use t3 in dev
and c5 in prod
, but I never see the issue happens in my dev cluster, so I changed my instance type to t3 in prod 9 days ago, at least till now it works very well without issues, will continue to keep eyes on it.
I didn’t get anything useful no, other than they thought it may coincide with releases they did.
We didn’t have the issue for 2 weeks then had it twice in 1 day. We’ve moved off efs and back to gp3 storage as it was the only way we could avoid it.
It happened again today; my pod lost its connection to EFS until I terminated the node :(
We're seeing the same thing. Idle CPU spike. First couple of times we panicked and just wiped the stuck nodes. Tonight we were able to fix by removing the efs-csi-node pod on the stuck node. Anyone know if we can downgrade to avoid this issue? We're running amazon/aws-efs-csi-driver:v1.3.6 on v1.20.11-eks-f17b81
We are facing the same situation.
Kubernetes Version: 1.17.2 Operating System: Amazon Linux 2 Kernel Version: 4.14.219-161.340.amzn2.x86_64 Docker Server Version: 20.10.7 aws-efs-csi-driver Version: amazon/aws-efs-csi-driver:v1.3.6
We build our own Kubernetes cluster on EC2, with workers distributed across two AZs in AWS. Currently, all the containers in AZ 1 are unable to access the EFS volume; the containers running in AZ 2 can access it normally. On an EC2 instance with 8 vCPUs, 4 of the cores are at 100% utilization all the time, mostly iowait. Executing df -h hangs.
In /var/log/messages a large number of errors are reported, as follows:
Mar 5 03:33:31 k8snode01 kernel: nfs: server 127.0.0.1 not responding, timed out
Mar 5 03:33:35 k8snode01 kernel: nfs: server 127.0.0.1 not responding, timed out
Mar 5 03:33:39 k8snode01 kernel: nfs: server 127.0.0.1 not responding, timed out
Mar 5 03:33:43 k8snode01 kernel: nfs: server 127.0.0.1 not responding, timed out
Mar 5 03:33:43 k8snode01 kernel: nfs: server 127.0.0.1 not responding, timed out
We were able to mount it successfully on all Kubernetes workers with "mount -t nfs <efs_endpoint>:/ /mnt". The endpoints in both AZs mount properly.
> We're seeing the same thing. Idle CPU spike. First couple of times we panicked and just wiped the stuck nodes. Tonight we were able to fix by removing the efs-csi-node pod on the stuck node. Anyone know if we can downgrade to avoid this issue? We're running amazon/aws-efs-csi-driver:v1.3.6 on v1.20.11-eks-f17b81
I triggered a container restart by executing "kill 1" in the efs-csi-node pods belonging to the affected AZ. Containers with mounted EFS volumes could then temporarily resume functioning.
@markmcdowell, can you provide more details about the workflow executed? Can you collect logs using the log_collector script?
Adding a workaround here for anyone who needs it, the following script monitors the stunnel processes running on the worker node and kills them if they are unresponsive. The killed processes are then re-created by the EFS CSI driver. The stunnel processes are used to add TLS encryption to NFS traffic, you can read more about it here
#!/bin/bash
WAIT_TIME=60 # Time in seconds to wait before polling stunnel processes
TIMEOUT=2    # Time in seconds to wait for a response from the remote EFS file system connected via stunnel
while true; do
    STUNNEL_NFS_MOUNTS=$(mount | grep nfs4 | grep 127.0.0.1 | awk '{print $3}')
    for STUNNEL_MOUNT in $STUNNEL_NFS_MOUNTS; do
        timeout $TIMEOUT stat -t $STUNNEL_MOUNT >/dev/null 2>&1
        if [ $? -eq 124 ]; then
            echo "[$(date)] Timeout when connecting to NFS, killing stunnel process"
            NAME=$(echo $STUNNEL_MOUNT | sed -r 's:/:.:g')
            PROCESS=$(ps -ef | grep stunnel | grep $NAME)
            PROCESS_NAME=$(echo $PROCESS | awk '{ print $8, $9 }')
            PROCESS_PID=$(echo $PROCESS | awk '{print $2}')
            kill -9 $PROCESS_PID
            echo "[$(date)] Killed process $PROCESS_PID '$PROCESS_NAME'"
        fi
    done
    sleep $WAIT_TIME
done
The above script can be invoked from a regular crontab entry or run as a systemd service (see below). Running it as a daemonset is also possible, but you would need to run the container as root, mount /var/lib/kubelet to see the NFS mount points created by the EFS CSI driver, and specify hostPID: true in the pod spec to view and kill stunnel processes.
[Unit]
Description=Monitor health of stunnel processes created by the EFS CSI driver
[Service]
ExecStart=<SCRIPT_LOCATION>
[Install]
WantedBy=multi-user.target
If you are using EKS managed nodegroups with a launch template specified (see here), you can use EC2 user data to make the above configuration changes as new EKS worker nodes are being provisioned.
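As a sketch of that user-data approach (the script path, unit name, and heredoc contents below are assumptions, not anything fixed by the driver), the launch template's user data could drop the monitor script and unit file in place and enable the service at boot:

```shell
#!/bin/bash
# Hypothetical launch-template user data: install the stunnel monitor
# script and a systemd unit on each new worker node at boot. Paths and
# the unit name are examples only.
set -euo pipefail

cat > /usr/local/bin/stunnel_health_check.sh <<'EOF'
#!/bin/bash
# ... paste the monitor script from this thread here ...
EOF
chmod 0755 /usr/local/bin/stunnel_health_check.sh

cat > /etc/systemd/system/stunnel-health-monitor.service <<'EOF'
[Unit]
Description=Monitor health of stunnel processes created by the EFS CSI driver
[Service]
ExecStart=/usr/local/bin/stunnel_health_check.sh
[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now stunnel-health-monitor.service
```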
If anyone else wants to try this workaround, feel free to do so but please test before deploying to a production environment.
@kvifern thanks for your workaround, I am testing it in our development environment.
For anyone who wants to run this as a daemonset - https://gist.github.com/0nebody/f55a376f2c0ec3f9968168d271763fca
Hey folks,
Has anyone tried to upgrade their stunnel version to latest 5.63 on the node and see whether the issue is happening anymore?
The stunnel shipped with efs-utils is pretty old (4.56) for various reasons. We suspect something is wrong in stunnel itself: the re-establishment of the connection fails silently and stunnel cannot connect back to the server, which causes the nfs not responding issue. For example, from the stunnel debug log:
# Failed when connect to mount target
2022.03.17 02:19:58 LOG7[1105:139990955783936]: connect_blocking: s_poll_wait $MT_IP:2049: waiting 10 seconds -> There should be a `TIMEOUTconnect exceeded` failure if the connection failed, but there is no information about how the wait is progressing
2022.03.17 02:56:09 LOG7[1105:139990955789440]: Dispatching signals from the signal pipe
# Failed when hand shake
2022.03.23 16:10:49 LOG6[19003:140551554406144]: CERT: Host name "$FS_ID.efs.us-east-1.amazonaws.com" matched with "*.efs.us-east-1.amazonaws.com" -> Again, stunnel stopped working in the middle of the connection re-establishment
2022.03.23 16:56:09 LOG7[19003:140551554411648]: Dispatching signals from the signal pipe
Hi @0nebody, nice job with the workaround daemonset. Just be aware that you may not need to mount /proc since hostPID is already set to true: the containers within the pod use the host PID namespace and can already see and interact with processes in the host namespace.
If you need to test the effect of the script, you can temporarily update your EFS security group to remove/block access on port 2049. This would cause the NFS timeouts to eventually appear in kernel logs which would then trigger the script to execute.
Hi @Cappuccinuo,
Stunnel is not included/installed in the EKS optimised AMI/worker nodes; instead it is installed within the EFS CSI Driver image (see here), which runs as a daemonset in an EKS cluster. The Docker image clones the efs-utils GitHub repo and installs the amazon-efs-utils package as per this spec file, which specifies stunnel version 4.56. Therefore upgrading it is a bit challenging; it looks like you have to build your own EFS CSI driver image to update stunnel. Is it possible for a Docker image version/tag of the EFS CSI driver to be made available with stunnel upgraded to 5.63?
Thanks @kvifern ,
Yeah, I haven't figured out that part of the stunnel upgrade yet. As stunnel is a dependency of efs-utils, the stunnel executable itself should be installed on the node; the upgrade logic could go in the Dockerfile.
For the workaround script, we should use the df command instead: stat takes advantage of the kernel cache, so it may not detect the unhealthy stunnel in time because the result is served from the page cache.
Posting here a generally usable script for both plain EC2 and the CSI driver. You need to install the jq package first with sudo yum -y install jq.
This script reads the stunnel process id from the state file of the mount.
# stunnel_health_check.sh
#!/bin/bash
if ! jq --version >/dev/null 2>&1; then
    echo "Please install the jq package first"
    exit 1
fi
WAIT_TIME=60 # Time in seconds to wait before polling stunnel processes
TIMEOUT=2    # Time in seconds to wait for a response from the remote EFS file system connected via stunnel
echo "Running stunnel health check script"
while true; do
    STUNNEL_NFS_MOUNTS=$(mount | grep nfs4 | grep 127.0.0.1 | awk '{print $3}')
    for STUNNEL_MOUNT in $STUNNEL_NFS_MOUNTS; do
        echo "Checking stunnel health for $STUNNEL_MOUNT"
        timeout $TIMEOUT df $STUNNEL_MOUNT >/dev/null 2>&1
        if [ $? -eq 124 ]; then
            echo "[$(date)] Timeout when connecting to NFS, killing stunnel process"
            PORT=$(mount | grep nfs4 | grep 127.0.0.1 | grep $STUNNEL_MOUNT | grep -o 'port=[0-9]\+' | cut -d "=" -f 2-)
            STATE_FILE=$(ls -1 /var/run/efs | grep -v 'stunnel-config.*' | grep "$PORT$")
            PROCESS_PID=$(jq '.pid' < /var/run/efs/$STATE_FILE)
            kill -9 $PROCESS_PID
            echo "[$(date)] Killed stunnel process pid: $PROCESS_PID mount: $STUNNEL_MOUNT port: $PORT"
        else
            echo "Stunnel health OK for $STUNNEL_MOUNT"
        fi
    done
    sleep $WAIT_TIME
done
Record the script location, then create the following file in /lib/systemd/system; remember to replace /home/ec2-user/stunnel_health_check.sh with the location on your own instance.
# amazon-efs-tls-mount-stunnel-health-monitor.service
[Unit]
Description=Monitor health of stunnel processes created by the tls mounts using efs-utils
Requires=amazon-efs-mount-watchdog.service
[Service]
ExecStart=/bin/bash /home/ec2-user/stunnel_health_check.sh
[Install]
WantedBy=multi-user.target
Then run the following command to enable the health monitor service.
sudo systemctl enable amazon-efs-tls-mount-stunnel-health-monitor.service
sudo systemctl start amazon-efs-tls-mount-stunnel-health-monitor.service
You can check the stunnel monitor service by running the following command
sudo systemctl status amazon-efs-tls-mount-stunnel-health-monitor.service
To stop the service
sudo systemctl stop amazon-efs-tls-mount-stunnel-health-monitor.service
sudo systemctl disable amazon-efs-tls-mount-stunnel-health-monitor.service
To check the logs
journalctl -u amazon-efs-tls-mount-stunnel-health-monitor.service
Hey folks,
You can try the latest driver to see if the issue is fixed. Thanks.
kubectl kustomize "github.com/kubernetes-sigs/aws-efs-csi-driver/deploy/kubernetes/overlays/stable/?ref=release-1.3" > driver.yaml
kubectl apply -f driver.yaml
Hi @Cappuccinuo , that would be great. ~~But i can not see any closed PR that tackles this issue here. What change might prevent the hanging mounts?~~ Just found the connected efs-utils v1.32.1 in the release notes. We will try the new fix.
@Cappuccinuo it seems to me that the efs-utils stunnel watcher runs on a 5 minute interval.
Are we able to speed it up on our end? We were accustomed to the 60 second interval with the fix above.
@msolberg8 Yes this can be changed on the instance efs-utils config file(/etc/amazon/efs/efs-utils.conf), https://github.com/aws/efs-utils/blob/master/dist/efs-utils.conf#L67-L70
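For example, a sketch of lowering that interval with sed. The key name stunnel_health_check_interval_min is taken from the linked efs-utils.conf lines; verify it against your installed file, as it is an assumption here:

```shell
#!/bin/bash
# Sketch: drop the watchdog's stunnel health check interval from the
# default 5 minutes to 1 minute. Key name assumed from the linked
# efs-utils.conf; check your installed copy before relying on it.
CONF=/etc/amazon/efs/efs-utils.conf

sudo sed -i 's/^stunnel_health_check_interval_min *=.*/stunnel_health_check_interval_min = 1/' "$CONF"

# Restart the watchdog so it picks up the new interval
sudo systemctl restart amazon-efs-mount-watchdog.service
```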
@Cappuccinuo thanks!!
Any idea why, all of a sudden, stunnel needs a health check? While we're very thankful for the update, this seems like a band-aid rather than a root cause fix.
Also, are there plans to update the helm chart to support easier configuration of the stunnel healthcheck?
@msolberg8
As far as I can tell the culprit is the stunnel software itself: the process gets stuck and becomes a zombie process after a certain period, and the only way to fix this is to terminate it and let the watchdog launch a new one. Unless people who try stunnel 5.63 find the same issue happening, in which case we can report it to the stunnel side and engage them for a root cause, they won't look into an old version. I tried with 5.63 and haven't encountered any hanging issue so far.
For the helm chart one, what's the experience when you need to modify the configuration? I bet you need to modify the efs-utils.conf file (/etc/amazon/efs/efs-utils.conf) on the node instance, right?
+1 to setting this with the helm chart. Updating all of the node groups with custom launch templates is quite the chore to configure this one setting.
@Cappuccinuo is there any plan in place to upgrade the stunnel in the official version to 5.63?
From what I understand stunnel is pulled in when aws-efs-utils is installed. Would it be worth making an issue over there?
And to answer your question above yes we would need to set that up at the node level which we'd rather not do. We try and keep the nodes as plain jane as we can.
Yeah, we are working on the efs-utils side; you can track https://github.com/aws/efs-utils/issues/99. We need to coordinate with the repository side to publish stunnel 5.x.
Subscribing, as we have a similar issue but killing stunnel doesn't fix it for us:
Jun 7 01:00:38 [ip] kernel: nfs: server 127.0.0.1 not responding, timed out
Is there something else to check/restart? (vs killing the node; I'd like to understand how it gets into this state and find a workaround to unblock it)
edit: killing these 2 processes recovered it (ps, btw, was hanging too!):
root 18314 0.0 0.0 717228 13624 ? Ssl May25 3:21 /bin/aws-efs-csi-driver --endpoint=unix:///var/lib/csi/sockets/pluginproxy/csi.sock --logtostderr --v=2 --delete-access-point-root-dir=false --vol-metrics-opt-in=false
root 20172 0.0 0.1 717228 16288 ? Ssl May25 13:41 /bin/aws-efs-csi-driver --endpoint=unix:/csi/csi.sock --logtostderr --v=2
and back to
Jun 7 01:04:16 [ip] kernel: nfs: server 127.0.0.1 OK
In our case the NFS issue was triggered by a pod overloading the node. We got it resolved by setting resource limits to the problematic deployment.
I'm facing exactly the same issue as pointed by https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues/616#issuecomment-1022959423
k8s version: v1.21.12-eks-a64ea69 aws-efs-csi-driver: v1.3.3
Killing the stunnel process, as mentioned in https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues/616#issuecomment-1022959423 and https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues/616#issuecomment-1072965716, didn't help me. I also tried https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues/616#issuecomment-1148085899 without any success.
My observation:
- This issue was happening on nodes with pods using a very particular efs volume, x.
- I deleted the PV and PVC followed by recreating it.
- After force killing the pod, I saw the following in the kubelet logs:
26930 kubelet_volumes.go:225] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"2174a688-3672-4d6e-a775-58627b61d629\" found, but error not a directory occurred when trying to remove the volumes dir" numErrs=3
26930 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/efs.csi.aws.com^fs-68ece2b9::fsap-078492ba97b679d61 podName:3b06e0d6-63c8-45fc-bcde-6906cdd0d395 nodeName:}" failed. No retries permitted until 2022-06-08 07:06:03.158716657 +0000 UTC m=+5949202.959554767 (durationBeforeRetry 2m2s). Error: "UnmountVolume.TearDown failed for volume \"x\" (UniqueName: \"kubernetes.io/csi/efs.csi.aws.com^fs-68ece2b9::fsap-078492ba97b679d61\") pod \"3b06e0d6-63c8-45fc-bcde-6906cdd0d395\" (UID: \"3b06e0d6-63c8-45fc-bcde-6906cdd0d395\") : kubernetes.io/csi: mounter.TearDownAt failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
- For the buggy mount path for "x", umount failed. umount -f showed the device was busy, so I did umount -l. df started working after that, but we were still seeing "nfs: server 127.0.0.1 not responding, timed out". New NFS mounts and umounts still work, though.
- To make matters worse, existing mounts stop working.
Faced the exact same issue; we were using the 1.3.6 tag.
Observations
- Node load average keeps increasing (on some nodes with 16 cores it went as high as 3k)
- We can observe nfs: server 127.0.0.1 not responding, timed out
- Our main app process gets completely stuck trying to access the disk
- df or ls on the problematic mounted directory hangs
For us, the stunnel process went defunct; we had to tweak the workaround script to kill all defunct stunnel processes in a loop.
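For anyone needing the same tweak, a small sketch of detecting defunct stunnel processes from ps output (the column layout is assumed from ps -eo pid,stat,comm; note a zombie cannot be killed directly — you signal or restart its parent, which is what the loop above does):

```shell
#!/bin/bash
# Sketch: list PIDs of defunct (zombie) stunnel processes. Zombies ignore
# signals; in practice the watchdog (or killing the parent) reaps them,
# so treat this as the detection half of the loop described above.
find_defunct_stunnel() {
    # Expects `ps -eo pid,stat,comm` style input on stdin
    awk '$2 ~ /^Z/ && $3 ~ /stunnel/ {print $1}'
}

ps -eo pid,stat,comm | find_defunct_stunnel
```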
Fix: We got to know that the 1.3.7 tag has the fix for this; they implemented a watchdog that restarts the hanging process. The fix was made in efs-utils v1.32.1 and is part of the 1.3.7 release tag, see the changelog.
Thanks @markmcdowell for starting this thread, it is very useful.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Still waiting for https://github.com/aws/efs-utils/issues/99 to be closed, then a new version of the driver should be available.