
Mount hanging indefinitely

Open markmcdowell opened this issue 2 years ago • 32 comments

/kind bug

What happened?

Processes accessing the NFS mount enter an uninterruptible sleep while trying to access the disk. This happens weekly on different machines in an EKS cluster. The only error dmesg reports is the NFS server not responding. There are no errors on the efs-csi-node pod.
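
For reference, a generic way to list the tasks stuck in uninterruptible sleep (state D) on an affected node; just a diagnostic sketch, nothing specific to this cluster:

# List processes in uninterruptible sleep (state D) and the kernel function they are blocked in
ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /D/'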

What you expected to happen?

The mounts stay responsive.

How to reproduce it (as minimally and precisely as possible)?

This is an EKS cluster running 1.21 with 13 nodes. Everything works fine for an undetermined amount of time until the mount becomes unresponsive.

Anything else we need to know?:

Possibly related to https://github.com/aws/efs-utils/issues/61

sh-4.2$ nc -v localhost 20261
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 127.0.0.1:20261.

sh-4.2$ nc -v fs-0fe7dd1e29e866c20.efs.eu-west-2.amazonaws.com 2049
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 192.168.73.78:2049.

dmesg

[2335453.756731] INFO: task redis-server:17623 blocked for more than 1208 seconds.
[2335453.763204]       Not tainted 5.4.141-67.229.amzn2.x86_64 #1
[2335453.767340] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2335453.774015] redis-server    D    0 17623  17602 0x00000084
[2335453.778080] Call Trace:
[2335453.781004]  __schedule+0x267/0x6c0
[2335453.838923]  ? bit_wait+0x50/0x50
[2335453.842265]  schedule+0x2f/0xa0
[2335453.845475]  io_schedule+0x12/0x40
[2335453.848708]  bit_wait_io+0xd/0x50
[2335453.851921]  __wait_on_bit+0x2c/0x90
[2335453.855186]  out_of_line_wait_on_bit+0x91/0xb0
[2335453.858814]  ? init_wait_var_entry+0x40/0x40
[2335453.862425]  nfs_lock_and_join_requests+0x8f/0x520 [nfs]
[2335453.866441]  ? xas_load+0x8/0x80
[2335453.869618]  ? find_get_entry+0xb4/0x150
[2335453.873039]  nfs_updatepage+0x169/0x940 [nfs]
[2335453.876585]  nfs_write_end+0x104/0x460 [nfs]
[2335453.880156]  ? copyin+0x22/0x30
[2335453.883270]  generic_perform_write+0xfc/0x1b0
[2335453.886842]  nfs_file_write+0x107/0x2c0 [nfs]
[2335453.890479]  new_sync_write+0x11b/0x1b0
[2335453.893804]  vfs_write+0xad/0x1a0
[2335453.896988]  ksys_write+0xa1/0xe0
[2335453.900191]  do_syscall_64+0x48/0xf0
[2335453.903436]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[2335453.907280] RIP: 0033:0x7fd54efbf4a7
[2335453.910625] Code: Bad RIP value.
[2335453.913824] RSP: 002b:00007ffed8afcff0 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[2335453.920318] RAX: ffffffffffffffda RBX: 0000000000000021 RCX: 00007fd54efbf4a7
[2335453.926603] RDX: 0000000000000053 RSI: 00007fd544e43005 RDI: 0000000000000021
[2335453.932929] RBP: 00007fd544e43005 R08: 0000000000000002 R09: 000000003aadeee6
[2335453.939238] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000053
[2335453.945607] R13: 0000000000000021 R14: 0000000000000000 R15: 000055d493f5e520
[2383821.915870] nfs: server 127.0.0.1 not responding, still trying

Environment

  • Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-eks-0389ca3", GitCommit:"8a4e27b9d88142bbdd21b997b532eb6d493df6d2", GitTreeState:"clean", BuildDate:"2021-07-31T01:34:46Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
  • Driver version: 1.3.5

markmcdowell avatar Jan 18 '22 14:01 markmcdowell

Same issue. I noticed it because of:

  • High cpu load on the node, but with low cpu usage
  • No response from df -h
  • Hang at the EFS mount when running strace df
  • dmesg reporting nfs: server 127.0.0.1 not responding, still trying

EKS version: 1.21.5, driver version: 1.3.5

Edited: I upgraded the driver to version 1.3.6 and still have the same issue. I think this is a critical P1 bug.

yongzhang avatar Jan 27 '22 08:01 yongzhang

We had this again. I also have an open P1 support issue with AWS.

Interestingly, a new pod started on the same machine was able to connect to EFS. So it looks like a disconnect, but the existing mount never recovers when we have this issue.

Mount options: (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,port=20073,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1)

markmcdowell avatar Feb 03 '22 13:02 markmcdowell

@markmcdowell did you get any useful response from AWS?

Actually I have 2 EKS clusters, dev and prod. The only difference between the two is the instance type: I use t3 in dev and c5 in prod, but I have never seen the issue happen in my dev cluster. So I changed the instance type to t3 in prod 9 days ago; so far it has worked well without issues. I will continue to keep an eye on it.

yongzhang avatar Feb 16 '22 09:02 yongzhang

No, I didn’t get anything useful, other than that they thought it may coincide with releases they had done.

We didn’t have the issue for 2 weeks, then had it twice in 1 day. We’ve moved off EFS and back to gp3 storage, as it was the only way we could avoid it.

markmcdowell avatar Feb 17 '22 12:02 markmcdowell

It happened again today; my pod lost its connection to EFS until I terminated the node :(


yongzhang avatar Feb 21 '22 23:02 yongzhang

We're seeing the same thing. Idle CPU spike. The first couple of times we panicked and just wiped the stuck nodes. Tonight we were able to fix it by removing the efs-csi-node pod on the stuck node. Does anyone know if we can downgrade to avoid this issue? We're running amazon/aws-efs-csi-driver:v1.3.6 on v1.20.11-eks-f17b81.
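
For anyone else hitting this, a rough sketch of the pod restart (the kube-system namespace and the app=efs-csi-node label are assumptions based on the default deployment manifests; adjust to your install):

# Delete the efs-csi-node pod on the stuck node so the daemonset recreates it
kubectl -n kube-system delete pod -l app=efs-csi-node --field-selector spec.nodeName=<STUCK_NODE_NAME>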

msolberg8 avatar Mar 01 '22 03:03 msolberg8

We are facing the same situation.

  • Kubernetes Version: 1.17.2
  • Operating System: Amazon Linux 2
  • Kernel Version: 4.14.219-161.340.amzn2.x86_64
  • Docker Server Version: 20.10.7
  • aws-efs-csi-driver Version: amazon/aws-efs-csi-driver:v1.3.6

We built our own Kubernetes cluster on EC2, with Kubernetes workers distributed across two AZs in AWS. Currently, all the containers in AZ 1 are unable to access the EFS volume, while the containers running in AZ 2 can access the volume normally. On an EC2 instance with 8 vCPUs, 4 of the cores are 100% utilized all the time, mostly in iowait. Executing df -h hangs.
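
If it helps anyone triage, one way to confirm which mount df is blocking on, as a generic sketch that assumes the hang is in the statfs call (as the traces above suggest):

# Trace only the statfs calls made by df; the output stops at the hung EFS mount point
strace -f -e trace=statfs df -h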

In /var/log/messages a large number of errors are reported, as follows:

Mar 5 03:33:31 k8snode01 kernel: nfs: server 127.0.0.1 not responding, timed out
Mar 5 03:33:35 k8snode01 kernel: nfs: server 127.0.0.1 not responding, timed out
Mar 5 03:33:39 k8snode01 kernel: nfs: server 127.0.0.1 not responding, timed out
Mar 5 03:33:43 k8snode01 kernel: nfs: server 127.0.0.1 not responding, timed out
Mar 5 03:33:43 k8snode01 kernel: nfs: server 127.0.0.1 not responding, timed out

We were able to mount it successfully on all Kubernetes workers with "mount -t nfs <efs_endpoint>:/ /mnt". The endpoints in both AZs mount properly.

shiying2012 avatar Mar 07 '22 12:03 shiying2012

We're seeing the same thing. Idle CPU spike. The first couple of times we panicked and just wiped the stuck nodes. Tonight we were able to fix it by removing the efs-csi-node pod on the stuck node. Does anyone know if we can downgrade to avoid this issue? We're running amazon/aws-efs-csi-driver:v1.3.6 on v1.20.11-eks-f17b81.

I triggered a container restart by executing "kill 1" in the efs-csi-node pods in the affected AZ. The containers with mounted EFS volumes temporarily resume functionality.

shiying2012 avatar Mar 07 '22 12:03 shiying2012

@markmcdowell, can you provide more details about the workflow executed? Can you collect logs using the log_collector script?

RDLemos avatar Mar 15 '22 21:03 RDLemos

Adding a workaround here for anyone who needs it: the following script monitors the stunnel processes running on the worker node and kills them if they are unresponsive. The killed processes are then re-created by the EFS CSI driver. The stunnel processes are used to add TLS encryption to NFS traffic; you can read more about it here.

#!/bin/bash

WAIT_TIME=60 # Time in seconds to wait before polling stunnel processes
TIMEOUT=2    # Time in seconds to wait for response from remote EFS system connected via stunnel

while true; do
    STUNNEL_NFS_MOUNTS=`mount | grep nfs4 | grep 127.0.0.1 | awk '{print $3}'`
    for STUNNEL_MOUNT in $STUNNEL_NFS_MOUNTS; do
        timeout $TIMEOUT stat -t $STUNNEL_MOUNT >/dev/null 2>&1
        if [ $? -eq 124 ]; then
            echo "[`date`] Timeout when connecting to NFS, killing stunnel process"
            NAME=`echo $STUNNEL_MOUNT | sed -r 's:/:.:g'`
            PROCESS=`ps -ef | grep stunnel | grep $NAME`
            PROCESS_NAME=`echo $PROCESS | awk '{ print $8, $9}'`
            PROCESS_PID=`echo $PROCESS | awk '{print $2}'`
            kill -9 $PROCESS_PID
            echo "[`date`] Killed process $PROCESS_PID '$PROCESS_NAME'"
        fi
    done
    sleep $WAIT_TIME
done

The above script can be invoked from a regular crontab entry or as a systemd service (see below). Running it as a daemonset is also possible, but you would need to run the container as root, mount /var/lib/kubelet to see NFS mount points created by the EFS CSI Driver, and specify hostPID: true in the pod spec to view and kill stunnel processes.

[Unit]
Description=Monitor health of stunnel processes created by the EFS CSI driver

[Service]
ExecStart=<SCRIPT_LOCATION>

[Install]
WantedBy=multi-user.target

If you are using EKS managed nodegroups with a launch template specified (see here), you can use EC2 user data to make the above configuration changes as new EKS worker nodes are being provisioned.
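
For example, a minimal user data sketch; the paths, file names and unit name here are illustrative, not from an official template:

#!/bin/bash
# Illustrative user data: install the monitoring script and systemd unit at node provision time
cat <<'EOF' > /opt/stunnel-monitor.sh
# ... paste the monitoring script from above here ...
EOF
chmod +x /opt/stunnel-monitor.sh

cat <<'EOF' > /etc/systemd/system/stunnel-monitor.service
[Unit]
Description=Monitor health of stunnel processes created by the EFS CSI driver

[Service]
ExecStart=/opt/stunnel-monitor.sh

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now stunnel-monitor.service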

If anyone else wants to try this workaround, feel free to do so but please test before deploying to a production environment.

kvifern avatar Mar 19 '22 08:03 kvifern

@kvifern thanks for your workaround, I am testing it in our development environment.

For anyone who wants to run this as a daemonset - https://gist.github.com/0nebody/f55a376f2c0ec3f9968168d271763fca

0nebody avatar Mar 24 '22 23:03 0nebody

Hey folks,

Has anyone tried upgrading their stunnel version on the node to the latest 5.63 to see whether the issue still happens?

The stunnel shipped with efs-utils is pretty old (4.56) for various reasons. We suspect there is something wrong with stunnel itself: the re-establishment of the connection fails silently, and stunnel then cannot connect back to the server, which causes the NFS not responding issue. E.g., from the stunnel debug log:

# Failed when connecting to the mount target
2022.03.17 02:19:58 LOG7[1105:139990955783936]: connect_blocking: s_poll_wait $MT_IP:2049: waiting 10 seconds -> There should be a `TIMEOUTconnect exceeded` failure if the connection failed, but there is no info about how the waiting went
2022.03.17 02:56:09 LOG7[1105:139990955789440]: Dispatching signals from the signal pipe

# Failed during handshake
2022.03.23 16:10:49 LOG6[19003:140551554406144]: CERT: Host name "$FS_ID.efs.us-east-1.amazonaws.com" matched with "*.efs.us-east-1.amazonaws.com" -> Again, stunnel stopped working in the middle of the connection re-establishment
2022.03.23 16:56:09 LOG7[19003:140551554411648]: Dispatching signals from the signal pipe
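
If you want to capture traces like the ones above yourself, stunnel debug logging can be switched on in efs-utils. This is a sketch, with the option name taken from the efs-utils.conf shipped with the package (for the CSI driver the same file lives inside the efs-csi-node container; logs normally land under /var/log/amazon/efs):

# Enable stunnel debug logging for TLS mounts; it applies to mounts created after the change
sudo sed -i 's/^stunnel_debug_enabled = false/stunnel_debug_enabled = true/' /etc/amazon/efs/efs-utils.conf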

Cappuccinuo avatar Mar 25 '22 05:03 Cappuccinuo

Hi @0nebody, nice job with the workaround daemonset. Just be aware that you may not need to mount /proc since "hostPID" is already set to true. This means the containers within the pod use the host PID namespace and are already able to see and interact with processes in the host namespace.

If you need to test the effect of the script, you can temporarily update your EFS security group to remove/block access on port 2049. This would cause the NFS timeouts to eventually appear in kernel logs which would then trigger the script to execute.
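
For example, a hedged sketch with the AWS CLI, where $SG_ID is the mount target's security group and $CLIENT_SG is the source currently allowed on port 2049 (adjust to however your rule is actually defined):

# Temporarily block NFS traffic to the mount target to simulate an EFS outage
aws ec2 revoke-security-group-ingress --group-id "$SG_ID" --protocol tcp --port 2049 --source-group "$CLIENT_SG"

# ...wait for "nfs: server 127.0.0.1 not responding" to show up in dmesg, then restore access
aws ec2 authorize-security-group-ingress --group-id "$SG_ID" --protocol tcp --port 2049 --source-group "$CLIENT_SG"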

Hi @Cappuccinuo, stunnel is not included/installed in the EKS optimised AMI/worker nodes; instead it is installed within the EFS CSI Driver image (see here), which runs as a Daemonset in an EKS cluster. This docker image seems to clone the efs-utils GitHub repo and install the amazon-efs-utils package as per this spec file, which specifies stunnel version 4.56. Upgrading it is therefore a bit challenging; it looks like you have to build your own EFS CSI driver image manually to update stunnel. Is it possible for a docker image version/tag of the EFS CSI driver to be made available with stunnel upgraded to 5.63?

kvifern avatar Mar 26 '22 06:03 kvifern

Thanks @kvifern ,

Yeah, haven't figured out that part of the stunnel upgrade yet. Since stunnel is a dependency of efs-utils, the stunnel executable still has to be installed on the node itself; the upgrade logic can go in the Dockerfile.
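
For what it's worth, a rough sketch of the kind of step that could go into a custom driver image to get a newer stunnel onto the PATH ahead of the packaged 4.56; the version and download URL are shown for illustration only:

# Build stunnel 5.63 from source and install it under /usr/local (ahead of /usr/bin on PATH)
curl -sSLO https://www.stunnel.org/downloads/stunnel-5.63.tar.gz
tar xzf stunnel-5.63.tar.gz && cd stunnel-5.63
./configure --prefix=/usr/local && make && make install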


For the workaround script, we should use the df command instead, as the stat command takes advantage of the kernel cache; it may not detect the unhealthy stunnel in time because the result is served from the page cache.

Posting here a generally usable script for both EC2 and the CSI driver. You need to install the jq package first with sudo yum -y install jq.

This script reads the stunnel process id from the state file of the mount.

# stunnel_health_check.sh

#!/bin/bash

# Bail out early if jq is not available, since the PID lookup below depends on it
if ! jq --version >/dev/null 2>&1; then
    echo "Please install jq package first"
    exit 1
fi

WAIT_TIME=60 # Time in seconds to wait before polling stunnel processes
TIMEOUT=2    # Time in seconds to wait for response from remote EFS system connected via stunnel

echo "Running stunnel health check script"
while true; do
    STUNNEL_NFS_MOUNTS=`mount | grep nfs4 | grep 127.0.0.1 | awk '{print $3}'`
    for STUNNEL_MOUNT in $STUNNEL_NFS_MOUNTS; do
        echo "Checking stunnel health for $STUNNEL_MOUNT"
        timeout $TIMEOUT df $STUNNEL_MOUNT >/dev/null 2>&1
        if [ $? -eq 124 ]; then
            echo "[`date`] Timeout when connecting to NFS, killing stunnel process"
            PORT=`mount | grep nfs4 | grep 127.0.0.1 | grep $STUNNEL_MOUNT | grep -o 'port=[0-9]\+' | cut -d "=" -f 2-`
            STATE_FILE=`ls -1 /var/run/efs | grep -v stunnel-config.* | grep $PORT$`
            PROCESS_PID=`cat /var/run/efs/$STATE_FILE | jq '.pid'`
            kill -9 $PROCESS_PID
            echo "[`date`] Killed stunnel process pid: $PROCESS_PID mount: $STUNNEL_MOUNT port: $PORT"
        else
            echo "Stunnel health OK for $STUNNEL_MOUNT"
        fi
    done
    sleep $WAIT_TIME
done

Record the script location and create the following file in /lib/systemd/system; remember to replace /home/ec2-user/stunnel_health_check.sh with the location on your own instance.

# amazon-efs-tls-mount-stunnel-health-monitor.service
[Unit]
Description=Monitor health of stunnel processes created by the tls mounts using efs-utils
Requires=amazon-efs-mount-watchdog.service

[Service]
ExecStart=/bin/bash /home/ec2-user/stunnel_health_check.sh 

[Install]
WantedBy=multi-user.target

Then run the following command to enable the health monitor service.

sudo systemctl enable amazon-efs-tls-mount-stunnel-health-monitor.service
sudo systemctl start amazon-efs-tls-mount-stunnel-health-monitor.service

You can check the stunnel monitor service by running the following command

sudo systemctl status amazon-efs-tls-mount-stunnel-health-monitor.service

To stop the service

sudo systemctl stop amazon-efs-tls-mount-stunnel-health-monitor.service
sudo systemctl disable amazon-efs-tls-mount-stunnel-health-monitor.service

To check the logs

journalctl -u amazon-efs-tls-mount-stunnel-health-monitor.service

Cappuccinuo avatar Mar 28 '22 14:03 Cappuccinuo

Hey folks,

You can try the latest driver to see if the issue is fixed. Thanks.

kubectl kustomize "github.com/kubernetes-sigs/aws-efs-csi-driver/deploy/kubernetes/overlays/stable/?ref=release-1.3" > driver.yaml

kubectl apply -f driver.yaml

Cappuccinuo avatar Apr 15 '22 01:04 Cappuccinuo

Hi @Cappuccinuo, that would be great. ~~But I cannot see any closed PR that tackles this issue here. What change might prevent the hanging mounts?~~ Just found the connected efs-utils v1.32.1 in the release notes. We will try the new fix.

ewuknadierf avatar Apr 19 '22 06:04 ewuknadierf

@Cappuccinuo it seems to me that the EFS Utils stunnel watcher runs on a 5-minute interval.

Are we able to speed it up on our end? We were accustomed to the 60-second interval with the fix above.

msolberg8 avatar Apr 26 '22 13:04 msolberg8

@msolberg8 Yes, this can be changed in the efs-utils config file on the instance (/etc/amazon/efs/efs-utils.conf): https://github.com/aws/efs-utils/blob/master/dist/efs-utils.conf#L67-L70
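
For example, something along these lines (the option name is taken from the linked efs-utils.conf, so verify against the version you are running; on plain EC2 the watchdog is a systemd service, while for the CSI driver it runs inside the efs-csi-node container):

# Shorten the stunnel health check interval from the default 5 minutes to 1 minute
sudo sed -i 's/^stunnel_health_check_interval_min.*/stunnel_health_check_interval_min = 1/' /etc/amazon/efs/efs-utils.conf
# Restart the watchdog so the new interval takes effect
sudo systemctl restart amazon-efs-mount-watchdog.service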

Cappuccinuo avatar Apr 26 '22 13:04 Cappuccinuo

@Cappuccinuo thanks!!

Any idea why now all of a sudden stunnel needs a healthcheck? While we're very thankful for the update, this seems like a bandaid rather than a root cause fix.

Also, are there plans to update the helm chart to support easier configuration of the stunnel healthcheck?

msolberg8 avatar Apr 28 '22 17:04 msolberg8

@msolberg8

As far as I can tell, the culprit is the stunnel software itself: the process gets stuck and becomes a zombie process after a certain period, and the only way to fix this is to terminate it and let the watchdog launch a new one. Unless people who try stunnel 5.63 find that the same issue happens, so we can report it to the stunnel side and engage them for a root cause, they won't look into an old version. I tried 5.63 and haven't encountered any hanging issue so far.

For the helm chart one, what's the experience when you need to modify the configuration? I bet you need to modify the efs-utils.conf file (/etc/amazon/efs/efs-utils.conf) on the node instance, right?

Cappuccinuo avatar Apr 29 '22 01:04 Cappuccinuo

+1 to setting this with the helm chart. Updating all of the node groups with custom launch templates is quite a chore just to configure this one setting.

artificial-aidan avatar May 03 '22 16:05 artificial-aidan

@Cappuccinuo is there any plan in place to upgrade the stunnel in the official version to 5.63?

From what I understand stunnel is pulled in when aws-efs-utils is installed. Would it be worth making an issue over there?

And to answer your question above: yes, we would need to set that up at the node level, which we'd rather not do. We try to keep the nodes as plain-jane as we can.

msolberg8 avatar May 31 '22 19:05 msolberg8

Yeah, we are working on this on the efs-utils side; you can track https://github.com/aws/efs-utils/issues/99. We need to coordinate with the repository side to publish stunnel 5.x.

Cappuccinuo avatar Jun 01 '22 01:06 Cappuccinuo

Subscribing, as we have a similar issue, but killing stunnel still doesn't fix it for us.

Jun 7 01:00:38 [ip] kernel: nfs: server 127.0.0.1 not responding, timed out

Is there anything else to check/restart? (vs killing the node; I'd like to understand how it gets into this state and find a workaround to unblock it)

ldemailly avatar Jun 07 '22 01:06 ldemailly

Edit: killing these 2 processes (ps, btw, was hanging too!)

root     18314  0.0  0.0 717228 13624 ?        Ssl  May25   3:21 /bin/aws-efs-csi-driver --endpoint=unix:///var/lib/csi/sockets/pluginproxy/csi.sock --logtostderr --v=2 --delete-access-point-root-dir=false --vol-metrics-opt-in=false
root     20172  0.0  0.1 717228 16288 ?        Ssl  May25  13:41 /bin/aws-efs-csi-driver --endpoint=unix:/csi/csi.sock --logtostderr --v=2

and back to

Jun  7 01:04:16 [ip] kernel: nfs: server 127.0.0.1 OK

ldemailly avatar Jun 07 '22 01:06 ldemailly

In our case the NFS issue was triggered by a pod overloading the node. We resolved it by setting resource limits on the problematic deployment.
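
In case it helps someone else, a sketch of the kind of limits we mean; names and values are placeholders:

# Cap the noisy workload so it cannot starve the node of CPU and memory
kubectl -n <namespace> set resources deployment/<problematic-deployment> --requests=cpu=250m,memory=256Mi --limits=cpu=1,memory=1Gi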

yevgeny-kushnir avatar Jun 08 '22 06:06 yevgeny-kushnir

I'm facing exactly the same issue as the one described in https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues/616#issuecomment-1022959423

k8s version: v1.21.12-eks-a64ea69, aws-efs-csi-driver: v1.3.3

Killing the stunnel process, as mentioned in https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues/616#issuecomment-1022959423 and https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues/616#issuecomment-1072965716, didn't help me. I also tried https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues/616#issuecomment-1148085899 without any success.

My observation:

  • This issue was happening on nodes with pods using a very particular efs volume, x.
  • I deleted the PV and PVC followed by recreating it.
  • After force killing the pod, I saw the following in the kubelet logs:
26930 kubelet_volumes.go:225] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"2174a688-3672-4d6e-a775-58627b61d629\" found, but error not a directory occurred when trying to remove the volumes dir" numErrs=3   
26930 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/efs.csi.aws.com^fs-68ece2b9::fsap-078492ba97b679d61 podName:3b06e0d6-63c8-45fc-bcde-6906cdd0d395 nodeName:}" failed. No retries permitted until 2022-06-08 07:06:03.158716657 +0000 UTC m=+5949202.959554767 (durationBeforeRetry 2m2s). Error: "UnmountVolume.TearDown failed for volume \"x\" (UniqueName: \"kubernetes.io/csi/efs.csi.aws.com^fs-68ece2b9::fsap-078492ba97b679d61\") pod \"3b06e0d6-63c8-45fc-bcde-6906cdd0d395\" (UID: \"3b06e0d6-63c8-45fc-bcde-6906cdd0d395\") : kubernetes.io/csi: mounter.TearDownAt failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
  • For the buggy mount path for "x", umount failed and umount -f showed the device was busy, so I did umount -l. df started working after that, but we were still seeing "nfs: server 127.0.0.1 not responding, timed out". New NFS mounts and umounts still work, though.

  • To make matters worse, existing mounts stopped working.

sedflix avatar Jun 08 '22 06:06 sedflix

Faced the exact same issue; we were using the 1.3.6 tag.

Observations

  • Node load average keeps increasing (on some nodes with 16 cores it went as high as 3k)
  • We can observe nfs: server 127.0.0.1 not responding, timed out
  • Our main app process gets completely stuck trying to access the disk.
  • df or ls from the problematic mounted directory hangs

For us, the stunnel process went defunct; we had to tweak the workaround script to kill all defunct stunnel processes in a loop.
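
For reference, a rough sketch of the kind of check involved (detection only; note that a defunct/zombie process cannot be killed directly, its parent has to reap it or be restarted):

# List defunct (zombie) stunnel entries together with their parent PIDs
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/ && $4 ~ /stunnel/ {print "zombie:", $1, "parent:", $2}'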

Fix: We got to know that the 1.3.7 tag has the fix for this. They implemented a watchdog that restarts the hanging process. The fix was done in efs-utils v1.32.1 and is part of the 1.3.7 release tag changelog.

Thanks @markmcdowell for starting this thread, it is very useful.

madhanshanmugam avatar Jun 24 '22 10:06 madhanshanmugam

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Sep 22 '22 10:09 k8s-triage-robot

/remove-lifecycle stale

Still waiting for https://github.com/aws/efs-utils/issues/99 to be closed, then a new version of the driver should be available.

msolberg8 avatar Sep 27 '22 12:09 msolberg8