aws-efs-csi-driver
mounts hang when efs-csi-node pods are restarted because of empty privateKey.pem
/kind bug
What happened?
The same issue as #178 and #569, still not solved.
After the EFS CSI driver container is replaced (e.g. by terminating the driver process or upgrading the driver to a new image), all existing mounts on that node hang for 1 hour:
Warning FailedMount 22m (x62 over 4h59m) kubelet Unable to attach or mount volumes: unmounted volumes=efs-data
Warning FailedMount 6m54s (x81 over 5h2m) kubelet MountVolume.SetUp failed for volume "xxx-efs" : kubernetes.io/csi: mounter.SetUpAt failed to check for STAGE_UNSTAGE_VOLUME capability: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock: connect: connection refused
mount failed: exit status 1
Mounting command: mount
Mounting arguments: -t efs -o accesspoint=fsap-xxxx,tls,noatime fs-xxxxx:/ /var/lib/kubelet/pods/xxxxx/volumes/kubernetes.io~csi/xxxxx/mount
Failed to create certificate signing request (csr), error is: b'unable to load Private Key\xxxx:error:xxx routines:PEM_read_bio:no start line:pem_lib.c:707:Expecting: ANY PRIVATE KEY\
Reason
The privateKey.pem that is persisted on the node happens to become an empty file, but the check does not detect the empty file and therefore never recreates the key. Hence the node stays stale for 1 hour until the cert is purged.
After an efs-csi-node pod restart, note the empty privateKey.pem file when the above errors are logged:
/ # ls -la /host/var/amazon/efs/
-rw-r--r-- 1 root root 2707 Apr 26 07:08 efs-utils.conf
-rw-r--r-- 1 root root 4789 Apr 26 01:16 efs-utils.crt
-rw-r--r-- 1 root root 0 Apr 26 01:17 privateKey.pem
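A plain-shell illustration of the apparent detection gap (run inside the driver container, same path as the listing above): an existence-only check passes for the zero-byte key, whereas a non-empty check (test -s, as used in the workaround below) catches it and would allow regeneration.
/ # test -f /host/var/amazon/efs/privateKey.pem && echo exists
exists
/ # test -s /host/var/amazon/efs/privateKey.pem || echo empty
empty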
Workaround: delete the privateKey.pem and restart the pod:
# ls -la /host/var/amazon/efs/
-rw-r--r-- 1 root root 2707 Apr 26 08:46 efs-utils.conf
-rw-r--r-- 1 root root 4789 Apr 26 01:16 efs-utils.crt
-r-------- 1 root root 2484 Apr 26 08:47 privateKey.pem
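For reference, the manual workaround as commands; this is a sketch that assumes the driver DaemonSet keeps the default app=efs-csi-node label in kube-system and that the affected node name is known:
/ # rm -f /host/var/amazon/efs/privateKey.pem
$ kubectl -n kube-system delete pod -l app=efs-csi-node --field-selector spec.nodeName=<node-name>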
What you expected to happen?
Nodes to stay healthy.
How to reproduce it (as minimally and precisely as possible)?
same as #178
Anything else we need to know?:
Environment
- Kubernetes version (use kubectl version): 1.21
- Driver version: 1.3.7
/cc @dimitriosstander
@wongma7 provided a potential solution: https://github.com/aws/efs-utils/pull/130
Workaround for the time being: add an initContainer that deletes the invalid (empty) key:
initContainers:
  - name: purge-invalid-key
    image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/aws-efs-csi-driver:v1.3.7
    command: ["/bin/sh"]
    args: ["-c", "test -s /var/amazon/efs/privateKey.pem || rm -f /var/amazon/efs/privateKey.pem"]
    securityContext:
      privileged: true
    volumeMounts:
      - name: efs-utils-config
        mountPath: /var/amazon/efs
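If editing the DaemonSet manifest directly is not an option, the same initContainer can also be applied as a strategic merge patch; this is only a sketch and assumes the chart's default DaemonSet name (efs-csi-node) and namespace (kube-system):
kubectl -n kube-system patch daemonset efs-csi-node --patch '
spec:
  template:
    spec:
      initContainers:
        - name: purge-invalid-key
          image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/aws-efs-csi-driver:v1.3.7
          command: ["/bin/sh"]
          args: ["-c", "test -s /var/amazon/efs/privateKey.pem || rm -f /var/amazon/efs/privateKey.pem"]
          securityContext:
            privileged: true
          volumeMounts:
            - name: efs-utils-config
              mountPath: /var/amazon/efs
'
Note that if the DaemonSet is managed by Helm, a later chart upgrade will overwrite this patch.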
Somehow we encountered the bug today with driver version 1.3.3, after a rollout where we wanted to restart the driver across the whole cluster.
At first it looked like a mount issue (unbound immediate PersistentVolumeClaims). The socket returning "connection refused" made us think of a network issue, but our security groups for the EFS access points were OK. Only after a while did the private key message appear, and it turned out to be the issue mentioned here.
I suppose the problem is that the driver wants to start stunnel (because we enforce TLS for the NFS calls to EFS), and for that it needs to create a certificate, hence the CSR. As the private key is corrupted (empty), this fails.
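The error in the logs can be reproduced outside the driver with any empty key file; this is only an illustration with a throwaway path, not code from efs-utils, but it produces the same PEM_read_bio error seen in the mount output:
/ # touch /tmp/empty.pem
/ # openssl req -new -key /tmp/empty.pem -subj /CN=test
unable to load Private Key
<pid>:error:0906D06C:PEM routines:PEM_read_bio:no start line:pem_lib.c:707:Expecting: ANY PRIVATE KEY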
Thanks a lot @universam1 for the analysis and the workaround!
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Kindly asking for a review, @wongma7.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Would be nice if this could be picked up since we are also facing this...
Same here, this issue is plaguing one of our AWS EKS clusters. Could you please fix? Thank you
According to efs-utils #130, this has been fixed in v1.4.9 of aws-efs-csi-driver.
This issue happens on 602401143452.dkr.ecr.ap-northeast-2.amazonaws.com/eks/aws-efs-csi-driver:v1.4.9
Warning FailedMount 2m44s kubelet MountVolume.SetUp failed for volume "pvc-d830f7d0-7dbf-4015-8219-85caf3eaa742" : rpc error: code = Internal desc = Could not mount "fs-0750ae3c64b8ddba0:/" at "/var/lib/kubelet/pods/bdc6f9a1-2f46-419f-9c52-b4e638ee695f/volumes/kubernetes.io~csi/pvc-d830f7d0-7dbf-4015-8219-85caf3eaa742/mount": mount failed: exit status 1
Mounting command: mount
Mounting arguments: -t efs -o accesspoint=fsap-0b55f538a16fe95ff,tls fs-0750ae3c64b8ddba0:/ /var/lib/kubelet/pods/bdc6f9a1-2f46-419f-9c52-b4e638ee695f/volumes/kubernetes.io~csi/pvc-d830f7d0-7dbf-4015-8219-85caf3eaa742/mount
Output: Failed to create certificate signing request (csr), error is: b'unable to load Private Key\n139675987441568:error:0906D06C:PEM routines:PEM_read_bio:no start line:pem_lib.c:707:Expecting: ANY PRIVATE KEY\n'
Hi @vumdao, thanks for bringing this up here. If possible, can you share the troubleshooting logs as described in https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/troubleshooting?
/kind bug
Same here, it seems that sometimes the privateKey.pem is empty.
The workaround from @universam1 works, but I had to switch to another image (alpine) because the rm command is not present in the efs-csi-node image itself (a sketch of that variant follows below).
That said, it's currently not possible to add this workaround directly through the chart (no initContainers here: https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/charts/aws-efs-csi-driver/templates/node-daemonset.yaml). @mskanth972, would it be possible to update the chart so we can set initContainers in the daemonset?
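For reference, a minimal sketch of that alpine-based variant (the image tag is arbitrary; path and volume name match the workaround above):
initContainers:
  - name: purge-invalid-key
    image: alpine:3.18
    command: ["/bin/sh"]
    args: ["-c", "test -s /var/amazon/efs/privateKey.pem || rm -f /var/amazon/efs/privateKey.pem"]
    securityContext:
      privileged: true
    volumeMounts:
      - name: efs-utils-config
        mountPath: /var/amazon/efs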
I think this PR to efs-utils should fix this issue https://github.com/aws/efs-utils/pull/174/files. We are still facing the issue even with the initContainer.
v1.4.9 didn't fix it completely because the fix was only applied to the watchdog folder, and the same code is duplicated in the mount_efs one
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale