
mounts hang when efs-csi-node pods are restarted because of empty privateKey.pem

Open universam1 opened this issue 2 years ago • 20 comments

/kind bug

What happened?

The same issue as #178 and #569, still not solved.

After the EFS CSI driver container is replaced (e.g. by terminating the driver process or upgrading the driver to a new image), all existing mounts on that node hang for 1 hour:

Warning  FailedMount  22m (x62 over 4h59m)   kubelet  Unable to attach or mount volumes: unmounted volumes=efs-data

Warning  FailedMount  6m54s (x81 over 5h2m)  kubelet  MountVolume.SetUp failed for volume "xxx-efs" : kubernetes.io/csi: mounter.SetUpAt failed to check for STAGE_UNSTAGE_VOLUME capability: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock: connect: connection refused
mount failed: exit status 1

Mounting command: mount
Mounting arguments: -t efs -o accesspoint=fsap-xxxx,tls,noatime fs-xxxxx:/ /var/lib/kubelet/pods/xxxxx/volumes/kubernetes.io~csi/xxxxx/mount
Failed to create certificate signing request (csr), error is: b'unable to load Private Key\xxxx:error:xxx routines:PEM_read_bio:no start line:pem_lib.c:707:Expecting: ANY PRIVATE KEY\

Reason

The privateKey.pem persisted on the node can end up as an empty file, but the check cannot detect this and therefore never recreates the key. As a result, the node stays stale for 1 hour until the certificate is purged.

After an efs-csi-node pod restart, note the empty privateKey.pem file while the errors above are being logged:

/ # ls -la /host/var/amazon/efs/
-rw-r--r--    1 root     root          2707 Apr 26 07:08 efs-utils.conf
-rw-r--r--    1 root     root          4789 Apr 26 01:16 efs-utils.crt
-rw-r--r--    1 root     root             0 Apr 26 01:17 privateKey.pem
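The distinction can be demonstrated with a short shell sketch (hypothetical; it mirrors the check the driver would need, not the driver's actual code): an existence check (`test -e`) passes for a zero-byte file, while a non-empty check (`test -s`) correctly fails.

```shell
# A zero-byte file stands in for the corrupted privateKey.pem.
KEY=$(mktemp)
: > "$KEY"

test -e "$KEY" && echo "existence check passes: file looks fine"
test -s "$KEY" || echo "size check fails: key is empty and must be recreated"

rm -f "$KEY"
```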

Workaround: delete privateKey.pem and restart the pod; a valid key is then regenerated:

# ls -la /host/var/amazon/efs/
-rw-r--r--    1 root     root          2707 Apr 26 08:46 efs-utils.conf
-rw-r--r--    1 root     root          4789 Apr 26 01:16 efs-utils.crt
-r--------    1 root     root          2484 Apr 26 08:47 privateKey.pem

What did you expect to happen? Nodes to stay healthy.

How to reproduce it (as minimally and precisely as possible)?

Same as #178.

Anything else we need to know?:

Environment

  • Kubernetes version (use kubectl version): 1.21
  • Driver version: 1.3.7

universam1 avatar Apr 26 '22 12:04 universam1

/cc @dimitriosstander

universam1 avatar Apr 26 '22 13:04 universam1

@wongma7 provided a potential solution: https://github.com/aws/efs-utils/pull/130

universam1 avatar Apr 26 '22 13:04 universam1

Workaround for the time being: add an initContainer that deletes the invalid (empty) key:

      initContainers:
      - command: ["/bin/sh"]
        args: ["-c", "test -s /var/amazon/efs/privateKey.pem || rm -f /var/amazon/efs/privateKey.pem"]
        image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/aws-efs-csi-driver:v1.3.7
        name: purge-invalid-key
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /var/amazon/efs
          name: efs-utils-config
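The `test -s … || rm -f …` one-liner above can be sanity-checked locally before deploying it (a quick sketch with temp files standing in for privateKey.pem): it removes a zero-byte key and leaves a non-empty one untouched.

```shell
# Same logic as the initContainer args above.
purge_if_empty() {
  test -s "$1" || rm -f "$1"
}

KEY=$(mktemp)                                # empty file: should be removed
purge_if_empty "$KEY"
test ! -e "$KEY" && echo "empty key removed"

KEY=$(mktemp)                                # non-empty file: should survive
echo "-----BEGIN PRIVATE KEY-----" > "$KEY"
purge_if_empty "$KEY"
test -s "$KEY" && echo "valid key kept"
rm -f "$KEY"
```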

universam1 avatar Apr 26 '22 14:04 universam1

We encountered the bug today with driver version 1.3.3, after a rollout in which we restarted the driver across the whole cluster.

At first it looked like a mount issue (unbound immediate PersistentVolumeClaims). The socket returning "connection refused" made us think of a network issue, but our security groups for the EFS access points were fine. Only after a while did the private-key message appear, and it turned out to be the issue mentioned here.

I suppose the issue is that the driver wants to start an stunnel process (because we enforce TLS for the NFS calls to EFS), and for that it needs to create a certificate (and thus craft a CSR). Since the private key is corrupted (empty), this fails.
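This failure mode is easy to reproduce outside the driver (a minimal sketch; the exact openssl invocation efs-utils uses may differ): asking openssl to build a CSR from an empty key file fails the same way.

```shell
# An empty PEM file reproduces the "unable to load Private Key" failure
# from the mount logs (exact wording varies by OpenSSL version).
KEY=$(mktemp)                 # zero bytes, like the corrupted privateKey.pem

if ! openssl req -new -key "$KEY" -subj "/CN=test" -out /dev/null 2>/dev/null; then
  echo "reproduced: CSR creation fails with an empty private key"
fi

rm -f "$KEY"
```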

Thanks a lot @universam1 for the analysis and the workaround!

NBardelot avatar Apr 27 '22 15:04 NBardelot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 26 '22 15:07 k8s-triage-robot

/remove-lifecycle stale

Kindly asking for a review @wongma7

universam1 avatar Jul 26 '22 19:07 universam1

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 24 '22 19:10 k8s-triage-robot

/remove-lifecycle stale

universam1 avatar Oct 25 '22 05:10 universam1

It would be nice if this could be picked up, since we are also facing it.

stijnbrouwers avatar Dec 13 '22 08:12 stijnbrouwers

Same here, this issue is plaguing one of our AWS EKS clusters. Could you please fix? Thank you

log2 avatar Dec 27 '22 12:12 log2

According to efs-utils #130, this has been fixed in v1.4.9 of aws-efs-csi-driver.

bentatham avatar Jan 10 '23 17:01 bentatham

This issue happens on 602401143452.dkr.ecr.ap-northeast-2.amazonaws.com/eks/aws-efs-csi-driver:v1.4.9

  Warning  FailedMount  2m44s  kubelet            MountVolume.SetUp failed for volume "pvc-d830f7d0-7dbf-4015-8219-85caf3eaa742" : rpc error: code = Internal desc = Could not mount "fs-0750ae3c64b8ddba0:/" at "/var/lib/kubelet/pods/bdc6f9a1-2f46-419f-9c52-b4e638ee695f/volumes/kubernetes.io~csi/pvc-d830f7d0-7dbf-4015-8219-85caf3eaa742/mount": mount failed: exit status 1
Mounting command: mount
Mounting arguments: -t efs -o accesspoint=fsap-0b55f538a16fe95ff,tls fs-0750ae3c64b8ddba0:/ /var/lib/kubelet/pods/bdc6f9a1-2f46-419f-9c52-b4e638ee695f/volumes/kubernetes.io~csi/pvc-d830f7d0-7dbf-4015-8219-85caf3eaa742/mount
Output: Failed to create certificate signing request (csr), error is: b'unable to load Private Key\n139675987441568:error:0906D06C:PEM routines:PEM_read_bio:no start line:pem_lib.c:707:Expecting: ANY PRIVATE KEY\n'

vumdao avatar Mar 23 '23 12:03 vumdao

Hi @vumdao, thanks for bringing this up. If possible, can you share the troubleshooting logs described here? https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/troubleshooting

mskanth972 avatar Apr 26 '23 14:04 mskanth972

/kind bug

RyanStan avatar May 15 '23 14:05 RyanStan

Same here; it seems that sometimes the privateKey.pem is empty.

The workaround from @universam1 works, but I had to switch to another image (alpine) because the rm command is not present in the efs-csi-node image itself.

That said, it's currently not possible to add this workaround directly via the chart (there are no initContainers here: https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/charts/aws-efs-csi-driver/templates/node-daemonset.yaml). @mskanth972, would it be possible to update the chart so we can set initContainers in the daemonset?

headyj avatar Jun 13 '23 11:06 headyj

I think this PR to efs-utils should fix the issue: https://github.com/aws/efs-utils/pull/174/files. We are still facing it even with the initContainer.

v1.4.9 didn't fix it completely because the fix was only applied to the watchdog module; the same code is duplicated in the mount_efs one.

otorreno avatar Aug 27 '23 12:08 otorreno

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 27 '24 03:01 k8s-triage-robot

/remove-lifecycle stale

universam1 avatar Jan 28 '24 16:01 universam1

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 27 '24 17:04 k8s-triage-robot

/remove-lifecycle stale

fradee avatar May 14 '24 13:05 fradee