aws-efs-csi-driver
Liveness probe for Pods using EFS volume mounts fails after upgrade/downgrade of EFS version
I have an EKS cluster (1.27) with the EFS CSI driver 1.5.6 running. When I install Prometheus on it, the pods come up and the volume mounts succeed.
When I upgrade the EFS CSI driver to version 1.6.0, the prometheus-operator StatefulSet goes into a NotReady state.
I can see this in the kubectl events:
Liveness probe failed: Get "http://10.0.3.219:9090/-/healthy": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
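For context, a minimal sketch of how such probe failures can be surfaced; the monitoring namespace and pod name below are placeholders, not taken from the report.

```sh
# Show recent events in the namespace running Prometheus (namespace is assumed).
kubectl get events -n monitoring --sort-by=.lastTimestamp

# Inspect the probe configuration and recent restarts on the affected pod
# (pod name is a placeholder).
kubectl describe pod <prometheus-pod-name> -n monitoring
```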
Similar behaviour is seen with a WordPress app. This happens only in upgrade or downgrade scenarios. I have checked other CSI drivers such as EBS and vSphere, and the issue is not seen there, so it is an EFS issue.
What you expected to happen? The liveness probe should not fail because of an upgrade or downgrade of the EFS CSI driver.
How to reproduce it (as minimally and precisely as possible)?
- Create an EKS cluster with the EFS CSI driver 1.5.6
- Install Prometheus on top of it.
- Upgrade the EFS CSI driver to 1.6.0 (a command sketch follows this list).
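A minimal command sketch of these steps, assuming the driver is installed from the official Helm chart (the reporter confirms a Helm install later in the thread). The chart version that ships a given driver release differs from the driver version itself, so the versions below are placeholders.

```sh
# Add the official chart repository for the EFS CSI driver.
helm repo add aws-efs-csi-driver https://kubernetes-sigs.github.io/aws-efs-csi-driver/
helm repo update

# Install the chart release that ships driver v1.5.6 (chart version is a placeholder).
helm install aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
  --namespace kube-system --version <chart-version-shipping-1.5.6>

# Deploy Prometheus with persistent volumes backed by an EFS StorageClass, then
# upgrade the driver to the chart that ships v1.6.0 and watch the workload pods.
helm upgrade aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
  --namespace kube-system --version <chart-version-shipping-1.6.0>
kubectl get pods -A -w | grep -i prometheus
```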
Environment
- Kubernetes version
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.2", GitCommit:"7f6f68fdabc4df88cfea2dcf9a19b2b830f1e647", GitTreeState:"clean", BuildDate:"2023-05-17T14:20:07Z", GoVersion:"go1.20.4", Compiler:"gc", Platform:"darwin/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27+", GitVersion:"v1.27.4-eks-2d98532", GitCommit:"3d90c097c72493c2f1a9dd641e4a22d24d15be68", GitTreeState:"clean", BuildDate:"2023-07-28T16:51:44Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}
- Driver version: 1.6.0
@roshanirathi How did you install the driver? Also, if you can, could you please provide DEBUG-level logs? https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/troubleshooting/README.md
@roshanirathi Do you see error logs similar to those in this hostNetwork issue [link]?
I use the Helm chart for driver installation. The efs-utils logs are similar, and the issue is the same: mounts don't work after the upgrade.
The debug logs are present here - https://drive.google.com/drive/folders/1jBmNqdV4UEGMRbm7IFeU13GFLZdpjYdd?usp=sharing
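For anyone following along, a sketch of how the driver logs can be pulled with kubectl; the kube-system namespace, the app=efs-csi-node / app=efs-csi-controller labels, and the efs-plugin container name follow the chart defaults and are assumptions here.

```sh
# Node DaemonSet logs: this component performs the actual EFS mounts on each node.
kubectl logs -n kube-system -l app=efs-csi-node -c efs-plugin --tail=-1

# Controller logs: this component handles dynamic provisioning.
kubectl logs -n kube-system -l app=efs-csi-controller -c efs-plugin --tail=-1
```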
The issue is fixed in version 1.7.0. Closing this.
/reopen
@roshanirathi: Reopened this issue.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I am still seeing this issue when I have EBS as the CSI layer and EFS as an add-on CSI driver. The volume mounts use the EFS StorageClass, and the same issue is seen when I upgrade from 1.5.6 to 1.7.0.
Hi @roshanirathi, to ensure that we have the most accurate information, could you please provide new debug-level logs?
Added the logs here - https://drive.google.com/drive/folders/1jBmNqdV4UEGMRbm7IFeU13GFLZdpjYdd?usp=sharing
@roshanirathi From what I understand, you did the following:
- Created an EKS cluster with the EFS CSI driver 1.5.6 via the add-on
- Installed Prometheus on top of it.
- Upgraded the EFS CSI driver to 1.7.0 via the add-on
- Mounted a volume using the EFS storage class
I don't fully understand where EBS factors into your setup. Do you have both EFS & EBS CSI drivers running?
Could you please provide a list of instructions to replicate the new issue?
- Create an EKS cluster with the EBS CSI driver.
- Deploy the EFS CSI driver 1.5.6 on it.
- Deploy a Prometheus or WordPress app on it.
- Once volumes are mounted, upgrade the EFS CSI driver to 1.7.0.
- Once the new EFS driver pods are up, the pods using those volume mounts start failing (a diagnostic sketch follows this list).
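A hedged diagnostic sketch for the last step, checking from the node (or a privileged debug pod) whether existing EFS mounts went stale when the upgraded DaemonSet pods came up; the paths and the suspected failure mode are assumptions, not confirmed in this thread.

```sh
# List NFSv4 mounts on the node; EFS volumes mounted by the CSI driver live
# under the kubelet pods directory.
findmnt -t nfs4

# A hung stat, "Stale file handle", or I/O error on an existing mount path
# suggests the mount lost its backing connection when the driver pod was
# replaced (pod UID and PV name are placeholders).
stat /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount
```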
Yes, I have both the EBS and EFS CSI drivers on an EKS cluster.
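For completeness, a quick way to confirm that both drivers are registered and their pods are running (namespace assumed to be kube-system):

```sh
# Both drivers should appear in the CSIDriver list as ebs.csi.aws.com and efs.csi.aws.com.
kubectl get csidrivers

# Driver pods for both stacks.
kubectl get pods -n kube-system | grep -E 'efs-csi|ebs-csi'
```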
Any update on this?
We have experienced this same issue when upgrading to 1.7.0 via Helm.
Existing pods using EFS all started failing to access their EFS volumes, and new pods coming up were unable to mount EFS volumes with "Unable to attach or mount volumes: unmounted volumes=[instance-cache], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition" until the node was terminated and relaunched.
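A sketch of the mitigation described above, recycling the affected node so the new driver pods establish fresh mounts; how the replacement node is provisioned depends on the node group setup and is an assumption here.

```sh
# Cordon and drain the affected node so workloads are rescheduled elsewhere.
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Terminate the underlying EC2 instance (console, managed node group, or Auto
# Scaling group) and wait for the replacement node to join the cluster; pods
# scheduled onto it mount EFS through the upgraded driver from the start.
```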
@seanzatzdev-amazon Any update on this? I am seeing the same issue with version 1.7.1 as well.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.