aws-efs-csi-driver

Liveness probe for Pods using EFS volume mounts fails after upgrade/downgrade of EFS version

Open roshanirathi opened this issue 1 year ago • 16 comments

I have an EKS cluster (1.27) running EFS CSI driver 1.5.6. When I install Prometheus on it, the pods come up and the volume mounts succeed. When I upgrade the driver to 1.6.0, the prometheus-operator StatefulSet goes to a NotReady state. The kubectl events show: Liveness probe failed: Get "http://10.0.3.219:9090/-/healthy": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
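For anyone trying to confirm the symptom by hand, here is a minimal sketch of what an HTTP liveness check amounts to: one GET with a short timeout, where a 2xx or 3xx status counts as healthy and a timeout or error counts as a failure. The function name http_probe and the 1-second default are illustrative assumptions, not taken from the report, and this is not the kubelet's own implementation.

```python
import urllib.error
import urllib.request
from typing import Tuple


def http_probe(url: str, timeout_seconds: float = 1.0) -> Tuple[bool, str]:
    """Issue one HTTP GET and report (healthy, detail), roughly like an httpGet probe."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            # Any status from 200 up to (but not including) 400 counts as success.
            healthy = 200 <= response.status < 400
            return healthy, f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # The server answered with 4xx/5xx, so the probe fails.
        return False, f"HTTP {exc.code}"
    except OSError as exc:
        # Timeouts and connection errors land here (URLError is an OSError).
        # A timeout is the case that matches "context deadline exceeded".
        return False, f"no response: {exc}"
```

Run from somewhere that can reach the pod IP, http_probe("http://10.0.3.219:9090/-/healthy") would be expected to return a failure while the application is stuck, if the stall is reproducible outside the kubelet.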

Similar behaviour is seen with a WordPress app. This happens only in an upgrade or downgrade scenario. I have checked other CSI drivers such as EBS and vSphere and did not see this issue, so it appears to be specific to EFS.

What you expected to happen? The liveness probe should not fail because of an upgrade or downgrade of the EFS pack.

How to reproduce it (as minimally and precisely as possible)?

  1. Create an EKS cluster with EFS 1.5.6
  2. Install prometheus on top of it.
  3. Upgrade EFS to 1.6.0
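The steps above could be scripted roughly as follows. This is only a sketch: it assumes a Helm-based install (mentioned later in the thread), a release named aws-efs-csi-driver in kube-system, and that the driver version is switched through the chart's image.tag value. The release name and namespace are hypothetical, and pinning a chart version may be the more usual way to change versions.

```python
import shlex
from typing import List

CHART_REPO = "https://kubernetes-sigs.github.io/aws-efs-csi-driver/"
CHART = "aws-efs-csi-driver"
RELEASE = "aws-efs-csi-driver"  # hypothetical release name
NAMESPACE = "kube-system"       # assumed install namespace


def helm_command(driver_version: str) -> List[str]:
    """Build a `helm upgrade --install` argv that pins the driver image tag."""
    return [
        "helm", "upgrade", "--install", RELEASE, CHART,
        "--repo", CHART_REPO,
        "--namespace", NAMESPACE,
        "--set", f"image.tag=v{driver_version}",  # assumption: version chosen via image tag
    ]


def reproduction_plan(old_version: str, new_version: str) -> List[str]:
    """Return the commands to run, in order, as shell-quoted strings."""
    return [
        shlex.join(helm_command(old_version)),
        "# deploy a workload that mounts an EFS-backed volume (e.g. Prometheus)",
        shlex.join(helm_command(new_version)),
        # Probe failures are recorded as events with reason "Unhealthy".
        "kubectl get events --all-namespaces --field-selector reason=Unhealthy",
    ]


if __name__ == "__main__":
    for line in reproduction_plan("1.5.6", "1.6.0"):
        print(line)
```

The script only prints the commands so they can be reviewed before being run against a real cluster.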

Environment

  • Kubernetes version
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.2", GitCommit:"7f6f68fdabc4df88cfea2dcf9a19b2b830f1e647", GitTreeState:"clean", BuildDate:"2023-05-17T14:20:07Z", GoVersion:"go1.20.4", Compiler:"gc", Platform:"darwin/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27+", GitVersion:"v1.27.4-eks-2d98532", GitCommit:"3d90c097c72493c2f1a9dd641e4a22d24d15be68", GitTreeState:"clean", BuildDate:"2023-07-28T16:51:44Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}
  • Driver version: 1.6.0

roshanirathi avatar Sep 27 '23 08:09 roshanirathi

@roshanirathi How do you install the driver? Also, if you can, could you please provide DEBUG level logs? https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/troubleshooting/README.md

seanzatzdev-amazon avatar Sep 29 '23 13:09 seanzatzdev-amazon

@roshanirathi Do you see error logs similar to those in this hostNetwork issue [link]?

seanzatzdev-amazon avatar Sep 29 '23 18:09 seanzatzdev-amazon

I use the Helm chart to install the driver. The efs-utils logs are similar, and the issue is the same: mounts don't work after the upgrade.

The debug logs are available here: https://drive.google.com/drive/folders/1jBmNqdV4UEGMRbm7IFeU13GFLZdpjYdd?usp=sharing

roshanirathi avatar Oct 09 '23 16:10 roshanirathi

The issue is fixed in version 1.7.0. Closing this.

roshanirathi avatar Oct 10 '23 17:10 roshanirathi

/reopen

roshanirathi avatar Oct 11 '23 11:10 roshanirathi

@roshanirathi: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Oct 11 '23 11:10 k8s-ci-robot

I am still seeing this issue when I have EBS as the CSI layer and EFS as an add-on CSI. The volume mounts use the EFS StorageClass, and the same issue occurs when I upgrade from 1.5.6 to 1.7.0.

roshanirathi avatar Oct 11 '23 11:10 roshanirathi

Hi @roshanirathi, to ensure that we have the most accurate information, could you please provide the new debug-level logs?

seanzatzdev-amazon avatar Oct 11 '23 13:10 seanzatzdev-amazon

Added the logs here - https://drive.google.com/drive/folders/1jBmNqdV4UEGMRbm7IFeU13GFLZdpjYdd?usp=sharing

roshanirathi avatar Oct 11 '23 14:10 roshanirathi

@roshanirathi From what I understand, you did the following:

  • Created an EKS cluster with EFS 1.5.6 via the add-on
  • Installed prometheus on top of it.
  • Upgraded EFS to 1.7.0 via the add-on
  • Mounted a volume using the EFS storage class

I don't fully understand where EBS factors into your setup. Do you have both EFS & EBS CSI drivers running?

Could you please provide a list of instructions to replicate the new issue?

seanzatzdev-amazon avatar Oct 12 '23 14:10 seanzatzdev-amazon

  1. Create an EKS cluster with the EBS CSI driver.
  2. Deploy the EFS 1.5.6 driver on it.
  3. Deploy a Prometheus or WordPress app on it.
  4. Once the volumes are mounted, upgrade EFS to 1.7.0.
  5. Once the new EFS pods are up, the pods with volume mounts start failing.

Yes, I have both EBS and EFS on an EKS cluster.

roshanirathi avatar Oct 12 '23 14:10 roshanirathi

Any update on this?

roshanirathi avatar Oct 24 '23 04:10 roshanirathi

We have experienced this same issue when upgrading to 1.7.0 via helm.

Existing pods using EFS all started failing to access their EFS volumes, and new pods coming up were unable to mount EFS volumes, failing with: Unable to attach or mount volumes: unmounted volumes=[instance-cache], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition. This continued until the node was terminated and relaunched.

william00179 avatar Oct 30 '23 06:10 william00179

@seanzatzdev-amazon any update on this? I am seeing the same issue with version 1.7.1 as well.

roshanirathi avatar Jan 11 '24 10:01 roshanirathi

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
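As a worked example of the rules above, the following sketch computes when each lifecycle step would fall after the last human comment. It assumes no further activity resets the clock. The helper name triage_timeline is illustrative, and the input date is that of the preceding comment in this thread (Jan 11 '24).

```python
from datetime import date, timedelta
from typing import Dict

STALE_AFTER = timedelta(days=90)   # inactivity before lifecycle/stale
ROTTEN_AFTER = timedelta(days=30)  # further inactivity before lifecycle/rotten
CLOSE_AFTER = timedelta(days=30)   # further inactivity before the issue is closed


def triage_timeline(last_activity: date) -> Dict[str, date]:
    """Apply the three inactivity windows in sequence, assuming no new activity."""
    stale = last_activity + STALE_AFTER
    rotten = stale + ROTTEN_AFTER
    closed = rotten + CLOSE_AFTER
    return {"stale": stale, "rotten": rotten, "closed": closed}


if __name__ == "__main__":
    for label, when in triage_timeline(date(2024, 1, 11)).items():
        print(f"{label}: {when.isoformat()}")
    # stale: 2024-04-10
    # rotten: 2024-05-10
    # closed: 2024-06-09
```

These dates line up with the bot comments on this issue (Apr 10, May 10 and Jun 09 '24).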

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 10 '24 11:04 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar May 10 '24 11:05 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Jun 09 '24 12:06 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jun 09 '24 12:06 k8s-ci-robot