aws-efs-csi-driver
Liveness probe for Pods using EFS volume mounts fails after upgrade/downgrade of EFS version
I have an EKS cluster (1.27) with the EFS CSI driver 1.5.6 running. When I install Prometheus on it, the pods come up and the volume mounts succeed.
When I upgrade the EFS CSI driver to version 1.6.0, the prometheus-operator StatefulSet goes into a NotReady state.
I can see this in the kubectl events:
Liveness probe failed: Get "http://10.0.3.219:9090/-/healthy": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
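For context, a minimal sketch of how such probe failures can be surfaced; the monitoring namespace and pod name below are placeholders, not taken from the report.

```sh
# Show recent events in the namespace running Prometheus (namespace is assumed).
kubectl get events -n monitoring --sort-by=.lastTimestamp

# Inspect the probe configuration and recent restarts on the affected pod
# (pod name is a placeholder).
kubectl describe pod <prometheus-pod-name> -n monitoring
```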
Similar behaviour is seen with a WordPress app. This happens only in upgrade or downgrade scenarios. I have checked other CSI drivers such as EBS and vSphere, and the issue is not seen there, so it is an EFS issue.
What you expected to happen? The liveness probe should not fail because of an upgrade or downgrade of the EFS CSI driver.
How to reproduce it (as minimally and precisely as possible)?
- Create an EKS cluster with the EFS CSI driver 1.5.6
- Install Prometheus on top of it.
- Upgrade the EFS CSI driver to 1.6.0 (a command sketch follows this list).
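A minimal command sketch of these steps, assuming the driver is installed from the official Helm chart (the reporter confirms a Helm install later in the thread). The chart version that ships a given driver release differs from the driver version itself, so the versions below are placeholders.

```sh
# Add the official chart repository for the EFS CSI driver.
helm repo add aws-efs-csi-driver https://kubernetes-sigs.github.io/aws-efs-csi-driver/
helm repo update

# Install the chart release that ships driver v1.5.6 (chart version is a placeholder).
helm install aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
  --namespace kube-system --version <chart-version-shipping-1.5.6>

# Deploy Prometheus with persistent volumes backed by an EFS StorageClass, then
# upgrade the driver to the chart that ships v1.6.0 and watch the workload pods.
helm upgrade aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
  --namespace kube-system --version <chart-version-shipping-1.6.0>
kubectl get pods -A -w | grep -i prometheus
```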
Environment
- Kubernetes version
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.2", GitCommit:"7f6f68fdabc4df88cfea2dcf9a19b2b830f1e647", GitTreeState:"clean", BuildDate:"2023-05-17T14:20:07Z", GoVersion:"go1.20.4", Compiler:"gc", Platform:"darwin/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27+", GitVersion:"v1.27.4-eks-2d98532", GitCommit:"3d90c097c72493c2f1a9dd641e4a22d24d15be68", GitTreeState:"clean", BuildDate:"2023-07-28T16:51:44Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}
- Driver version: 1.6.0
@roshanirathi How did you install the driver? Also, if you can, could you please provide DEBUG-level logs? https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/troubleshooting/README.md
@roshanirathi Do you see error logs similar to those in this hostNetwork issue [link]?
I use the Helm chart for driver installation. The efs-utils logs are similar, and the issue is the same: mounts don't work after the upgrade.
The debug logs are present here - https://drive.google.com/drive/folders/1jBmNqdV4UEGMRbm7IFeU13GFLZdpjYdd?usp=sharing
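For anyone following along, a sketch of how the driver logs can be pulled with kubectl; the kube-system namespace, the app=efs-csi-node / app=efs-csi-controller labels, and the efs-plugin container name follow the chart defaults and are assumptions here.

```sh
# Node DaemonSet logs: this component performs the actual EFS mounts on each node.
kubectl logs -n kube-system -l app=efs-csi-node -c efs-plugin --tail=-1

# Controller logs: this component handles dynamic provisioning.
kubectl logs -n kube-system -l app=efs-csi-controller -c efs-plugin --tail=-1
```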
The issue is fixed in version 1.7.0. Closing this.
/reopen
@roshanirathi: Reopened this issue.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I am still seeing this issue when I have EBS as the CSI layer and EFS as an add-on CSI driver. The volume mounts use the EFS StorageClass, and the same issue is seen when I upgrade from 1.5.6 to 1.7.0.
Hi @roshanirathi, to ensure that we have the most accurate information, could you please provide new debug-level logs?
Added the logs here - https://drive.google.com/drive/folders/1jBmNqdV4UEGMRbm7IFeU13GFLZdpjYdd?usp=sharing
@roshanirathi From what I understand, you did the following:
- Created an EKS cluster with the EFS CSI driver 1.5.6 via the add-on
- Installed Prometheus on top of it.
- Upgraded the EFS CSI driver to 1.7.0 via the add-on
- Mounted a volume using the EFS storage class
I don't fully understand where EBS factors into your setup. Do you have both EFS & EBS CSI drivers running?
Could you please provide a list of instructions to replicate the new issue?
- Create an EKS cluster with the EBS CSI driver.
- Deploy the EFS CSI driver 1.5.6 on it.
- Deploy a Prometheus or WordPress app on it.
- Once volumes are mounted, upgrade the EFS CSI driver to 1.7.0.
- Once the new EFS driver pods are up, the pods using those volume mounts start failing (a diagnostic sketch follows this list).
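A hedged diagnostic sketch for the last step, checking from the node (or a privileged debug pod) whether existing EFS mounts went stale when the upgraded DaemonSet pods came up; the paths and the suspected failure mode are assumptions, not confirmed in this thread.

```sh
# List NFSv4 mounts on the node; EFS volumes mounted by the CSI driver live
# under the kubelet pods directory.
findmnt -t nfs4

# A hung stat, "Stale file handle", or I/O error on an existing mount path
# suggests the mount lost its backing connection when the driver pod was
# replaced (pod UID and PV name are placeholders).
stat /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount
```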
Yes, I have both the EBS and EFS CSI drivers on an EKS cluster.
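For completeness, a quick way to confirm that both drivers are registered and their pods are running (namespace assumed to be kube-system):

```sh
# Both drivers should appear in the CSIDriver list as ebs.csi.aws.com and efs.csi.aws.com.
kubectl get csidrivers

# Driver pods for both stacks.
kubectl get pods -n kube-system | grep -E 'efs-csi|ebs-csi'
```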
Any update on this?
We have experienced this same issue when upgrading to 1.7.0 via Helm.
Existing pods using EFS all started failing to access their EFS volumes, and new pods coming up were unable to mount EFS volumes with "Unable to attach or mount volumes: unmounted volumes=[instance-cache], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition" until the node was terminated and relaunched.
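A sketch of the mitigation described above, recycling the affected node so the new driver pods establish fresh mounts; how the replacement node is provisioned depends on the node group setup and is an assumption here.

```sh
# Cordon and drain the affected node so workloads are rescheduled elsewhere.
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Terminate the underlying EC2 instance (console, managed node group, or Auto
# Scaling group) and wait for the replacement node to join the cluster; pods
# scheduled onto it mount EFS through the upgraded driver from the start.
```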
@seanzatzdev-amazon Any update on this? I am seeing the same issue with version 1.7.1 as well.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.