aws-ebs-csi-driver
PVC attaching takes a long time
Duplicating case https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1279, as we observe the same issue: detaching/deleting and attaching a PVC to a new node takes more than 6 minutes and is followed by a timeout.
What happened?
Normal Scheduled 7m21s default-scheduler Successfully assigned dev-test/bigid-cache-0 to ip-13-118-32-127.ec2.internal
Warning FailedAttachVolume 7m21s attachdetach-controller Multi-Attach error for volume "pvc-xxx" Volume is already exclusively attached to one node and can't be attached to another
Warning FailedMount 3m2s (x2 over 5m18s) kubelet Unable to attach or mount volumes: unmounted volumes=[redis-data], unattached volumes=[health redis-data config redis-tmp-conf tmp kube-api-access-98lr9 start-scripts]: timed out waiting for the condition
Normal SuccessfulAttachVolume 80s attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-xxx"
Normal Pulling 64s kubelet Pulling image "docker.io/bitnami/redis:6.2.7-debian-10-r32"
Normal Pulled 62s kubelet Successfully pulled image "docker.io/bitnami/redis:6.2.7-debian-10-r32" in 2.812050986s
Normal Created 62s kubelet Created container redis
Normal Started 62s kubelet Started container redis
EBS CSI Controller logs (failure to delete EBS volume after detach):
I0706 19:22:20.580181 1 request.go:665] Waited for 1.119392703s due to client-side throttling, not priority and fairness, request: PATCH:https://172.20.0.1:443/apis/storage.k8s.io/v1/volumeattachments/csi-49dcfd3771191a910bb4dc00a3862dbc4332449c708072c95e11f43ead631eb3/status
I0706 19:22:21.580381 1 request.go:665] Waited for 1.331653146s due to client-side throttling, not priority and fairness, request: PATCH:https://172.20.0.1:443/apis/storage.k8s.io/v1/volumeattachments/csi-1973672463aeb63fb239e6370c04bb7df8bb50290d8d9f2a0d3690d134f3521a/status
I0706 19:22:21.786329 1 csi_handler.go:286] Failed to save detach error to "csi-49dcfd3771191a910bb4dc00a3862dbc4332449c708072c95e11f43ead631eb3": volumeattachments.storage.k8s.io "csi-49dcfd3771191a910bb4dc00a3862dbc4332449c708072c95e11f43ead631eb3" not found
I0706 19:22:21.786354 1 csi_handler.go:231] Error processing "csi-49dcfd3771191a910bb4dc00a3862dbc4332449c708072c95e11f43ead631eb3": failed to detach: could not mark as detached: volumeattachments.storage.k8s.io "csi-49dcfd3771191a910bb4dc00a3862dbc4332449c708072c95e11f43ead631eb3" not found
I0706 19:22:21.877958 1 csi_handler.go:261] Attached "csi-d0cb7bac7a53511c451712aaa8bc8bac79adc2d3426d993d1f14a135d84fb5ab"
I0706 19:22:21.986228 1 csi_handler.go:286] Failed to save detach error to "csi-a5b254ba7d7a41c45a4d772d05dc55a002bff25d9f53c1ee9bd945d98126b59d": volumeattachments.storage.k8s.io "csi-a5b254ba7d7a41c45a4d772d05dc55a002bff25d9f53c1ee9bd945d98126b59d" not found
I0706 19:22:21.986254 1 csi_handler.go:231] Error processing "csi-a5b254ba7d7a41c45a4d772d05dc55a002bff25d9f53c1ee9bd945d98126b59d": failed to detach: could not mark as detached: volumeattachments.storage.k8s.io "csi-a5b254ba7d7a41c45a4d772d05dc55a002bff25d9f53c1ee9bd945d98126b59d" not found
What you expected to happen? PVC detaching and attaching to a new k8s node should not take this long. Compared to the default Kubernetes storage driver (gp2), where I do not see this issue, it takes a lot of time.
How to reproduce it (as minimally and precisely as possible)? You just need to directly terminate the EC2 instance (k8s node) that the StatefulSet is running on.
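A minimal sketch of that reproduction, reusing the pod and node names from the events above; the instance ID is a placeholder:
```shell
# Find the node the StatefulSet pod is running on, then terminate its EC2
# instance directly, without draining the node first.
kubectl get pod bigid-cache-0 -n dev-test -o wide   # note the NODE column

aws ec2 describe-instances \
  --filters "Name=private-dns-name,Values=ip-13-118-32-127.ec2.internal" \
  --query 'Reservations[].Instances[].InstanceId' --output text

aws ec2 terminate-instances --instance-ids i-0123456789abcdef0   # placeholder ID
```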
Anything else we need to know?:
Environment
- Kubernetes version (use `kubectl version`): 1.22
- Driver version: 1.6.2
I believe this is expected behavior: if you terminate your EC2 instance ungracefully (i.e. without safely evicting all of your pods through `kubectl drain`), the pods that are part of a StatefulSet will be stuck terminating on the shutdown node and cannot move to a new running node, since the kubelet is not available to delete the pods. If there are volumes used by the pods, the VolumeAttachment objects will not be deleted from the original node, so the volumes used by these pods cannot be attached to a new running node. There is work being done upstream to address this: K8s v1.24 introduces alpha support for Non-Graceful Node Shutdown, which would reduce this time. For more information see the KEP and https://github.com/kubernetes/kubernetes/pull/108486.
The 6 minute wait time before timing out is due to the default configuration of the Attach/Detach controller timers. More specifically, `ReconcilerMaxWaitForUnmountDuration` is the maximum amount of time the attach/detach controller waits for a volume to be safely unmounted from a node. After 6 minutes, the controller assumes the node is unresponsive and attempts to detach the volume.
Is ReconcilerMaxWaitForUnmountDuration configurable? Is it part of the control plane? The same issue happens to us, but not on ungraceful shutdown: it takes a long time for the PVC to get reattached to the new node after a pod reschedules.
@amitde69 ReconcilerMaxWaitForUnmountDuration is currently not configurable: https://github.com/kubernetes/kubernetes/issues/105341.
The previously mentioned KEP addresses this 6 minute wait time by issuing a force detach if the `out-of-service` taint is present on the node.
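For reference, a sketch of applying that taint manually to a node that is known to be shut down (node name is a placeholder; this relies on the alpha `NodeOutOfServiceVolumeDetach` feature gate in K8s v1.24+):
```shell
# With the feature gate enabled, tainting the shut-down node lets the
# attach/detach controller force-detach its volumes instead of waiting
# the full 6 minutes.
kubectl taint nodes ip-13-118-32-127.ec2.internal \
  node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
```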
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
+1 this makes some apps unusable (e.g. strimzi kafka operator) because of the frequent timeouts. The same works fine on the default Kubernetes Storage driver (gp2).
I'm still seeing this /remove-lifecycle stale
I'm still seeing this
K8s AWS EKS 1.22.11
EBS image v1.13.0
/remove-lifecycle stale
Experiencing the same behavior, and this is very concerning as this is a clear regression compared to previous versions of EKS.
We could observe the issue with a pod scheduled with Karpenter. Karpenter cordoned & drained the node on which the pod was running (during consolidation), and it took several minutes for the pod to get back to the running state. The pod event log looks exactly the same as in the original issue report.
I want to make it clear that the node was not terminated ungracefully.
We're experiencing this on EKS 1.24 + EBS 1.12.
We use node-termination-handler to handle the graceful draining of the nodes. We observed that the volume shows as detached from the old node in the AWS console, but the attachdetach controller doesn't mark the volume as detached in the Kubernetes API.
This seems to be an issue in the attachdetach controller, to be honest. I'll try to dig into what may be happening.
Same here. EKS 1.24 Karpenter 0.23.0 ebs-csi-driver 1.15.0
/remove-lifecycle stale
Unfortunately, we see this problem as well, and it has a severe negative impact on our workload. Has anyone come up with a solution or workaround yet?
/kind bug
@torredil @ConnorJC3 With comments and thumbs up deduplicated, there are now 12 people who found their way to this GitHub issue to report this exact same problem. And I suspect this number will grow, with EKS 1.22 support ending in June 2023 and the EBS CSI Driver becoming mandatory in EKS 1.23.
Can we please have an official word on this? Do you confirm the issue, or do you still think this is expected, perfectly normal behavior? Do you maybe lack the resources to look into this? Should we alert our AWS account reps to get more attention on this?
Hi all, I can't make any specific promises before I get a chance to look into it, but I will investigate this bug sometime during the next work week.
In the meantime, can anyone affected here please clarify that this issue is happening to them (only) when they terminate the EC2 instance behind a node without draining it first?
If you can reproduce this under other scenarios that are different, please provide information on how to do that, thanks!
Thanks!
Running a "kubectl drain
For me, killing a pod of a StatefulSet with a PVC attached is already enough to reproduce.
I have this issue every time we start the node replacement process when updating AWS launch templates. Killing the pods from the StatefulSet is usually not enough, so we also kill the EBS CSI Controller pods, and then the issue disappears.
> Hi all, I can't make any specific promises before I get a chance to look into it, but I will investigate this bug sometime during the next work week.
> In the meantime, can anyone affected here please clarify that this issue is happening to them (only) when they terminate the EC2 instance behind a node without draining it first?
> If you can reproduce this under other scenarios that are different, please provide information on how to do that, thanks!
Hi @ConnorJC3
Anything else you need, just let us know!
Hi all, sorry about the somewhat late response. I have been trying to reproduce this locally via methods such as `kubectl drain`, but I cannot seem to do so.
If anyone experiencing this issue not related to an ungraceful shutdown (ungraceful shutdowns causing this is expected behavior and out of our control) could provide one of the two, that would be very helpful:
- A minimum set of manifests to reproduce the issue with some level of consistency, OR
- Logs from the EBS CSI Controller `ebs-plugin` and `csi-attacher` containers with the log level for both turned up to 4 when the issue occurs
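For anyone gathering those logs, a rough sketch, assuming the default `ebs-csi-controller` Deployment in `kube-system` (names may differ for the EKS addon):
```shell
# Raise verbosity to 4 by appending --v=4 to the args of the ebs-plugin and
# csi-attacher containers (edit interactively; arg lists vary by version):
kubectl -n kube-system edit deployment ebs-csi-controller

# Once the issue reproduces, capture the logs from both containers:
kubectl -n kube-system logs deployment/ebs-csi-controller -c ebs-plugin > ebs-plugin.log
kubectl -n kube-system logs deployment/ebs-csi-controller -c csi-attacher > csi-attacher.log
```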
I haven't got reproduction steps per se, but running StatefulSets with PVs/PVCs on Karpenter spot instances with a node lifespan TTL of 15 minutes is a nice way to force the issue to happen quickly, as nodes are constantly churning and re-attaching the PVs takes a while.
We are using Spot instances on AWS, and when an instance is terminated the same issue occurs.
EKS: 1.22, ebs-csi-driver: v1.17, node-termination-handler: v1.19
I don't know what changed exactly, but I cannot reproduce the issue after upgrading to the driver's latest version.
Reverting to an older version like 1.12.1 brings the issue back.
I will upgrade all production clusters to the latest version and keep you updated if the issue comes back. Meanwhile, it would be nice if anyone else could confirm the issue is gone after upgrading to the driver's latest version.
@maximethebault After upgrading your production clusters, are you still seeing the issue?
I can confirm that this issue still exists with EKS 1.25 / 1.26 and the latest EBS driver version, 1.18.0.
Hi all, I also faced PVC (EBS) attach timeouts. Sometimes PVC attachment takes up to 1 hour. In the csi-attacher container logs I see lots of "client-side throttling" messages. Is there a way to increase the request quota on the csi-attacher side?
I0515 15:12:23.707456 1 request.go:682] Waited for 3.792550031s due to client-side throttling, not priority and fairness, request: PATCH:https://100.64.0.1:443/api/v1/persistentvolumes/pvc-485c1d77-30e1-4006-bad4-396eeee17a6b
I0515 15:12:24.907002 1 request.go:682] Waited for 3.766319675s due to client-side throttling, not priority and fairness, request: PATCH:https://100.64.0.1:443/api/v1/persistentvolumes/pvc-cef2f6e7-df16-4d1d-ad50-7981093c78ad
I0515 15:12:25.907372 1 request.go:682] Waited for 3.392822052s due to client-side throttling, not priority and fairness, request: PATCH:https://100.64.0.1:443/api/v1/persistentvolumes/pvc-20aa26b3-f495-456e-a284-d9327bf73a84
I0515 15:12:27.107446 1 request.go:682] Waited for 3.192762704s due to client-side throttling, not priority and fairness, request: PATCH:https://100.64.0.1:443/api/v1/persistentvolumes/pvc-591f7379-5c9f-45f1-a42d-ef53808d0684
The PVC attaching delay was fixed by increasing the attacher and provisioner QPS and burst: `--kube-api-burst=20 --kube-api-qps=10.0`.
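A sketch of applying those flags to the controller Deployment; the Deployment name and container index are assumptions, so check your own Deployment first:
```shell
# List the sidecar container names/order in the controller Deployment:
kubectl -n kube-system get deployment ebs-csi-controller \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\n"}{end}'

# Append the flags to the csi-attacher container (index 1 here is an assumption);
# repeat with the csi-provisioner container's index for the provisioner:
kubectl -n kube-system patch deployment ebs-csi-controller --type=json -p='[
  {"op": "add", "path": "/spec/template/spec/containers/1/args/-", "value": "--kube-api-qps=10.0"},
  {"op": "add", "path": "/spec/template/spec/containers/1/args/-", "value": "--kube-api-burst=20"}
]'
```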
Is this related? https://github.com/kubernetes-csi/external-attacher/issues/215
In our tests, we were able to reproduce an issue where a volume remained attached to a node after the pod was deleted. This issue is relatively easy to reproduce (within a few hundred volume attachments) and would not clear itself without manual intervention.
It's possible there are multiple issues which users are experiencing in this thread, however, this one was the most disruptive to our users.
This reproduction is visible in the following places (commands to check each state are sketched after this list):
- Node.volumesAttached and Node.volumesInUse were not empty on the old node (the stuck volumes remain there)
- The volume was still typically attached to the old node in the AWS console
- The VolumeAttachment object was still present in the cluster for the old node
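A sketch of checking each of those; node, PV, and volume IDs are placeholders:
```shell
# Stuck volumes still listed in the old node's status:
kubectl get node ip-10-0-0-137.eu-west-1.compute.internal \
  -o jsonpath='{.status.volumesAttached}{"\n"}{.status.volumesInUse}{"\n"}'

# Volume still attached to the old instance on the AWS side:
aws ec2 describe-volumes --volume-ids vol-0123456789abcdef0 \
  --query 'Volumes[].Attachments[].[InstanceId,State]' --output text

# VolumeAttachment object still present for the old node:
kubectl get volumeattachment -o wide | grep ip-10-0-0-137
```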
The following workarounds can be used to detach the volume from the node (example commands are sketched after this list):
- Delete the VolumeAttachment object - this seems to reattach the disk and kick off the detach process successfully
- Delete the node - the controller-manager attachdetach component will delete the VolumeAttachment object after 6 minutes
- Restart the kubelet on the node - The kubelet then detects the correct state (ex. The volume is not mounted and not in use)
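Rough shell equivalents of those three workarounds, with placeholder names:
```shell
# 1. Delete the stuck VolumeAttachment object:
kubectl get volumeattachment | grep pvc-42d8283d
kubectl delete volumeattachment csi-<hash-from-previous-command>

# 2. Delete the old Node object; the controller-manager attachdetach component
#    cleans up the VolumeAttachment after its ~6 minute unmount wait:
kubectl delete node ip-10-0-0-137.eu-west-1.compute.internal

# 3. Restart the kubelet on the old node so it re-reports the real mount state:
sudo systemctl restart kubelet
```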
This issue is most likely to occur when a pod moves from one node to another, for example, when draining a node and during karpenter consolidation events.
During investigation, we saw the following calls in CloudWatch to the API server from the kubelet:
filter requestURI like "/api/v1/persistentvolumes/"
| fields @timestamp, requestURI, user.username
| sort @timestamp asc
| Time | URL | Auth |
| --- | --- | --- |
| 2023-06-19 08:11:50.180 | /api/v1/persistentvolumes/pvc-42d8283d-38c8-49af-a09b-0822b118b40d | system:node:ip-10-0-0-47.eu-west-1.compute.internal |
| 2023-06-19 08:13:33.706 | /api/v1/persistentvolumes/e706dce472d961b5b5ad8c80796434bb34d486f1ec17127a20f92b2084f4e61a | system:node:ip-10-0-0-137.eu-west-1.compute.internal |
What stood out here was that a sha256 hash was being requested from the API, in place of the expected `pvc-*` format. This only happens when the error condition occurs and the volume is not detached from the node. It will appear periodically in the logs, as the detach job has a backoff period.
Delving into the kubelet logs, we can see logs similar to the following:
csi_attacher.go:603] kubernetes.io/csi: attacher.UnmountDevice failed to get driver and volume name from device mount path: persistentvolumes "e706dce472d961b5b5ad8c80796434bb34d486f1ec17127a20f92b2084f4e61a" is forbidden: User "system:node:ip-10-0-0-137.eu-west-1.compute.internal" cannot get res
This is the point where the kubelet is trying to detach the volume and fails because it cannot read `vol_data.json`. What was interesting here is that this code path does not exist in the release-1.27 codebase. With the latest patches, the kubelet should not be making this erroneous call to the api-server during an unmount.
PR https://github.com/kubernetes/kubernetes/pull/116138 removes the code path being taken by `kubelet/v1.27.1 (linux/amd64) kubernetes/abfec7d`, and the fix has been backported to other versions.
The invalid path was:
- In the event the `vol_data.json` file did not exist
- Call `getDriverAndVolNameFromDeviceMountPath`
- This calls `pv, err := k8s.CoreV1().PersistentVolumes().Get(context.TODO(), pvName, metav1.GetOptions{})`
- Where `pvName` is the sha256 hash of the VolumeHandler (`// dir is now /var/lib/kubelet/plugins/kubernetes.io/csi/pv/{pvname}`)
- The API returns a 403 Forbidden error: the PV doesn't exist, and if it did, the kubelet would not have permission to access it
- Return an error, and wait for the job to run again, which will fail indefinitely
So the observed behaviour becomes:
- Due to an underlying (rare) bug in the kubelet volumemanager, multiple detach attempts may be made for the same volume after a pod is stopped.
- The first attempt will succeed, unmounting the volume, but the second attempt will fail with the `persistentvolume` API call error above, as the `vol_data.json` has been unmounted by the first attempt.
- As the second job is failing, the volume remains marked as "attached" and "in-use" in the Node's status.
- The second job will retry forever, never succeeding.
- As the volume is marked "attached" and "in-use" by the node, the controller-manager attachdetach component will not delete the VolumeAttachment object for Cloud API detachment, as it could potentially be unsafe to do so, unless the node is terminated.
The changes in the PR remove this erroneous code path and return a `nil` error in the event of a missing `vol_data.json`, which will result in a successful unmount.
It appears the EKS-provided version of the kubelet is not patched with the PR commits that include the fix for this issue; otherwise, these log events would not occur.
The PR states this fix has been applied to the upstream branches, which can be checked by looking at the specific `release-*` branch in the `kubernetes/kubernetes` repo:
> FYI. this issue is fixed in 1.24.14, 1.25.10, 1.26.5, and has been fixed in v1.27.0
We have reproduced this behaviour in the following versions of the AWS kubelet, as recorded by the UserAgent string:
- `kubelet/v1.27.1 (linux/amd64) kubernetes/abfec7d`
- `kubelet/v1.26.4 (linux/amd64) kubernetes/4a34796`
- `kubelet/v1.25.9 (linux/amd64) kubernetes/a1a87a0`
- `kubelet/v1.24.13 (linux/amd64) kubernetes/4943330`
It looks like we will need AWS to rebuild the kubelet/AMIs with the required upstream patches to resolve this specific issue.
To aid in reproduction, we have created a repository available here: https://github.com/martysweet/eks-ebs-node-detachment-reproduction, which provisions a bare-bones EKS cluster based on Terraform blueprints. The Python script can then be used to cycle the pod between two nodes. It will typically take less than 60 minutes before it becomes stuck on all of the above versions.
Edit: Further investigation shows that, despite the PR suggesting it had been fixed in v1.27.0, this is not the case:
- https://github.com/kubernetes/kubernetes/blob/v1.27.0/pkg/volume/csi/csi_attacher.go#L695 - Not applied
- https://github.com/kubernetes/kubernetes/blob/v1.27.1/pkg/volume/csi/csi_attacher.go#L695 - Not applied
- https://github.com/kubernetes/kubernetes/blob/v1.27.2/pkg/volume/csi/csi_attacher.go#L598 - Fixed
The other kubelet versions tested (using the default blueprint-sourced AL2 AMIs) have patch versions lower than the PR mentions (e.g. kubelet/v1.24.13 vs 1.24.14).
I can confirm the issue of long re-attachments for the addon v1.19.0-eksbuild.2 from Kubernetes version 1.23 upwards, and I found kind of an easier workaround while impatiently updating a cluster: I deleted the `volume.kubernetes.io/selected-node: [ip...]` annotation from all PVCs prior to updating, and the switch to the new nodes was almost smooth.
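A minimal sketch of that workaround; the namespace is a placeholder, and the trailing dash removes the annotation:
```shell
# Drop the scheduler's node-affinity hint from every PVC in the namespace
# before rolling the nodes:
kubectl -n dev-test annotate pvc --all volume.kubernetes.io/selected-node-
```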
I tested this for k8s versions 1.24 -> 1.25 -> 1.26. The 1.23 -> 1.24 update was my test update, where I discovered this fix while the pods couldn't be rescheduled.
This bug is a real pain and, in my opinion, should be a dealbreaker for everyone who tests EKS thoroughly.
I have the same issue with a gp2 volume in 1.24, 1.25, 1.26. The pod restart process takes from 3 to 20 minutes, even if the node hasn't changed. I had opened a case, and a support engineer from AWS suggested adding `fsGroupChangePolicy: "OnRootMismatch"` to the securityContext of the StatefulSet.
More info -> https://kubernetes.io/blog/2020/12/14/kubernetes-release-1.20-fsgroupchangepolicy-fsgrouppolicy/
This option helped me, and the pod started to run without significant delays, but only if the node hasn't changed.
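A sketch of that suggestion as a patch; the StatefulSet name, namespace, and fsGroup value are placeholders:
```shell
# fsGroupChangePolicy: OnRootMismatch makes the kubelet skip the recursive
# ownership/permission change on the volume when the root already matches,
# which is what otherwise delays pod startup on large volumes.
kubectl -n dev-test patch statefulset bigid-cache -p '
spec:
  template:
    spec:
      securityContext:
        fsGroup: 1001
        fsGroupChangePolicy: "OnRootMismatch"
'
```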
The same problem happens when a node is terminated gracefully by Karpenter; as an AWS support engineer described, it is a race condition related to Karpenter. Issue: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1665
I tried to drain the node manually before deletion and then deleted it. This helped, and there were no re-attaching issues for my pod from the StatefulSet. Probably this could be temporarily fixed by integrating node-termination-handler.
For whoever is still impacted by this behaviour: make sure to update to the latest EBS CSI driver version.