aws-ebs-csi-driver
PVC attaching takes a long time
Duplicating case https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1279, as we observe the same issue: detaching/deleting and attaching a PVC to a new node takes more than 6 minutes and is followed by a timeout.
What happened?
Normal Scheduled 7m21s default-scheduler Successfully assigned dev-test/bigid-cache-0 to ip-13-118-32-127.ec2.internal
Warning FailedAttachVolume 7m21s attachdetach-controller Multi-Attach error for volume "pvc-xxx" Volume is already exclusively attached to one node and can't be attached to another
Warning FailedMount 3m2s (x2 over 5m18s) kubelet Unable to attach or mount volumes: unmounted volumes=[redis-data], unattached volumes=[health redis-data config redis-tmp-conf tmp kube-api-access-98lr9 start-scripts]: timed out waiting for the condition
Normal SuccessfulAttachVolume 80s attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-xxx"
Normal Pulling 64s kubelet Pulling image "docker.io/bitnami/redis:6.2.7-debian-10-r32"
Normal Pulled 62s kubelet Successfully pulled image "docker.io/bitnami/redis:6.2.7-debian-10-r32" in 2.812050986s
Normal Created 62s kubelet Created container redis
Normal Started 62s kubelet Started container redis
EBS CSI Controller logs (failure to delete EBS volume after detach):
I0706 19:22:20.580181 1 request.go:665] Waited for 1.119392703s due to client-side throttling, not priority and fairness, request: PATCH:https://172.20.0.1:443/apis/storage.k8s.io/v1/volumeattachments/csi-49dcfd3771191a910bb4dc00a3862dbc4332449c708072c95e11f43ead631eb3/status
I0706 19:22:21.580381 1 request.go:665] Waited for 1.331653146s due to client-side throttling, not priority and fairness, request: PATCH:https://172.20.0.1:443/apis/storage.k8s.io/v1/volumeattachments/csi-1973672463aeb63fb239e6370c04bb7df8bb50290d8d9f2a0d3690d134f3521a/status
I0706 19:22:21.786329 1 csi_handler.go:286] Failed to save detach error to "csi-49dcfd3771191a910bb4dc00a3862dbc4332449c708072c95e11f43ead631eb3": volumeattachments.storage.k8s.io "csi-49dcfd3771191a910bb4dc00a3862dbc4332449c708072c95e11f43ead631eb3" not found
I0706 19:22:21.786354 1 csi_handler.go:231] Error processing "csi-49dcfd3771191a910bb4dc00a3862dbc4332449c708072c95e11f43ead631eb3": failed to detach: could not mark as detached: volumeattachments.storage.k8s.io "csi-49dcfd3771191a910bb4dc00a3862dbc4332449c708072c95e11f43ead631eb3" not found
I0706 19:22:21.877958 1 csi_handler.go:261] Attached "csi-d0cb7bac7a53511c451712aaa8bc8bac79adc2d3426d993d1f14a135d84fb5ab"
I0706 19:22:21.986228 1 csi_handler.go:286] Failed to save detach error to "csi-a5b254ba7d7a41c45a4d772d05dc55a002bff25d9f53c1ee9bd945d98126b59d": volumeattachments.storage.k8s.io "csi-a5b254ba7d7a41c45a4d772d05dc55a002bff25d9f53c1ee9bd945d98126b59d" not found
I0706 19:22:21.986254 1 csi_handler.go:231] Error processing "csi-a5b254ba7d7a41c45a4d772d05dc55a002bff25d9f53c1ee9bd945d98126b59d": failed to detach: could not mark as detached: volumeattachments.storage.k8s.io "csi-a5b254ba7d7a41c45a4d772d05dc55a002bff25d9f53c1ee9bd945d98126b59d" not found
What you expected to happen? PVC detaching and attaching to a new k8s node should not take this long. Compared to the default Kubernetes storage driver (gp2), where I do not see this issue, it takes a lot of time.
How to reproduce it (as minimally and precisely as possible)? You just need to directly terminate the EC2 instance (k8s node) that the StatefulSet is running on.
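A minimal sketch of that reproduction, reusing the pod and node names from the events above; the instance ID is a placeholder:
```shell
# Find the node the StatefulSet pod is running on, then terminate its EC2
# instance directly, without draining the node first.
kubectl get pod bigid-cache-0 -n dev-test -o wide   # note the NODE column

aws ec2 describe-instances \
  --filters "Name=private-dns-name,Values=ip-13-118-32-127.ec2.internal" \
  --query 'Reservations[].Instances[].InstanceId' --output text

aws ec2 terminate-instances --instance-ids i-0123456789abcdef0   # placeholder ID
```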
Anything else we need to know?:
Environment
- Kubernetes version (use `kubectl version`): 1.22
- Driver version: 1.6.2
I believe this is expected behavior: if you terminate your EC2 instance ungracefully (i.e. without safely evicting all of your pods through `kubectl drain`), the pods that are part of a StatefulSet will be stuck terminating on the shutdown node and cannot move to a new running node, since the kubelet is not available to delete the pods. If there are volumes used by the pods, the VolumeAttachment objects will not be deleted from the original node, so the volumes used by these pods cannot be attached to a new running node. There is work being done upstream to address this: K8s v1.24 introduces alpha support for Non-Graceful Node Shutdown, which would reduce this time. For more information see the KEP and https://github.com/kubernetes/kubernetes/pull/108486.
The 6 minute wait time before timing out is due to the default configuration of the Attach/Detach controller timers. More specifically, `ReconcilerMaxWaitForUnmountDuration` is the maximum amount of time the attach/detach controller waits for a volume to be safely unmounted from a node. After 6 minutes, the controller assumes the node is unresponsive and attempts to detach the volume.
Is ReconcilerMaxWaitForUnmountDuration configurable? Is it part of the control plane? The same issue happens to us, but not on ungraceful shutdown: it takes a long time for the PVC to get reattached to the new node after a pod reschedules.
@amitde69 ReconcilerMaxWaitForUnmountDuration is currently not configurable: https://github.com/kubernetes/kubernetes/issues/105341.
The previously mentioned KEP addresses this 6 minute wait time by issuing a force detach if the `out-of-service` taint is present on the node.
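For reference, a sketch of applying that taint manually to a node that is known to be shut down (node name is a placeholder; this relies on the alpha `NodeOutOfServiceVolumeDetach` feature gate in K8s v1.24+):
```shell
# With the feature gate enabled, tainting the shut-down node lets the
# attach/detach controller force-detach its volumes instead of waiting
# the full 6 minutes.
kubectl taint nodes ip-13-118-32-127.ec2.internal \
  node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
```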
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
+1 this makes some apps unusable (e.g. strimzi kafka operator) because of the frequent timeouts. The same works fine on the default Kubernetes Storage driver (gp2).
I'm still seeing this /remove-lifecycle stale
I'm still seeing this
K8s AWS EKS 1.22.11
EBS image v1.13.0
/remove-lifecycle stale
Experiencing the same behavior, and this is very concerning as this is a clear regression compared to previous versions of EKS.
We could observe the issue with a pod scheduled with Karpenter. Karpenter cordoned & drained the node on which the pod was running (during consolidation), and it took several minutes for the pod to get back to the running state. The pod event log looks exactly the same as in the original issue report.
I want to make it clear that the node was not terminated ungracefully.
We're experiencing this on EKS 1.24 + EBS 1.12.
We use node-termination-handler to handle the graceful draining of the nodes. We observed that the volume shows as detached from the old node in the AWS console, but the attachdetach controller doesn't mark the volume as detached in the Kubernetes API.
This seems to be an issue in the attachdetach controller, to be honest. I'll try to dig into what may be happening.
Same here. EKS 1.24 Karpenter 0.23.0 ebs-csi-driver 1.15.0
/remove-lifecycle stale
Unfortunately, we see this problem as well, and it has a severe negative impact on our workload. Has anyone come up with a solution or workaround yet?
/kind bug
@torredil @ConnorJC3 With comments and thumbs up deduplicated, there are now 12 people who found their way to this GitHub issue to report this exact same problem. And I suspect this number will grow, with EKS 1.22 support ending in June 2023 and the EBS CSI Driver becoming mandatory in EKS 1.23.
Can we please have an official word on this? Do you confirm the issue, or do you still think this is expected, perfectly normal behavior? Do you maybe lack the resources to look into this? Should we alert our AWS account reps to get more attention on this?
Hi all, I can't make any specific promises before I get a chance to look into it, but I will investigate this bug sometime during the next work week.
In the meantime, can anyone affected here please clarify that this issue is happening to them (only) when they terminate the EC2 instance behind a node without draining it first?
If you can reproduce this under other scenarios that are different, please provide information on how to do that, thanks!
Thanks!
Running a "kubectl drain
For me, killing a pod of a StatefulSet with a PVC attached is already enough to reproduce.
I have this issue every time we start the node replacement process when updating AWS launch templates. Killing the pods from the StatefulSet is usually not enough, so we also kill the EBS CSI Controller pods, and then the issue disappears.
> Hi all, I can't make any specific promises before I get a chance to look into it, but I will investigate this bug sometime during the next work week.
> In the meantime, can anyone affected here please clarify that this issue is happening to them (only) when they terminate the EC2 instance behind a node without draining it first?
> If you can reproduce this under other scenarios that are different, please provide information on how to do that, thanks!
Hi @ConnorJC3
Anything else you need, just let us know!
Hi all, sorry about the somewhat late response. I have been trying to reproduce this locally via methods such as `kubectl drain`, but I cannot seem to do so.
If anyone experiencing this issue not related to an ungraceful shutdown (ungraceful shutdowns causing this is expected behavior and out of our control) could provide one of the two, that would be very helpful:
- A minimum set of manifests to reproduce the issue with some level of consistency, OR
- Logs from the EBS CSI Controller `ebs-plugin` and `csi-attacher` containers with the log level for both turned up to 4 when the issue occurs
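For anyone gathering those logs, a rough sketch, assuming the default `ebs-csi-controller` Deployment in `kube-system` (names may differ for the EKS addon):
```shell
# Raise verbosity to 4 by appending --v=4 to the args of the ebs-plugin and
# csi-attacher containers (edit interactively; arg lists vary by version):
kubectl -n kube-system edit deployment ebs-csi-controller

# Once the issue reproduces, capture the logs from both containers:
kubectl -n kube-system logs deployment/ebs-csi-controller -c ebs-plugin > ebs-plugin.log
kubectl -n kube-system logs deployment/ebs-csi-controller -c csi-attacher > csi-attacher.log
```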
I haven't got reproduction steps per se, but running StatefulSets with PVs/PVCs on Karpenter spot instances with a node lifespan TTL of 15 minutes is a nice way to force the issue to happen quickly, as nodes are constantly churning and re-attaching the PVs takes a while.
We are using Spot instances on AWS, and when an instance is terminated the same issue occurs.
EKS: 1.22, ebs-csi-driver: v1.17, node-termination-handler: v1.19
I don't know what changed exactly, but I cannot reproduce the issue after upgrading to the driver's latest version.
Reverting to an older version like 1.12.1 brings the issue back.
I will upgrade all production clusters to the latest version and keep you updated if the issue comes back. Meanwhile, it would be nice if anyone else could confirm the issue is gone after upgrading to the driver's latest version.
@maximethebault After upgrading your production clusters, are you still seeing the issue?
I can confirm that this issue still exists with EKS 1.25 / 1.26 and the latest EBS driver version, 1.18.0.
Hi all, I also faced PVC (EBS) attach timeouts. Sometimes PVC attachment takes up to 1 hour. In the csi-attacher container logs I see lots of "client-side throttling" messages. Is there a way to increase the request quota on the csi-attacher side?
I0515 15:12:23.707456 1 request.go:682] Waited for 3.792550031s due to client-side throttling, not priority and fairness, request: PATCH:https://100.64.0.1:443/api/v1/persistentvolumes/pvc-485c1d77-30e1-4006-bad4-396eeee17a6b
I0515 15:12:24.907002 1 request.go:682] Waited for 3.766319675s due to client-side throttling, not priority and fairness, request: PATCH:https://100.64.0.1:443/api/v1/persistentvolumes/pvc-cef2f6e7-df16-4d1d-ad50-7981093c78ad
I0515 15:12:25.907372 1 request.go:682] Waited for 3.392822052s due to client-side throttling, not priority and fairness, request: PATCH:https://100.64.0.1:443/api/v1/persistentvolumes/pvc-20aa26b3-f495-456e-a284-d9327bf73a84
I0515 15:12:27.107446 1 request.go:682] Waited for 3.192762704s due to client-side throttling, not priority and fairness, request: PATCH:https://100.64.0.1:443/api/v1/persistentvolumes/pvc-591f7379-5c9f-45f1-a42d-ef53808d0684
The PVC attaching delay was fixed by increasing the attacher and provisioner QPS and burst: `--kube-api-burst=20 --kube-api-qps=10.0`.
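A sketch of applying those flags to the controller Deployment; the Deployment name and container index are assumptions, so check your own Deployment first:
```shell
# List the sidecar container names/order in the controller Deployment:
kubectl -n kube-system get deployment ebs-csi-controller \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\n"}{end}'

# Append the flags to the csi-attacher container (index 1 here is an assumption);
# repeat with the csi-provisioner container's index for the provisioner:
kubectl -n kube-system patch deployment ebs-csi-controller --type=json -p='[
  {"op": "add", "path": "/spec/template/spec/containers/1/args/-", "value": "--kube-api-qps=10.0"},
  {"op": "add", "path": "/spec/template/spec/containers/1/args/-", "value": "--kube-api-burst=20"}
]'
```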
Is this related? https://github.com/kubernetes-csi/external-attacher/issues/215
In our tests, we were able to reproduce an issue where a volume remained attached to a node after the pod was deleted. This issue is relatively easy to reproduce (within a few hundred volume attachments) and would not clear itself without manual intervention.
It's possible there are multiple issues which users are experiencing in this thread, however, this one was the most disruptive to our users.
This reproduction is visible in the following places (commands to check each state are sketched after this list):
- Node.volumesAttached and Node.volumesInUse were not empty on the old node (the stuck volumes remain there)
- The volume was still typically attached to the old node in the AWS console
- The VolumeAttachment object was still present in the cluster for the old node
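A sketch of checking each of those; node, PV, and volume IDs are placeholders:
```shell
# Stuck volumes still listed in the old node's status:
kubectl get node ip-10-0-0-137.eu-west-1.compute.internal \
  -o jsonpath='{.status.volumesAttached}{"\n"}{.status.volumesInUse}{"\n"}'

# Volume still attached to the old instance on the AWS side:
aws ec2 describe-volumes --volume-ids vol-0123456789abcdef0 \
  --query 'Volumes[].Attachments[].[InstanceId,State]' --output text

# VolumeAttachment object still present for the old node:
kubectl get volumeattachment -o wide | grep ip-10-0-0-137
```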
The following workarounds can be used to detach the volume from the node (example commands are sketched after this list):
- Delete the VolumeAttachment object - this seems to reattach the disk and kick off the detach process successfully
- Delete the node - the controller-manager attachdetach component will delete the VolumeAttachment object after 6 minutes
- Restart the kubelet on the node - The kubelet then detects the correct state (ex. The volume is not mounted and not in use)
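Rough shell equivalents of those three workarounds, with placeholder names:
```shell
# 1. Delete the stuck VolumeAttachment object:
kubectl get volumeattachment | grep pvc-42d8283d
kubectl delete volumeattachment csi-<hash-from-previous-command>

# 2. Delete the old Node object; the controller-manager attachdetach component
#    cleans up the VolumeAttachment after its ~6 minute unmount wait:
kubectl delete node ip-10-0-0-137.eu-west-1.compute.internal

# 3. Restart the kubelet on the old node so it re-reports the real mount state:
sudo systemctl restart kubelet
```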
This issue is most likely to occur when a pod moves from one node to another, for example, when draining a node and during karpenter consolidation events.
During investigation, we saw the following calls in CloudWatch to the API server from the kubelet:
filter requestURI like "/api/v1/persistentvolumes/"
| fields @timestamp, requestURI, user.username
| sort @timestamp asc
| Time | URL | Auth |
| --- | --- | --- |
| 2023-06-19 08:11:50.180 | /api/v1/persistentvolumes/pvc-42d8283d-38c8-49af-a09b-0822b118b40d | system:node:ip-10-0-0-47.eu-west-1.compute.internal |
| 2023-06-19 08:13:33.706 | /api/v1/persistentvolumes/e706dce472d961b5b5ad8c80796434bb34d486f1ec17127a20f92b2084f4e61a | system:node:ip-10-0-0-137.eu-west-1.compute.internal |
What stood out here was that a sha256 hash was being requested from the API, in place of the expected `pvc-*` format. This only happens when the error condition occurs and the volume is not detached from the node. It will appear periodically in the logs, as the detach job has a backoff period.
Delving into the kubelet logs, we can see logs similar to the following:
csi_attacher.go:603] kubernetes.io/csi: attacher.UnmountDevice failed to get driver and volume name from device mount path: persistentvolumes "e706dce472d961b5b5ad8c80796434bb34d486f1ec17127a20f92b2084f4e61a" is forbidden: User "system:node:ip-10-0-0-137.eu-west-1.compute.internal" cannot get res
This is the point where the kubelet is trying to detach the volume and fails because it cannot read `vol_data.json`. What was interesting here is that this code path does not exist in the release-1.27 codebase. With the latest patches, the kubelet should not be making this erroneous call to the api-server during an unmount.
PR https://github.com/kubernetes/kubernetes/pull/116138 removes the code path being taken by `kubelet/v1.27.1 (linux/amd64) kubernetes/abfec7d`, and the fix has been backported to other versions.
The invalid path was:
- In the event the `vol_data.json` file did not exist
- Call `getDriverAndVolNameFromDeviceMountPath`
- This calls `pv, err := k8s.CoreV1().PersistentVolumes().Get(context.TODO(), pvName, metav1.GetOptions{})`
- Where `pvName` is the sha256 hash of the VolumeHandler (`// dir is now /var/lib/kubelet/plugins/kubernetes.io/csi/pv/{pvname}`)
- The API returns a 403 Forbidden error: the PV doesn't exist, and if it did, the kubelet would not have permission to access it
- Return an error, and wait for the job to run again, which will fail indefinitely
So the observed behaviour becomes:
- Due to an underlying (rare) bug in the kubelet volumemanager, multiple detach attempts may be made for the same volume after a pod is stopped.
- The first attempt will succeed, unmounting the volume, but the second attempt will fail with the `persistentvolume` API call error above, as the `vol_data.json` has been unmounted by the first attempt.
- As the second job is failing, the volume remains marked as "attached" and "in-use" in the Node's status.
- The second job will retry forever, never succeeding.
- As the volume is marked "attached" and "in-use" by the node, the controller-manager attachdetach component will not delete the VolumeAttachment object for Cloud API detachment, as it could potentially be unsafe to do so, unless the node is terminated.
The changes in the PR remove this erroneous code path and return a `nil` error in the event of a missing `vol_data.json`, which will result in a successful unmount.
It appears the EKS-provided version of the kubelet is not patched with the PR commits that include the fix for this issue; otherwise, these log events would not occur.
The PR states this fix has been applied to the upstream branches, which can be checked by looking at the specific `release-*` branch in the `kubernetes/kubernetes` repo:
> FYI. this issue is fixed in 1.24.14, 1.25.10, 1.26.5, and has been fixed in v1.27.0
We have reproduced this behaviour in the following versions of the AWS kubelet, as recorded by the UserAgent string:
- `kubelet/v1.27.1 (linux/amd64) kubernetes/abfec7d`
- `kubelet/v1.26.4 (linux/amd64) kubernetes/4a34796`
- `kubelet/v1.25.9 (linux/amd64) kubernetes/a1a87a0`
- `kubelet/v1.24.13 (linux/amd64) kubernetes/4943330`
It looks like we will need AWS to rebuild the kubelet/AMIs with the required upstream patches to resolve this specific issue.
To aid in reproduction, we have created a repository available here: https://github.com/martysweet/eks-ebs-node-detachment-reproduction, which provisions a bare-bones EKS cluster based on Terraform blueprints. The Python script can then be used to cycle the pod between two nodes. It will typically take less than 60 minutes before it becomes stuck on all of the above versions.
Edit: Further investigation shows that, despite the PR suggesting it had been fixed in v1.27.0, this is not the case:
- https://github.com/kubernetes/kubernetes/blob/v1.27.0/pkg/volume/csi/csi_attacher.go#L695 - Not applied
- https://github.com/kubernetes/kubernetes/blob/v1.27.1/pkg/volume/csi/csi_attacher.go#L695 - Not applied
- https://github.com/kubernetes/kubernetes/blob/v1.27.2/pkg/volume/csi/csi_attacher.go#L598 - Fixed
The other kubelet versions tested (using the default blueprint-sourced AL2 AMIs) have patch versions lower than the PR mentions (e.g. kubelet/v1.24.13 vs 1.24.14).
I can confirm the issue of long re-attachments for the addon v1.19.0-eksbuild.2 from Kubernetes version 1.23 upwards, and I found kind of an easier workaround while impatiently updating a cluster: I deleted the `volume.kubernetes.io/selected-node: [ip...]` annotation from all PVCs prior to updating, and the switch to the new nodes was almost smooth.
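A minimal sketch of that workaround; the namespace is a placeholder, and the trailing dash removes the annotation:
```shell
# Drop the scheduler's node-affinity hint from every PVC in the namespace
# before rolling the nodes:
kubectl -n dev-test annotate pvc --all volume.kubernetes.io/selected-node-
```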
I tested this for k8s versions 1.24 -> 1.25 -> 1.26. The 1.23 -> 1.24 update was my test update, where I discovered this fix while the pods couldn't be rescheduled.
This bug is a real pain and, in my opinion, should be a dealbreaker for everyone who tests EKS thoroughly.
I have the same issue with a gp2 volume in 1.24, 1.25, 1.26. The pod restart process takes from 3 to 20 minutes, even if the node hasn't changed. I had opened a case, and a support engineer from AWS suggested adding `fsGroupChangePolicy: "OnRootMismatch"` to the securityContext of the StatefulSet.
More info -> https://kubernetes.io/blog/2020/12/14/kubernetes-release-1.20-fsgroupchangepolicy-fsgrouppolicy/
This option helped me, and the pod started to run without significant delays, but only if the node hasn't changed.
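A sketch of that suggestion as a patch; the StatefulSet name, namespace, and fsGroup value are placeholders:
```shell
# fsGroupChangePolicy: OnRootMismatch makes the kubelet skip the recursive
# ownership/permission change on the volume when the root already matches,
# which is what otherwise delays pod startup on large volumes.
kubectl -n dev-test patch statefulset bigid-cache -p '
spec:
  template:
    spec:
      securityContext:
        fsGroup: 1001
        fsGroupChangePolicy: "OnRootMismatch"
'
```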
The same problem happens when a node is terminated gracefully by Karpenter; as an AWS support engineer described, it is a race condition related to Karpenter. Issue: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1665
I tried to drain the node manually before deletion and then deleted it. This helped, and there were no re-attaching issues for my pod from the StatefulSet. Probably this could be temporarily fixed by integrating node-termination-handler.
For whoever is still impacted by this behaviour: make sure to update to the latest EBS CSI driver version.