aws-ebs-csi-driver Failed to AttachVolume, expected device to be attached but was attaching

/kind bug

What happened?

Getting Pod error FailedAttachVolume: AttachVolume.Attach failed for volume "pvc-XXX" : rpc error: code = Internal desc = Could not attach volume "vol-XXX" to node "i-XXX": attachment of disk "vol-XX" failed, expected device to be attached but was attaching

Error log from ebs-csi-controller:

I1025 11:08:34.489703       1 controller.go:415] "ControllerPublishVolume: attaching" volumeID="vol-XXX" nodeID="i-XXX"
E1025 11:08:49.490233       1 driver.go:124] "GRPC error" err="rpc error: code = Internal desc = Could not attach volume \"vol-XXX\" to node \"i-XXX\": context canceled"
I1025 11:08:49.497484       1 controller.go:415] "ControllerPublishVolume: attaching" volumeID="vol-XXX" nodeID="i-XXX"
E1025 11:08:49.629393       1 driver.go:124] "GRPC error" err="rpc error: code = Internal desc = Could not attach volume \"vol-XXX\" to node \"i-XXX\": attachment of disk \"vol-XXX\" failed, expected device to be attached but was attaching"

What you expected to happen?

EBS Volumes to be attached successfully to a node when a PVC is created.

Anything else we need to know?:

I see that actually the volume is attached to Node. p.s. it doesn't happen with all volumes

Environment

Kubernetes version: v1.27.4-eks-2d98532
Driver version: v1.24.0

Oct 25 '23 11:10 adovy

Do volumes get stuck in this state indefinitely (or for an abnormally long time)? It's usual to see this message if a volume is taking a long time to attach on EBS's side, but it should eventually fix itself after the volume becomes attached.

Oct 30 '23 14:10 ConnorJC3

on AWS console I see that volume is attached to the node, but the controller keeps trying to attach it

Oct 30 '23 15:10 adovy

seems like the problem was with reaching limit of EBS Volumes per Node. But after setting volumeAttachLimit, the pod is still scheduled on the node which already reached the attached volumes limit.

Nov 15 '23 14:11 adovy

The Kubernetes scheduler is the component responsible for scheduling pods, the CSI driver simply reports the attachment limit on the node once during plugin registration.

Once the scheduler incorrectly schedules a pod on a node with no attachment capacity, user intervention is required in the form of deleting the affected pod so that K8s may retry again.

Nov 15 '23 16:11 torredil

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Feb 13 '24 16:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Mar 14 '24 16:03 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Apr 13 '24 17:04 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied

After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied

After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen

Mark this issue as fresh with /remove-lifecycle rotten

Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Apr 13 '24 17:04 k8s-ci-robot

Hello,

Any other workaround besides a manual intervention? Can this be consider as a feature request to have EBS CSI Controller to throw an error or better yet have a way to notify the scheduler to re-schedule the pod onto a different worker node?

May 07 '24 05:05 aws-patrickc

I want to deploy postgres on k8s cluster made by Kubeadm and getting the issue

"E0718 13:31:52.030062 1 batcher.go:161] "execute: error executing batch" err="error listing AWS instances: operation error EC2: DescribeInstances, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded" E0718 13:31:52.030114 1 driver.go:107] "GRPC error" err="rpc error: code = Internal desc = Could not attach volume "vol-xxxxxxxxxxxxx" to node "i-xxxxxxxxxxxxx": error listing AWS instances: operation error EC2: DescribeInstances, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded""

And also I have created pv and pvc which is bounded with the help of ebs-csi-driver and I have deployed csi friver using helm.

"ubuntu@k8smaster:~/application/postgres$ kubectl get pvc -n app NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE postgres-pvc Bound existing-ebs-pv 60Gi RWO ebs-postgres 115m ubuntu@k8smaster:~/application/postgres$ kubectl get pv NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE existing-ebs-pv 60Gi RWO Retain Bound app/postgres-pvc ebs-postgres 115m"

AND

I have correctly configured the iam role and attach to all nodes with this policy " { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "ec2:CreateSnapshot", "ec2:CreateTags", "ec2:CreateVolume", "ec2:DeleteSnapshot", "ec2:DeleteTags", "ec2:DeleteVolume", "ec2:DescribeAvailabilityZones", "ec2:DescribeInstances", "ec2:DescribeSnapshots", "ec2:DescribeTags", "ec2:DescribeVolumes", "ec2:DescribeVolumesModifications", "ec2:AttachVolume", "ec2:DetachVolume" ], "Resource": "*" } ] } "

can any one tell what is the issue ?

Jul 18 '24 15:07 divya898

@divya898 Deploying the driver with SDK debug mode enabled and taking a look at the logs once again will be very helpful for debugging the issue.

Jul 18 '24 15:07 torredil

@torredil Ya I have used this command helm repo add aws-ebs-csi-driver https://kubernetes-sigs.github.io/aws-ebs-csi-driver and install the csi

pod/ebs-csi-controller-86957f8467-785bg 5/5 Running 0 5h54m pod/ebs-csi-controller-86957f8467-vxcmc 5/5 Running 0 5h54m pod/ebs-csi-node-4jkrz 3/3 Running 0 5h54m pod/ebs-csi-node-kk6qw 3/3 Running 0 5h54m pod/ebs-csi-node-zrljn 3/3 Running 0 5h54m

and I can able to bound with pvc and pv But the issue persist pod are in creating state

Events: Type Reason Age From Message

Normal Scheduled 72s default-scheduler Successfully assigned app/postgres-777d8c6765-d2tct to k8sworker2.example.net Warning FailedAttachVolume 4s (x8 over 71s) attachdetach-controller AttachVolume.Attach failed for volume "existing-ebs-pv" : volume attachment is being deleted

Logs of CSI

controller pod :-

E0718 17:06:24.989148 1 batcher.go:161] "execute: error executing batch" err="error listing AWS instances: operation error EC2: DescribeInstances, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded" E0718 17:06:24.989220 1 driver.go:107] "GRPC error" err="rpc error: code = Internal desc = Could not detach volume "vol-02a0b78825ece9614" from node "i-0492a4a4bf1856321": error listing AWS instances: operation error EC2: DescribeInstances, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded" I0718 17:36:25.048006 1 controller.go:464] "ControllerUnpublishVolume: detaching" volumeID="vol-02a0b78825ece9614" nodeID="i-0492a4a4bf1856321" E0718 17:36:55.551709 1 batcher.go:161] "execute: error executing batch" err="error listing AWS instances: operation error EC2: DescribeInstances, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded" E0718 17:36:55.551774 1 driver.go:107] "GRPC error" err="rpc error: code = Internal desc = Could not detach volume "vol-02a0b78825ece9614" from node "i-0492a4a4bf1856321": error listing AWS instances: operation error EC2: DescribeInstances, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded" I0718 17:38:21.934625 1 controller.go:464] "ControllerUnpublishVolume: detaching" volumeID="vol-02a0b78825ece9614" nodeID="i-0492a4a4bf1856321" E0718 17:38:52.436619 1 batcher.go:161] "execute: error executing batch" err="error listing AWS instances: operation error EC2: DescribeInstances, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded" E0718 17:38:52.436696 1 driver.go:107] "GRPC error" err="rpc error: code = Internal desc = Could not detach volume "vol-02a0b78825ece9614" from node "i-0492a4a4bf1856321": error listing AWS instances: operation error EC2: DescribeInstances, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded"

ubuntu@k8smaster:~/application/postgres$ k logs ebs-csi-controller-86957f8467-vxcmc --tail=100 -n kube-system Defaulted container "ebs-plugin" out of: ebs-plugin, csi-provisioner, csi-attacher, csi-resizer, liveness-probe I0718 11:18:13.486934 1 main.go:157] "Initializing metadata" I0718 11:18:13.509506 1 metadata.go:48] "Retrieved metadata from IMDS" I0718 11:18:13.510885 1 driver.go:68] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.32.0" ubuntu@k8smaster:~/application/postgres$

ubuntu@k8smaster:~/application/postgres$ k logs ebs-csi-node-4jkrz --tail=100 -n kube-system Defaulted container "ebs-plugin" out of: ebs-plugin, node-driver-registrar, liveness-probe I0718 11:18:12.593894 1 main.go:157] "Initializing metadata" I0718 11:18:12.633046 1 metadata.go:48] "Retrieved metadata from IMDS" I0718 11:18:12.633999 1 driver.go:68] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.32.0" E0718 11:18:13.644744 1 node.go:856] "Unexpected failure when attempting to remove node taint(s)" err="isAllocatableSet: driver not found on node k8smaster.example.net" E0718 11:18:14.152630 1 node.go:856] "Unexpected failure when attempting to remove node taint(s)" err="isAllocatableSet: driver not found on node k8smaster.example.net" E0718 11:18:15.159355 1 node.go:856] "Unexpected failure when attempting to remove node taint(s)" err="isAllocatableSet: driver not found on node k8smaster.example.net" I0718 11:18:17.167929 1 node.go:936] "CSINode Allocatable value is set" nodeName="k8smaster.example.net" count=38

ubuntu@k8smaster:~/application/postgres$ k logs ebs-csi-node-kk6qw --tail=100 -n kube-system Defaulted container "ebs-plugin" out of: ebs-plugin, node-driver-registrar, liveness-probe I0718 11:18:12.840835 1 main.go:157] "Initializing metadata" I0718 11:18:12.876464 1 metadata.go:48] "Retrieved metadata from IMDS" I0718 11:18:12.878910 1 driver.go:68] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.32.0" E0718 11:18:13.905032 1 node.go:856] "Unexpected failure when attempting to remove node taint(s)" err="isAllocatableSet: driver not found on node k8sworker2.example.net" E0718 11:18:14.413693 1 node.go:856] "Unexpected failure when attempting to remove node taint(s)" err="isAllocatableSet: driver not found on node k8sworker2.example.net" E0718 11:18:15.426821 1 node.go:856] "Unexpected failure when attempting to remove node taint(s)" err="isAllocatableSet: driver not found on node k8sworker2.example.net" I0718 11:18:17.433850 1 node.go:936] "CSINode Allocatable value is set" nodeName="k8sworker2.example.net" count=26

ubuntu@k8smaster:~/application/postgres$ k logs ebs-csi-node-zrljn --tail=100 -n kube-system Defaulted container "ebs-plugin" out of: ebs-plugin, node-driver-registrar, liveness-probe I0718 11:18:11.396228 1 main.go:157] "Initializing metadata" I0718 11:18:11.405170 1 metadata.go:48] "Retrieved metadata from IMDS" I0718 11:18:11.405756 1 driver.go:68] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.32.0" E0718 11:18:12.418598 1 node.go:856] "Unexpected failure when attempting to remove node taint(s)" err="isAllocatableSet: driver not found on node k8sworker1.example.net" E0718 11:18:12.926204 1 node.go:856] "Unexpected failure when attempting to remove node taint(s)" err="isAllocatableSet: driver not found on node k8sworker1.example.net" E0718 11:18:13.932893 1 node.go:856] "Unexpected failure when attempting to remove node taint(s)" err="isAllocatableSet: driver not found on node k8sworker1.example.net" I0718 11:18:15.939380 1 node.go:936] "CSINode Allocatable value is set" nodeName="k8sworker1.example.net" count=38

Please tell what i am missing ?

Jul 18 '24 17:07 divya898

@divya898 It looks like SDK debug logging has not been enabled, you can do so via the controller.sdkDebugLog=true parameter. Example:

helm upgrade --install aws-ebs-csi-driver --namespace kube-system ./charts/aws-ebs-csi-driver --values ./charts/aws-ebs-csi-driver/values.yaml --set controller.sdkDebugLog=true

you should be able to see more detailed information that will help troubleshoot why the HTTP request is timing out directly in the CSI controller logs:

Action=DescribeInstances&InstanceId.1=i-0c73a509f58829453&Version=2016-11-15
SDK 2024/07/18 17:51:31 DEBUG Response
HTTP/1.1 200 OK
Transfer-Encoding: chunked
Cache-Control: no-cache, no-store
Content-Type: text/xml;charset=UTF-8
Date: Thu, 18 Jul 2024 17:51:31 GMT
Server: AmazonEC2
Strict-Transport-Security: max-age=31536000; includeSubDomains
Vary: accept-encoding
X-Amzn-Requestid: 693436ba-xxxx-xxxx-xxxx-a03a27ea5640
...

Jul 18 '24 17:07 torredil

Hi @torredil thankyou for your time I can see my issue is solved by restarting the csi-driver pod by updating the helm repo .

Jul 18 '24 18:07 divya898

@divya898 Glad to hear 👍

Let us know if you run into any other issues and we'll be happy to help.

Jul 18 '24 18:07 torredil

aws-ebs-csi-driver aws-ebs-csi-driver copied to clipboard

Failed to AttachVolume, expected device to be attached but was attaching

aws-ebs-csi-driver
aws-ebs-csi-driver copied to clipboard