aws-ebs-csi-driver
Failed to AttachVolume, expected device to be attached but was attaching
/kind bug
What happened?
Pods get the following FailedAttachVolume error: AttachVolume.Attach failed for volume "pvc-XXX" : rpc error: code = Internal desc = Could not attach volume "vol-XXX" to node "i-XXX": attachment of disk "vol-XXX" failed, expected device to be attached but was attaching
Error log from ebs-csi-controller:
I1025 11:08:34.489703 1 controller.go:415] "ControllerPublishVolume: attaching" volumeID="vol-XXX" nodeID="i-XXX"
E1025 11:08:49.490233 1 driver.go:124] "GRPC error" err="rpc error: code = Internal desc = Could not attach volume \"vol-XXX\" to node \"i-XXX\": context canceled"
I1025 11:08:49.497484 1 controller.go:415] "ControllerPublishVolume: attaching" volumeID="vol-XXX" nodeID="i-XXX"
E1025 11:08:49.629393 1 driver.go:124] "GRPC error" err="rpc error: code = Internal desc = Could not attach volume \"vol-XXX\" to node \"i-XXX\": attachment of disk \"vol-XXX\" failed, expected device to be attached but was attaching"
What you expected to happen?
EBS Volumes to be attached successfully to a node when a PVC is created.
Anything else we need to know?
I can see that the volume actually is attached to the node. P.S. this doesn't happen with all volumes.
Environment
- Kubernetes version: v1.27.4-eks-2d98532
- Driver version: v1.24.0
Do volumes get stuck in this state indefinitely (or for an abnormally long time)? It's usual to see this message if a volume is taking a long time to attach on EBS's side, but it should eventually fix itself after the volume becomes attached.
In the AWS console I see that the volume is attached to the node, but the controller keeps trying to attach it.
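A quick way to cross-check the two views, assuming kubectl and AWS CLI access and that the redacted IDs are substituted with real ones:

# What Kubernetes believes: VolumeAttachment objects track attach/detach state per PV and node
kubectl get volumeattachments | grep pvc-XXX

# What EBS reports for the volume itself
aws ec2 describe-volumes --volume-ids vol-XXX \
  --query 'Volumes[0].Attachments[].{Instance:InstanceId,State:State,Device:Device}'

If EBS already reports the volume as attached while the controller keeps retrying, the problem is likely on the Kubernetes side (for example, an exhausted attachment limit, as discussed below) rather than the EBS API.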
It seems the problem was hitting the limit of EBS volumes per node. But even after setting volumeAttachLimit, pods are still scheduled onto a node that has already reached its attached-volume limit.
The Kubernetes scheduler is the component responsible for scheduling pods; the CSI driver simply reports the node's attachment limit once during plugin registration.
Once the scheduler incorrectly schedules a pod onto a node with no attachment capacity, user intervention is required in the form of deleting the affected pod so that Kubernetes can retry scheduling it. A minimal sketch of both steps is shown below.
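A hedged sketch of both steps, assuming the driver is installed from the Helm repo under the default release name and that 20 is just a placeholder limit appropriate for the instance type:

# Override the attachment limit the node plugin reports during registration
helm upgrade --install aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
  --namespace kube-system --set node.volumeAttachLimit=20

# If a pod was already scheduled onto a node with no remaining capacity,
# delete it so the scheduler can place it on another node
kubectl delete pod <stuck-pod> -n <namespace>

Since the limit is only reported at plugin registration, changing the value updates the DaemonSet args and rolls the ebs-csi-node pods, which then re-register with the new limit.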
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hello,
Any other workaround besides manual intervention? Can this be considered a feature request for the EBS CSI controller to throw an error, or better yet, to have a way to notify the scheduler to reschedule the pod onto a different worker node?
I want to deploy Postgres on a k8s cluster created with kubeadm and I am getting this issue:
"E0718 13:31:52.030062 1 batcher.go:161] "execute: error executing batch" err="error listing AWS instances: operation error EC2: DescribeInstances, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded" E0718 13:31:52.030114 1 driver.go:107] "GRPC error" err="rpc error: code = Internal desc = Could not attach volume "vol-xxxxxxxxxxxxx" to node "i-xxxxxxxxxxxxx": error listing AWS instances: operation error EC2: DescribeInstances, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded""
I have also created a PV and PVC, which are bound with the help of the EBS CSI driver, and I have deployed the CSI driver using Helm.
"ubuntu@k8smaster:~/application/postgres$ kubectl get pvc -n app NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE postgres-pvc Bound existing-ebs-pv 60Gi RWO ebs-postgres 115m ubuntu@k8smaster:~/application/postgres$ kubectl get pv NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE existing-ebs-pv 60Gi RWO Retain Bound app/postgres-pvc ebs-postgres 115m"
And I have correctly configured the IAM role and attached it to all nodes with this policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:CreateSnapshot",
        "ec2:CreateTags",
        "ec2:CreateVolume",
        "ec2:DeleteSnapshot",
        "ec2:DeleteTags",
        "ec2:DeleteVolume",
        "ec2:DescribeAvailabilityZones",
        "ec2:DescribeInstances",
        "ec2:DescribeSnapshots",
        "ec2:DescribeTags",
        "ec2:DescribeVolumes",
        "ec2:DescribeVolumesModifications",
        "ec2:AttachVolume",
        "ec2:DetachVolume"
      ],
      "Resource": "*"
    }
  ]
}
Can anyone tell me what the issue is?
@divya898 Deploying the driver with SDK debug mode enabled and taking a look at the logs once again will be very helpful for debugging the issue.
@torredil Yes, I have used this command: helm repo add aws-ebs-csi-driver https://kubernetes-sigs.github.io/aws-ebs-csi-driver and installed the CSI driver.
pod/ebs-csi-controller-86957f8467-785bg   5/5   Running   0   5h54m
pod/ebs-csi-controller-86957f8467-vxcmc   5/5   Running   0   5h54m
pod/ebs-csi-node-4jkrz                    3/3   Running   0   5h54m
pod/ebs-csi-node-kk6qw                    3/3   Running   0   5h54m
pod/ebs-csi-node-zrljn                    3/3   Running   0   5h54m
The PVC and PV are bound, but the issue persists: the pod is stuck in the creating state.
Events:
  Type     Reason              Age               From                     Message
  Normal   Scheduled           72s               default-scheduler        Successfully assigned app/postgres-777d8c6765-d2tct to k8sworker2.example.net
  Warning  FailedAttachVolume  4s (x8 over 71s)  attachdetach-controller  AttachVolume.Attach failed for volume "existing-ebs-pv" : volume attachment is being deleted
Logs of the CSI controller pod:
E0718 17:06:24.989148 1 batcher.go:161] "execute: error executing batch" err="error listing AWS instances: operation error EC2: DescribeInstances, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded"
E0718 17:06:24.989220 1 driver.go:107] "GRPC error" err="rpc error: code = Internal desc = Could not detach volume "vol-02a0b78825ece9614" from node "i-0492a4a4bf1856321": error listing AWS instances: operation error EC2: DescribeInstances, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded"
I0718 17:36:25.048006 1 controller.go:464] "ControllerUnpublishVolume: detaching" volumeID="vol-02a0b78825ece9614" nodeID="i-0492a4a4bf1856321"
E0718 17:36:55.551709 1 batcher.go:161] "execute: error executing batch" err="error listing AWS instances: operation error EC2: DescribeInstances, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded"
E0718 17:36:55.551774 1 driver.go:107] "GRPC error" err="rpc error: code = Internal desc = Could not detach volume "vol-02a0b78825ece9614" from node "i-0492a4a4bf1856321": error listing AWS instances: operation error EC2: DescribeInstances, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded"
I0718 17:38:21.934625 1 controller.go:464] "ControllerUnpublishVolume: detaching" volumeID="vol-02a0b78825ece9614" nodeID="i-0492a4a4bf1856321"
E0718 17:38:52.436619 1 batcher.go:161] "execute: error executing batch" err="error listing AWS instances: operation error EC2: DescribeInstances, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded"
E0718 17:38:52.436696 1 driver.go:107] "GRPC error" err="rpc error: code = Internal desc = Could not detach volume "vol-02a0b78825ece9614" from node "i-0492a4a4bf1856321": error listing AWS instances: operation error EC2: DescribeInstances, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded"
ubuntu@k8smaster:~/application/postgres$ k logs ebs-csi-controller-86957f8467-vxcmc --tail=100 -n kube-system
Defaulted container "ebs-plugin" out of: ebs-plugin, csi-provisioner, csi-attacher, csi-resizer, liveness-probe
I0718 11:18:13.486934 1 main.go:157] "Initializing metadata"
I0718 11:18:13.509506 1 metadata.go:48] "Retrieved metadata from IMDS"
I0718 11:18:13.510885 1 driver.go:68] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.32.0"
ubuntu@k8smaster:~/application/postgres$ k logs ebs-csi-node-4jkrz --tail=100 -n kube-system
Defaulted container "ebs-plugin" out of: ebs-plugin, node-driver-registrar, liveness-probe
I0718 11:18:12.593894 1 main.go:157] "Initializing metadata"
I0718 11:18:12.633046 1 metadata.go:48] "Retrieved metadata from IMDS"
I0718 11:18:12.633999 1 driver.go:68] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.32.0"
E0718 11:18:13.644744 1 node.go:856] "Unexpected failure when attempting to remove node taint(s)" err="isAllocatableSet: driver not found on node k8smaster.example.net"
E0718 11:18:14.152630 1 node.go:856] "Unexpected failure when attempting to remove node taint(s)" err="isAllocatableSet: driver not found on node k8smaster.example.net"
E0718 11:18:15.159355 1 node.go:856] "Unexpected failure when attempting to remove node taint(s)" err="isAllocatableSet: driver not found on node k8smaster.example.net"
I0718 11:18:17.167929 1 node.go:936] "CSINode Allocatable value is set" nodeName="k8smaster.example.net" count=38
ubuntu@k8smaster:~/application/postgres$ k logs ebs-csi-node-kk6qw --tail=100 -n kube-system
Defaulted container "ebs-plugin" out of: ebs-plugin, node-driver-registrar, liveness-probe
I0718 11:18:12.840835 1 main.go:157] "Initializing metadata"
I0718 11:18:12.876464 1 metadata.go:48] "Retrieved metadata from IMDS"
I0718 11:18:12.878910 1 driver.go:68] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.32.0"
E0718 11:18:13.905032 1 node.go:856] "Unexpected failure when attempting to remove node taint(s)" err="isAllocatableSet: driver not found on node k8sworker2.example.net"
E0718 11:18:14.413693 1 node.go:856] "Unexpected failure when attempting to remove node taint(s)" err="isAllocatableSet: driver not found on node k8sworker2.example.net"
E0718 11:18:15.426821 1 node.go:856] "Unexpected failure when attempting to remove node taint(s)" err="isAllocatableSet: driver not found on node k8sworker2.example.net"
I0718 11:18:17.433850 1 node.go:936] "CSINode Allocatable value is set" nodeName="k8sworker2.example.net" count=26
ubuntu@k8smaster:~/application/postgres$ k logs ebs-csi-node-zrljn --tail=100 -n kube-system
Defaulted container "ebs-plugin" out of: ebs-plugin, node-driver-registrar, liveness-probe
I0718 11:18:11.396228 1 main.go:157] "Initializing metadata"
I0718 11:18:11.405170 1 metadata.go:48] "Retrieved metadata from IMDS"
I0718 11:18:11.405756 1 driver.go:68] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.32.0"
E0718 11:18:12.418598 1 node.go:856] "Unexpected failure when attempting to remove node taint(s)" err="isAllocatableSet: driver not found on node k8sworker1.example.net"
E0718 11:18:12.926204 1 node.go:856] "Unexpected failure when attempting to remove node taint(s)" err="isAllocatableSet: driver not found on node k8sworker1.example.net"
E0718 11:18:13.932893 1 node.go:856] "Unexpected failure when attempting to remove node taint(s)" err="isAllocatableSet: driver not found on node k8sworker1.example.net"
I0718 11:18:15.939380 1 node.go:936] "CSINode Allocatable value is set" nodeName="k8sworker1.example.net" count=38
Please tell me what I am missing.
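For context on the "volume attachment is being deleted" event above: it typically means the VolumeAttachment object for the PV has a deletion in progress that has not finished. A way to inspect it, assuming kubectl access:

kubectl get volumeattachments
kubectl describe volumeattachment <name-referencing-existing-ebs-pv>

If the object is stuck deleting, its status and finalizers usually hint at why the detach on the EBS side has not completed.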
@divya898 It looks like SDK debug logging has not been enabled; you can enable it via the controller.sdkDebugLog=true parameter. Example:
helm upgrade --install aws-ebs-csi-driver --namespace kube-system ./charts/aws-ebs-csi-driver --values ./charts/aws-ebs-csi-driver/values.yaml --set controller.sdkDebugLog=true
You should then be able to see more detailed information directly in the CSI controller logs that will help troubleshoot why the HTTP request is timing out:
Action=DescribeInstances&InstanceId.1=i-0c73a509f58829453&Version=2016-11-15
SDK 2024/07/18 17:51:31 DEBUG Response
HTTP/1.1 200 OK
Transfer-Encoding: chunked
Cache-Control: no-cache, no-store
Content-Type: text/xml;charset=UTF-8
Date: Thu, 18 Jul 2024 17:51:31 GMT
Server: AmazonEC2
Strict-Transport-Security: max-age=31536000; includeSubDomains
Vary: accept-encoding
X-Amzn-Requestid: 693436ba-xxxx-xxxx-xxxx-a03a27ea5640
...
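If the driver was installed from the Helm repository added earlier (rather than a local chart checkout), the equivalent command would presumably be:

helm upgrade --install aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
  --namespace kube-system --set controller.sdkDebugLog=true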
Hi @torredil, thank you for your time. My issue was solved by restarting the CSI driver pods after updating the Helm repo.
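For anyone hitting the same thing, "restarting the CSI driver pods" typically amounts to something like the following, assuming the default release name and the kube-system namespace:

helm repo update
kubectl rollout restart deployment/ebs-csi-controller -n kube-system
kubectl rollout restart daemonset/ebs-csi-node -n kube-system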
@divya898 Glad to hear 👍
Let us know if you run into any other issues and we'll be happy to help.