aws-ebs-csi-driver
Could not detach/attach volumes to ec2 worker nodes - getting "context deadline exceeded" error - k8s 1.22.6 (self managed)
Could not detach/attach volume from EC2: error listing AWS instances
What happened? Upgraded the cluster from 1.18.8 to 1.22.6. After the worker node refresh, all STS pods are stuck in Pending state since the driver is unable to attach and detach volumes to the EC2 instances.
Events from the consul application pod (STS):
3m52s Normal ExternalProvisioning persistentvolumeclaim/data-cloudops-consul-consul-server-1 waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
116s Normal Provisioning persistentvolumeclaim/data-cloudops-consul-consul-server-1 External provisioner is provisioning volume for claim "cloudops/data-cloudops-consul-consul-server-1"
7m26s Warning ProvisioningFailed persistentvolumeclaim/data-cloudops-consul-consul-server-1 failed to provision volume with StorageClass "cloud-storage": rpc error: code = DeadlineExceeded desc = context deadline exceeded
3m52s Normal ExternalProvisioning persistentvolumeclaim/data-cloudops-consul-consul-server-2 waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
EBS CSI controller pod logs:
Detach
E0722 01:28:51.927386 1 driver.go:119] GRPC error: rpc error: code = Internal desc = Could not detach volume "vol-031a450d06e69a0d1" from node "i-0d9bfab3793000768": error listing AWS instances: "RequestCanceled: request context canceled\ncaused by: context deadline exceeded"
ebs-csi-controller-6b494684d5-c8nwf csi-attacher I0722 01:28:51.927290 1 connection.go:186] GRPC error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
ebs-csi-controller-8c8df9c75-cd7vp csi-provisioner E0722 20:29:05.857797 1 controller.go:981] error syncing claim "87c4264b-1f16-419a-af99-af8237b85a51": failed to get target node: node "ip-10-78-100-44.us-west-2.compute.internal" not found
ebs-csi-controller-8c8df9c75-cd7vp csi-provisioner E0722 20:29:15.858544 1 controller.go:981] error syncing claim "a6993436-624c-470d-8dba-e32227ddd770": failed to provision volume with StorageClass "cloud-storage": rpc error: code = DeadlineExceeded desc = context deadline exceeded
Attach
ebs-csi-controller-8c8df9c75-cd7vp csi-provisioner W0722 20:29:05.857784 1 controller.go:958] Retrying syncing claim "87c4264b-1f16-419a-af99-af8237b85a51", failure 8
ebs-csi-controller-8c8df9c75-cd7vp csi-provisioner E0722 20:29:05.857797 1 controller.go:981] error syncing claim "87c4264b-1f16-419a-af99-af8237b85a51": failed to get target node: node "ip-10-78-100-44.us-west-2.compute.internal" not found
ebs-csi-controller-8c8df9c75-cd7vp csi-provisioner I0722 20:29:05.857856 1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"cloudops", Name:"data-cloudops-consul-consul-server-1", UID:"87c4264b-1f16-419a-af99-af8237b85a51", APIVersion:"v1", ResourceVersion:"270426797", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to get target node: node "ip-10-78-100-44.us-west-2.compute.internal" not found
ebs-csi-controller-8c8df9c75-cd7vp csi-provisioner W0722 20:32:20.449799 1 controller.go:958] Retrying syncing claim "87c4264b-1f16-419a-af99-af8237b85a51", failure 9
ebs-csi-controller-8c8df9c75-cd7vp csi-provisioner E0722 20:32:20.449820 1 controller.go:981] error syncing claim "87c4264b-1f16-419a-af99-af8237b85a51": failed to get target node: node "ip-10-78-100-44.us-west-2.compute.internal" not found
ebs-csi-controller-8c8df9c75-cd7vp csi-provisioner I0722 20:32:20.449840 1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"cloudops", Name:"data-cloudops-consul-consul-server-1", UID:"87c4264b-1f16-419a-af99-af8237b85a51", APIVersion:"v1", ResourceVersion:"270426797", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to get target node: node "ip-10-78-100-44.us-west-2.compute.internal" not found
ebs-csi-controller-8c8df9c75-cd7vp csi-provisioner W0722 20:33:21.858061 1 controller.go:958] Retrying syncing claim "87c4264b-1f16-419a-af99-af8237b85a51", failure 10
ebs-csi-controller-8c8df9c75-cd7vp csi-provisioner E0722 20:33:21.858082 1 controller.go:981] error syncing claim "87c4264b-1f16-419a-af99-af8237b85a51": failed to get target node: node "ip-10-78-100-44.us-west-2.compute.internal" not found
ebs-csi-controller-8c8df9c75-cd7vp csi-provisioner I0722 20:33:21.858143 1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"cloudops", Name:"data-cloudops-consul-consul-server-1", UID:"87c4264b-1f16-419a-af99-af8237b85a51", APIVersion:"v1", ResourceVersion:"270426797", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to get target node: node "ip-10-78-100-44.us-west-2.compute.internal" not found
Interestingly, it's referring to a node, ip-10-78-100-44.us-west-2.compute.internal, which doesn't exist anymore; this node was probably present before the instance refresh.
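To confirm that, something like the following can be used (the node name is taken from the logs above); the first command should return NotFound if the node object is gone, and the second shows whether any stale VolumeAttachment objects still reference it:
kubectl get node ip-10-78-100-44.us-west-2.compute.internal
kubectl get volumeattachment | grep ip-10-78-100-44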
from worker node:
syslog:
Jul 22 05:06:04 ip-10-78-102-43 kubelet[1152]: E0722 05:06:04.587435 1152 nestedpendingoperations.go:335] Operation for "{volumeName:kubernetes.io/csi/ebs.csi.aws.com^vol-03aecee7d16a0e868 podName: nodeName:}" failed. No retries permitted until 2022-07-22 05:07:08.587402313 +0000 UTC m=+139.231499993 (durationBeforeRetry 1m4s). Error: Volume not attached according to node status for volume "pvc-d1360dd8-4648-425a-9da8-4286922fa844" (UniqueName: "kubernetes.io/csi/ebs.csi.aws.com^vol-03aecee7d16a0e868") pod "prometheus-k8s-pods-1" (UID: "a876f8b9-b696-4a12-a9ae-aeae96502684")
kubelet logs look okay
Jul 22 20:16:57 ip-10-78-102-95 kubelet[2800]: I0722 20:16:57.616596 2800 csi_plugin.go:112] kubernetes.io/csi: Register new plugin with name: ebs.csi.aws.com at endpoint: /var/lib/kubelet/plugins/ebs.csi.aws.com/csi.sock
Jul 22 20:17:02 ip-10-78-102-95 kubelet[2800]: I0722 20:17:02.892470 2800 topology_manager.go:200] "Topology Admit Handler"
Policy permissions
"Action": [
"ec2:Describe*",
"ec2:AttachVolume",
"ec2:DetachVolume",
"ec2:CreateRoute",
"ec2:DeleteRoute",
"ec2:ReplaceRoute",
"ec2:ModifyInstanceAttribute",
"ec2:ModifyNetworkInterfaceAttribute",
"ec2:CreateNetworkInterface",
"ec2:AttachNetworkInterface",
"ec2:DeleteNetworkInterface",
"ec2:DetachNetworkInterface",
"ec2:AssignPrivateIpAddresses",
"ec2:UnassignPrivateIpAddresses",
"ec2:CreateSnapshot",
"ec2:CreateSnapshots",
"ec2:DeleteSnapshot",
"ec2:CreateTags",
"tag:GetResources",
"cloudwatch:GetMetricData",
"cloudwatch:GetMetricStatistics",
"cloudwatch:ListMetrics"
],
"Effect": "Allow",
"Resource": [
"*"
],
"Sid": "Ec2Restrictions"
},
What did you expect to happen? The driver should be able to attach and detach volumes.
Anything else we need to know?:
Environment
- Kubernetes version (use kubectl version): 1.22.6
- Driver version: 1.0.0
Resolution: the coredns Service had a label selector that did not match the labels on the coredns Deployment, so the Endpoints object did not register any of the coredns pod IPs even though all the pods were running. As a result, the EBS CSI driver was unable to resolve node names, which caused this issue.
I would expect a better error message that points to the resolution failure, rather than just 'ProvisioningFailed' / failed to get target node, since that message can also indicate an IAM permission issue on the worker node.
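A quick way to check for this kind of mismatch (assuming the default kube-dns Service and coredns Deployment names in kube-system):
# the Service selector must match the labels on the coredns pods
kubectl -n kube-system get svc kube-dns -o jsonpath='{.spec.selector}'
kubectl -n kube-system get deploy coredns -o jsonpath='{.spec.template.metadata.labels}'
# if they don't match, the Endpoints object stays empty even though the pods are Running
kubectl -n kube-system get endpoints kube-dns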
I seem to have the same issue. I tried changing labels but that didn't appear to work. Which labels exactly did you change to resolve it? And what exactly did you change them to? And how do you check if the EP has registered the coredns IPs?
Label: k8s-app=coredns (earlier it was different).
kubectl describe ep kube-dns -n kube-system
should give you the registered coredns IPs.
Changing the label made no difference; I'm still getting similar errors to the above. And when I list the EPs they list correctly (regardless of whether I change the label or not). I'm guessing it is still a DNS error, as it's still a timeout for me. I'm not sure how to analyse the issue further. If I connect to the host or exec into a pod, how can I confirm I can reach the EC2 API? I upgraded one of our clusters to EKS 1.23 (luckily only monitoring on this cluster uses EBS volumes, but they are broken). I may have to recreate a 1.22 cluster and migrate across to it, but I don't want to do that. Everything else works; any suggestions for what I should try next?
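(For checking reachability of the EC2 API from the host or from a pod that has basic network tools, something like this should work, assuming the us-west-2 region from the logs above; the first command tests DNS resolution and the second prints the HTTP status code, where any response at all shows the endpoint is reachable:)
nslookup ec2.us-west-2.amazonaws.com
curl -s -o /dev/null -w '%{http_code}\n' https://ec2.us-west-2.amazonaws.com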
OK, I got it working. I edited the deploy files for the CSI controllers: I set hostNetwork: true and dnsPolicy: ClusterFirstWithHostNet.
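In Deployment terms, the change amounts to setting these two fields on the controller pod spec (a sketch; the exact manifest layout depends on how the driver was installed):
spec:
  template:
    spec:
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet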
I am using a secondary subnet range for the pods, is that why I needed to do this?
@ilanpillemer Glad to hear you were able to get it working. dnsPolicy: ClusterFirstWithHostNet changes the DNS resolver to the cluster's DNS, which is an interesting point to note; thanks for providing your fix. For anyone experiencing this issue, I recommend enabling the --aws-sdk-debug-log=true option to get more insight into what's going on. I've seen this come up a few times, and restarting the coreDNS deployment seems to fix it.
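One way to turn that flag on (a sketch; the ebs-plugin container name and kube-system namespace are assumed from a default installation) is to append it to the controller container's existing args:
kubectl -n kube-system edit deployment ebs-csi-controller
# in the ebs-plugin container spec, add to args:
#   - --aws-sdk-debug-log=true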
Thanks @torredil, restarting the coreDNS deployment did not work for me. The only thing that worked for me was the change to use the cluster's DNS. However, I am using the CNI plugin to allow secondary IP addresses for the pods, as the primary subnet does not have enough IP addresses to support Kubernetes; so I suspect it's somehow the interaction between these two plugins that required me to change that setting, which may be unique to the networking in our AWS accounts.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.