
Could not attach/detach volumes to EC2 worker nodes - getting "context deadline exceeded" error - Kubernetes 1.22.6 (self-managed)


Could not detach/attach volume from EC2: error listing AWS instances

What happened? Upgraded the cluster from 1.18.8 to 1.22.6. After the worker node refresh, all StatefulSet (STS) pods are stuck in Pending because the driver is unable to attach and detach volumes on the EC2 instances.

Events from the Consul application pod (STS):

3m52s       Normal    ExternalProvisioning           persistentvolumeclaim/data-cloudops-consul-consul-server-1   waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
116s        Normal    Provisioning                   persistentvolumeclaim/data-cloudops-consul-consul-server-1   External provisioner is provisioning volume for claim "cloudops/data-cloudops-consul-consul-server-1"
7m26s       Warning   ProvisioningFailed             persistentvolumeclaim/data-cloudops-consul-consul-server-1   failed to provision volume with StorageClass "cloud-storage": rpc error: code = DeadlineExceeded desc = context deadline exceeded
3m52s       Normal    ExternalProvisioning           persistentvolumeclaim/data-cloudops-consul-consul-server-2   waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator

EBS CSI controller pod logs:

Detach

E0722 01:28:51.927386       1 driver.go:119] GRPC error: rpc error: code = Internal desc = Could not detach volume "vol-031a450d06e69a0d1" from node "i-0d9bfab3793000768": error listing AWS instances: "RequestCanceled: request context canceled\ncaused by: context deadline exceeded"
ebs-csi-controller-6b494684d5-c8nwf csi-attacher I0722 01:28:51.927290       1 connection.go:186] GRPC error: rpc error: code = DeadlineExceeded desc = context deadline exceeded

ebs-csi-controller-8c8df9c75-cd7vp csi-provisioner E0722 20:29:05.857797       1 controller.go:981] error syncing claim "87c4264b-1f16-419a-af99-af8237b85a51": failed to get target node: node "ip-10-78-100-44.us-west-2.compute.internal" not found
ebs-csi-controller-8c8df9c75-cd7vp csi-provisioner E0722 20:29:15.858544       1 controller.go:981] error syncing claim "a6993436-624c-470d-8dba-e32227ddd770": failed to provision volume with StorageClass "cloud-storage": rpc error: code = DeadlineExceeded desc = context deadline exceeded

Attach

ebs-csi-controller-8c8df9c75-cd7vp csi-attacherebs-csi-controller-8c8df9c75-cd7vp csi-provisioner W0722 20:29:05.857784       1 controller.go:958] Retrying syncing claim "87c4264b-1f16-419a-af99-af8237b85a51", failure 8
ebs-csi-controller-8c8df9c75-cd7vp csi-attacher ebs-csi-controller-8c8df9c75-cd7vp csi-provisioner E0722 20:29:05.857797       1 controller.go:981] error syncing claim "87c4264b-1f16-419a-af99-af8237b85a51": failed to get target node: node "ip-10-78-100-44.us-west-2.compute.internal" not found
ebs-csi-controller-8c8df9c75-cd7vp csi-provisioner I0722 20:29:05.857856       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"cloudops", Name:"data-cloudops-consul-consul-server-1", UID:"87c4264b-1f16-419a-af99-af8237b85a51", APIVersion:"v1", ResourceVersion:"270426797", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to get target node: node "ip-10-78-100-44.us-west-2.compute.internal" not found
ebs-csi-controller-8c8df9c75-cd7vp csi-provisioner W0722 20:32:20.449799       1 controller.go:958] Retrying syncing claim "87c4264b-1f16-419a-af99-af8237b85a51", failure 9
ebs-csi-controller-8c8df9c75-cd7vp csi-provisioner E0722 20:32:20.449820       1 controller.go:981] error syncing claim "87c4264b-1f16-419a-af99-af8237b85a51": failed to get target node: node "ip-10-78-100-44.us-west-2.compute.internal" not found
ebs-csi-controller-8c8df9c75-cd7vp csi-provisioner I0722 20:32:20.449840       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"cloudops", Name:"data-cloudops-consul-consul-server-1", UID:"87c4264b-1f16-419a-af99-af8237b85a51", APIVersion:"v1", ResourceVersion:"270426797", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to get target node: node "ip-10-78-100-44.us-west-2.compute.internal" not found
ebs-csi-controller-8c8df9c75-cd7vp csi-provisioner W0722 20:33:21.858061       1 controller.go:958] Retrying syncing claim "87c4264b-1f16-419a-af99-af8237b85a51", failure 10
ebs-csi-controller-8c8df9c75-cd7vp csi-provisioner E0722 20:33:21.858082       1 controller.go:981] error syncing claim "87c4264b-1f16-419a-af99-af8237b85a51": failed to get target node: node "ip-10-78-100-44.us-west-2.compute.internal" not found
ebs-csi-controller-8c8df9c75-cd7vp csi-provisioner I0722 20:33:21.858143       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"cloudops", Name:"data-cloudops-consul-consul-server-1", UID:"87c4264b-1f16-419a-af99-af8237b85a51", APIVersion:"v1", ResourceVersion:"270426797", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to get target node: node "ip-10-78-100-44.us-west-2.compute.internal" not found

Interestingly, it's referring to a node, ip-10-78-100-44.us-west-2.compute.internal, which doesn't exist anymore; this node was probably present before the instance refresh.
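
As a sanity check, something like the following should show whether any stale objects still reference the old node or the stuck volume (a rough sketch; the node and volume IDs are the ones from the logs above):

# list the current Node and CSINode objects; the old node should no longer appear
kubectl get nodes
kubectl get csinodes

# look for VolumeAttachment objects still pointing at the old node or the volume
kubectl get volumeattachments -o wide | grep -e ip-10-78-100-44 -e vol-031a450d06e69a0d1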

From the worker node:

syslog:

Jul 22 05:06:04 ip-10-78-102-43 kubelet[1152]: E0722 05:06:04.587435    1152 nestedpendingoperations.go:335] Operation for "{volumeName:kubernetes.io/csi/ebs.csi.aws.com^vol-03aecee7d16a0e868 podName: nodeName:}" failed. No retries permitted until 2022-07-22 05:07:08.587402313 +0000 UTC m=+139.231499993 (durationBeforeRetry 1m4s). Error: Volume not attached according to node status for volume "pvc-d1360dd8-4648-425a-9da8-4286922fa844" (UniqueName: "kubernetes.io/csi/ebs.csi.aws.com^vol-03aecee7d16a0e868") pod "prometheus-k8s-pods-1" (UID: "a876f8b9-b696-4a12-a9ae-aeae96502684")

Kubelet logs look okay:

Jul 22 20:16:57 ip-10-78-102-95 kubelet[2800]: I0722 20:16:57.616596    2800 csi_plugin.go:112] kubernetes.io/csi: Register new plugin with name: ebs.csi.aws.com at endpoint: /var/lib/kubelet/plugins/ebs.csi.aws.com/csi.sock
Jul 22 20:17:02 ip-10-78-102-95 kubelet[2800]: I0722 20:17:02.892470    2800 topology_manager.go:200] "Topology Admit Handler"

Policy permissions

            "Action": [
                "ec2:Describe*",
                "ec2:AttachVolume",
                "ec2:DetachVolume",
                "ec2:CreateRoute",
                "ec2:DeleteRoute",
                "ec2:ReplaceRoute",
                "ec2:ModifyInstanceAttribute",
                "ec2:ModifyNetworkInterfaceAttribute",
                "ec2:CreateNetworkInterface",
                "ec2:AttachNetworkInterface",
                "ec2:DeleteNetworkInterface",
                "ec2:DetachNetworkInterface",
                "ec2:AssignPrivateIpAddresses",
                "ec2:UnassignPrivateIpAddresses",
                "ec2:CreateSnapshot",
                "ec2:CreateSnapshots",
                "ec2:DeleteSnapshot",
                "ec2:CreateTags",
                "tag:GetResources",
                "cloudwatch:GetMetricData",
                "cloudwatch:GetMetricStatistics",
                "cloudwatch:ListMetrics"
            ],
            "Effect": "Allow",
            "Resource": [
                "*"
            ],
            "Sid": "Ec2Restrictions"
        },
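
For reference, the "error listing AWS instances" call maps to ec2:DescribeInstances, which the ec2:Describe* action above already allows. A rough way to double-check credentials and connectivity from a worker node (assuming the AWS CLI is installed there and uses the node's instance profile):

# should return instance details rather than an authorization error
aws ec2 describe-instances --instance-ids i-0d9bfab3793000768 --region us-west-2

# short timeouts surface network or DNS problems instead of hanging
aws ec2 describe-instances --region us-west-2 --cli-connect-timeout 10 --cli-read-timeout 10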

What you expected to happen? The driver should be able to attach and detach volumes.

Anything else we need to know?:

Environment

  • Kubernetes version (use kubectl version): 1.22.6
  • Driver version: 1.0.0

shomeprasanjit · Jul 22 '22 20:07

Resolution: the coredns Service had a label selector that didn't match the coredns Deployment's pod labels, so the Endpoints object didn't register any of the coredns pod IPs even though all the pods were running. Because of that, the EBS CSI driver was unable to resolve node names, which caused this issue.
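
For anyone who wants to verify this on their own cluster, a rough check looks like the following (assuming the default kube-dns Service and coredns Deployment names in kube-system):

# selector used by the Service that fronts CoreDNS
kubectl -n kube-system get svc kube-dns -o jsonpath='{.spec.selector}'

# labels on the CoreDNS pod template; these must match the selector above
kubectl -n kube-system get deployment coredns -o jsonpath='{.spec.template.metadata.labels}'

# if they match, the Endpoints object lists the CoreDNS pod IPs
kubectl -n kube-system get endpoints kube-dns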

I would expect a better error message that points to a name-resolution failure, rather than just 'ProvisioningFailed: failed to get target node', since that message can also indicate an IAM permission issue on the worker nodes.

prasanjitshome · Jul 25 '22 07:07

I seem to have the same issue. I tried changing labels, but that didn't appear to work. Which labels exactly did you change to resolve this, and what did you change them to? And how do you check whether the Endpoints object has registered the coredns IPs?

ilanpillemer · Aug 16 '22 06:08

Label: k8s-app=coredns (earlier it was different).

kubectl describe ep kube-dns -n kube-system should show you the registered coredns IPs.

prasanjitshome · Aug 16 '22 18:08

Changing the label made no difference; I'm still getting errors similar to the above. When I list the Endpoints they look correct (whether or not I change the label). I'm guessing it's still a DNS error, since it's still a timeout for me, but I'm not sure how to analyze the issue further. If I connect to the host or exec into a pod, how can I confirm that I can reach the EC2 API? I upgraded one of our clusters to EKS 1.23 (luckily only monitoring on this cluster uses EBS volumes, but they are broken). I may have to recreate a 1.22 cluster and migrate across to it, but I don't want to do that. Everything else works; any suggestions for what I should try next?
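
One rough way to check that from inside the cluster (the endpoint below assumes us-west-2, and the throwaway debug pods and image names are only examples):

# check that the EC2 endpoint resolves through the cluster's DNS
kubectl run dns-check --rm -it --restart=Never --image=busybox:1.36 -- nslookup ec2.us-west-2.amazonaws.com

# check HTTPS reachability; getting any HTTP status back means DNS, TCP and TLS all work
kubectl run curl-check --rm -it --restart=Never --image=curlimages/curl -- -m 10 -sS -o /dev/null -w '%{http_code}\n' https://ec2.us-west-2.amazonaws.com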

ilanpillemer · Aug 16 '22 22:08

OK, I got it working. I edited the deployment manifests for the CSI controllers: I set hostNetwork: true and dnsPolicy: ClusterFirstWithHostNet.
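
For anyone wanting to reproduce that, the equivalent patch looks roughly like this (assuming the controller Deployment is named ebs-csi-controller in kube-system):

# run the controller pods on the host network while still resolving through cluster DNS
kubectl -n kube-system patch deployment ebs-csi-controller --type merge -p \
  '{"spec":{"template":{"spec":{"hostNetwork":true,"dnsPolicy":"ClusterFirstWithHostNet"}}}}'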

I am using a secondary subnet range for the pods; is that why I needed to do this?

ilanpillemer · Aug 18 '22 22:08

@ilanpillemer Glad to hear you were able to get it working. dnsPolicy: ClusterFirstWithHostNet changes the DNS resolver to the cluster's DNS, which is an interesting point to note; thanks for providing your fix. For anyone experiencing this issue, I recommend enabling the --aws-sdk-debug-log=true option to get more insight into what's going on. I've seen this come up a few times, and restarting the coreDNS deployment seems to fix it.
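
Roughly, that looks like the following (the flag is the one mentioned above; the Deployment and container names are the usual defaults and may differ on your install):

# add the SDK debug flag to the ebs-plugin container's args in the controller Deployment
kubectl -n kube-system edit deployment ebs-csi-controller
#   containers:
#   - name: ebs-plugin
#     args:
#     - --aws-sdk-debug-log=true    # added alongside the existing args

# restart CoreDNS so its pods and endpoints are recreated
kubectl -n kube-system rollout restart deployment coredns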

torredil · Aug 19 '22 15:08

Thanks @torredil, restarting the coreDNS deployment did not work for me. The only thing that worked for me was the change to use the cluster's DNS. However, I am using the CNI plugin to allow secondary IP addresses for the pods, as the primary subnet does not have enough IP addresses to support Kubernetes; so I suspect it's somehow the interaction between these two plugins that required me to change that setting, which may be unique to the networking in our AWS accounts.

ilanpillemer · Aug 19 '22 16:08

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Nov 17 '22 16:11

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · Dec 17 '22 17:12

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot · Jan 16 '23 18:01

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · Jan 16 '23 18:01