aws-ebs-csi-driver
Node can't access volumeattachments resource
/kind bug
What happened? When a pod with a persistent volume is deleted, the replacement pod fails to attach/mount the storage with the following error:

```
MountVolume.WaitForAttach failed for volume "<pvc_name>" : volume <volume_name> has GET error for volume attachment csi-4b2c0c56...: volumeattachments.storage.k8s.io "csi-4b2c0c56..." is forbidden: User "system:node:<node_name>" cannot get resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope: no relationship found between node "<node_name>" and this object
```
What you expected to happen? The volume should move to the new pod and mount successfully.
How to reproduce it (as minimally and precisely as possible)?
- Create a deployment with a pod that mounts a PVC provisioned by AWS EBS CSI Driver
- Delete the pod
- Describe the new pod and observe the message above. It usually appears immediately after a "Multi-Attach error", which is expected while the original pod is still being deleted. (A minimal repro manifest is sketched below.)
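For reference, a minimal sketch of a manifest that exercises this path, assuming dynamic provisioning through the driver. The names `ebs-sc`, `ebs-claim`, and `ebs-app` are illustrative, not taken from the original report:

```yaml
# Hypothetical repro manifest: names and sizes are assumptions.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ebs-claim
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ebs-sc
  resources:
    requests:
      storage: 4Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ebs-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ebs-app
  template:
    metadata:
      labels:
        app: ebs-app
    spec:
      containers:
        - name: app
          image: busybox
          command: ["sh", "-c", "sleep 3600"]
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: ebs-claim
```

Then delete the running pod (`kubectl delete pod -l app=ebs-app`) and describe its replacement to look for the error.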
Anything else we need to know?:
- This error is intermittent and has been seen with both new volumes and "migrated" ones.
- This error occurs in a cluster with the AWS cloud provider running out-of-tree (which doesn't include volume provisioning logic).
Environment
- Kubernetes version (use kubectl version): 1.15.1
- Driver version: commit 2aed4b5
What's the spec.nodeName of the volumeattachment object "csi-4b2c0c56..." and does it match the actual "<node_name>"? Authorization to get volumeattachments is handled by the node authorizer: https://github.com/kubernetes/kubernetes/pull/58360/.
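For anyone checking this, a sketch of one way to inspect the object (substitute the real attachment name from your error message):

```sh
# The attachment name below is a placeholder from the error; use your own.
kubectl get volumeattachment csi-4b2c0c56... -o jsonpath='{.spec.nodeName}{"\n"}'

# Compare against the node named in the "system:node:<node_name>" error.
kubectl get nodes
```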
In this case, the volumeattachment object doesn't exist. From the digging I did into the node authorizer, it looks like it returns "not authorized" when the attachment isn't found.
Kubernetes is the one that creates/deletes volumeattachments, so there might be something in the logs if you turn the controller-manager logging verbosity up to 4, e.g. "detacher deleted ok VolumeAttachment.ID": https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/csi/csi_attacher.go#L448
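A sketch of how one might surface that log line on a kubeadm-style control plane; the manifest path and pod label below are kubeadm defaults, not details from this thread:

```sh
# Bump kube-controller-manager verbosity by editing its static pod manifest
# (kubeadm default path; adjust for your distribution) and setting --v=4.
sudo vi /etc/kubernetes/manifests/kube-controller-manager.yaml

# Then watch for the attach/detach controller's VolumeAttachment messages.
kubectl -n kube-system logs -l component=kube-controller-manager --tail=-1 \
  | grep -i volumeattachment
```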
Oh, that is good to know. I will take a look at that if I see this error again. Thanks!
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale
I'm seeing this pretty often now.
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close
@fejta-bot: Closing this issue.
In response to this:
> Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Was there ever any solution to this? We ran into it just today.
> I'm seeing this pretty often now.
Also with the latest CSI EBS drivers? We just updated.
Are there any updates? We also see the same problem.
Hey @christianhuening @fzyzcjy @vlerenc, can you provide environment details and specify the driver version being used?
As previously mentioned, authorization to get volumeattachments is handled by the node authorizer, so we'll also need to take a look at the controller logs. The volumeattachment object should be present, and its spec.nodeName should equal the "<node_name>" in the error message logged.
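A quick way to cross-check every attachment against its node at once (a sketch; the column paths are standard VolumeAttachment fields):

```sh
# List all VolumeAttachments with the PV they bind and the node they target.
kubectl get volumeattachments \
  -o custom-columns='NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached'
```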
Is there a workaround for this? Deleting the affected pod didn't help. The volumeattachment object doesn't exist in my case.
> Also with the latest CSI EBS drivers? We just updated.
We are hitting this as well...
@christianhuening - did upgrading fix the issue for you? We are currently running chart v2.6.8 and were thinking of upgrading to 2.10.1 to pick up the fixes from 2.6.9 onward, while avoiding the CSIDriver shuffle introduced in 2.11.0 for now.
We haven't seen the issue since, so I'd say yes.
This does appear to fix the issue for us (so far). We upgraded to chart version 2.12.1 and app version 1.12.1.
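For reference, an upgrade along those lines might look like the following; this is a sketch assuming the chart is installed under the release name `aws-ebs-csi-driver` in `kube-system`, so adjust both to your installation:

```sh
# Release name and namespace are assumptions; the repo URL is the official one.
helm repo add aws-ebs-csi-driver https://kubernetes-sigs.github.io/aws-ebs-csi-driver
helm repo update
helm upgrade --install aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
  --namespace kube-system --version 2.12.1
```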
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/close
This issue has been resolved. If you run into it again, please report it and specify your K8s and driver versions, thanks.
I'm running into this issue with v1.10 of the driver. From reading the messages above it seems like upgrading will likely fix the issue. I don't see anything in the changelog that's obviously related to this issue though. Can anyone point me to the relevant change?