aws-ebs-csi-driver
instance volume limits: workloads no longer attach EBS volumes
/kind bug
What happened? Workloads stop attaching EBS volumes because the instance volume limit is reached; the expected number of replicas for our workload isn't met and pods are stuck in a Pending state.
Nodes have the appropriate limit of 25 set, but the scheduler sends more than 25 pods with volumes to a node.
kubelet Unable to attach or mount volumes: unmounted volumes=[test-volume], unattached volumes=[kube-api-access-redact test-volume]: timed out waiting for the condition
attachdetach-controller AttachVolume.Attach failed for volume "pvc-redact" : rpc error: code = Internal desc = Could not attach volume "vol-redact" to node "i-redact": attachment of disk "vol-redact" failed, expected device to be attached but was attaching
ebs-csi-controller driver.go:119] GRPC error: rpc error: code = Internal desc = Could not attach volume "vol-redact" to node "i-redact": attachment of disk "vol-redact" failed, expected device to be attached but was attaching
How to reproduce it (as minimally and precisely as possible)?
Deploying the test below should be sufficient to simulate the problem:
apiVersion: v1
kind: Namespace
metadata:
  name: vols
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: vols
  name: vols-pv-test
spec:
  selector:
    matchLabels:
      app: vols-pv-test
  serviceName: "vols-pv-tester"
  replicas: 60
  template:
    metadata:
      labels:
        app: vols-pv-test
    spec:
      containers:
        - name: nginx
          image: k8s.gcr.io/nginx-slim:0.8
          volumeMounts:
            - name: test-volume
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
    - metadata:
        name: test-volume
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: "******" # something with a reclaim policy of delete
        resources:
          requests:
            storage: 1Gi
Update: adding a liveness probe with an initial delay of 60 seconds seems to work around the problem; our nodes scale, and the replica count is reached with volumes attached.
apiVersion: v1
kind: Namespace
metadata:
  name: vols
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: vols
  name: vols-pv-test
spec:
  selector:
    matchLabels:
      app: vols-pv-test
  serviceName: "vols-pv-tester"
  replicas: 60
  template:
    metadata:
      labels:
        app: vols-pv-test
    spec:
      containers:
        - name: nginx
          image: k8s.gcr.io/nginx-slim:0.8
          ports:
            - containerPort: 80
              name: web
          livenessProbe:
            tcpSocket:
              port: 80
            initialDelaySeconds: 60
            periodSeconds: 10
          volumeMounts:
            - name: test-volume
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
    - metadata:
        name: test-volume
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: "******" # something with a reclaim policy of delete
        resources:
          requests:
            storage: 1Gi
Environment
- Kubernetes version: Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.5-eks-bc4871b", GitCommit:"5236faf39f1b7a7dabea8df12726f25608131aa9", GitTreeState:"clean", BuildDate:"2021-10-29T23:32:16Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
- Driver version: Helm chart v2.6.2, driver v1.5.0
Hello,
I believe we are seeing this too.
Warning FailedAttachVolume 34s (x11 over 8m47s) attachdetach-controller AttachVolume.Attach failed for volume "pvc-61b4bf2c-541f-4ef1-9f21-redacted" : rpc error: code = Internal desc = Could not attach volume "vol-redacted" to node "i-redacted": attachment of disk "vol-redacted" failed, expected device to be attached but was attaching
...with:
# /bin/kubelet --version
Kubernetes v1.20.11-eks-f17b81
and:
k8s.gcr.io/provider-aws/aws-ebs-csi-driver:v1.4.0
Thanks,
Phil.
I suspect this is a race condition somewhere, my current thoughts are that it's the scheduler but I haven't had a chance to look at it further.
Is the CSI driver running with correctly defined limits? What does the CSINode object from the node report?
@gnufied the CSI driver looks to be doing everything correctly; AFAIK the only thing it needs to do is report the maximum number of PV attachments it can make. As reported above, if you add latency between pod scheduling events, pods are only sent to nodes with space for the PV mounts, which is why I suspect it's a scheduler issue.
@stevehipwell No, that shouldn't happen. We start counting volumes against the limit before pods are even started on the node. I am still waiting on the output of the CSINode object from the problematic node.
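For anyone following along, the value the driver registers can be checked with something like the commands below (the node name is a placeholder):
# Print the allocatable volume count reported by the EBS CSI driver for a node
kubectl get csinode <node-name> -o jsonpath='{.spec.drivers[?(@.name=="ebs.csi.aws.com")].allocatable.count}'
# Or dump the whole object
kubectl describe csinode <node-name>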
@gnufied I agree that the CSI driver is reporting correctly, which, combined with the 60s wait fixing the issue, makes me believe this is actually a race condition happening elsewhere.
I am seeing the same problem in my environment:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m5s default-scheduler Successfully assigned 2d3b9d81e0b0/master-0 to ip-10-3-109-222.ec2.internal
Warning FailedMount 2s kubelet, ip-10-3-109-222.ec2.internal Unable to attach or mount volumes: unmounted volumes=[master], unattached volumes=[master backups filebeat-configuration default-token-vjscf]: timed out waiting for the condition
I haven't been able to confirm whether they are related, but I see this happening on the same node that has ENIs stuck in the "attaching" state, and I see the following errors in /var/log/aws-routed-eni/ipamd.log:
{"level":"error","ts":"2022-02-01T19:45:34.410Z","caller":"ipamd/ipamd.go:805","msg":"Failed to increase pool size due to not able to allocate ENI AllocENI: error attaching ENI: attachENI: failed to attach ENI:AttachmentLimitExceeded: Interface count 9 exceeds the limit for c5.4xlarge\n\tstatus code: 400, request id: 836ce9b1-ec63-4935-a007-739e32f506cb"}
For reference, the c5.4xlarge instance type in AWS supports 8 ENIs.
We are testing whether we can limit it using the volume-attach-limit argument, which is not set yet, but I would first like to understand why this happens and whether there is a way to avoid hardcoding that value.
Environment
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:14:22Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.11-eks-f17b81", GitCommit:"f17b810c9e5a82200d28b6210b458497ddfcf31b", GitTreeState:"clean", BuildDate:"2021-10-15T21:46:21Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
[user@admin [~] ~]$ kubectl -n kube-system get deployment ebs-csi-controller -o wide -o yaml | grep "image: k8s.gcr.io/provider-aws/aws-ebs-csi-driver"
image: k8s.gcr.io/provider-aws/aws-ebs-csi-driver:v1.0.0
[user@admin [~] ~]$
[root@i-6666919f09cc78046 ~]# /usr/bin/kubelet --version
Kubernetes v1.19.15-eks-9c63c4
[root@i-6666919f09cc78046 ~]#
Hello,
... update from our side:
Our first simple workaround when we first observed the problem yesterday (it might help others who are stuck and looking for a 'quick fix'):
- Cordon the node that the pod is stuck in 'Init ...' on.
- Delete the pod.
- Verify that the pod starts successfully on an alternative node. If not, repeat the cordoning until the pod is successfully deployed.
- Uncordon the node(s) upon successful deployment.
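A rough shell sketch of the same workaround (node and pod names are placeholders):
# Cordon the node the pod is stuck on and delete the pod so it reschedules elsewhere
kubectl cordon <stuck-node>
kubectl delete pod <stuck-pod> -n <namespace>
# Verify the pod comes up on another node; if it lands on another full node, repeat the cordoning
kubectl get pod <stuck-pod> -n <namespace> -o wide
# Once everything is running, uncordon the node(s)
kubectl uncordon <stuck-node>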
Then, following a dive into the EBS CSI driver code, we passed the option '--volume-attach-limit=50' to the node driver. I haven't explicitly tested this yet, however.
The problem to me seems to be a missing feedback loop between the 'node driver' and the scheduler.
The scheduler says, "Hey, there's a node that satisfies my scheduling criteria ... I'll schedule the workload to run there ..." and the node driver says, "OK, I have a workload but I've reached this '25 attached volumes' limit so I'm done here ...".
This is just my perhaps primitive view of the situation.
Thanks,
Phil.
PS: following a redeployment of the EBS CSI node driver we are still seeing the attribute 'attachable-volumes-aws-ebs' set to 25 on a 'describe node'; we weren't expecting this.
@ryanpxyz looking at the code, I think the CSI driver just reports how many attachments it can make. Until the PR to make this dynamic is merged and released, this is a fixed value determined by instance type or by argument. This means there are two related but distinct issues.
The first is the incorrect max value, which doesn't take into account all Nitro instances and their other attachments. For example, a Nitro instance (5 series only) with no argument will have a limit of 25, which is correct only as long as you have no more than 3 extra attachments. If you're using custom networking and prefixes, this means instances without an additional NVMe drive work but ones with one get stuck.
The second problem, which is what this issue is tracking, is that even when the criteria for a correctly reported max are met, it is still possible for too many pods to be scheduled onto a node.
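To make the attachment arithmetic concrete (assuming the commonly cited 28 shared attachment slots on most Nitro instances): 28 slots minus the root EBS volume and two ENIs leaves exactly the 25 volumes the driver advertises, but a third ENI or an instance-store NVMe device drops the real ceiling to 24 or lower while the driver keeps reporting 25.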
@sultanovich see my reply above about the attachment limits. I think there is a separate issue and PR for resolving your problem.
@gnufied output from the workers while running the original tests with limits:
Name:               ip-**-**-**-**.eu-west-1.compute.internal
Labels:             <none>
Annotations:        storage.alpha.kubernetes.io/migrated-plugins: kubernetes.io/cinder
CreationTimestamp:  Thu, 17 Feb 2022 07:34:19 +0000
Spec:
  Drivers:
    ebs.csi.aws.com:
      Node ID:  i-redacted
      Allocatables:
        Count:        25
      Topology Keys:  [topology.ebs.csi.aws.com/zone]
Events:  <none>
Name:               ip-**-**-**-**.eu-west-1.compute.internal
Labels:             <none>
Annotations:        storage.alpha.kubernetes.io/migrated-plugins: kubernetes.io/cinder
CreationTimestamp:  Thu, 17 Feb 2022 07:44:10 +0000
Spec:
  Drivers:
    ebs.csi.aws.com:
      Node ID:  i-redacted
      Allocatables:
        Count:        25
      Topology Keys:  [topology.ebs.csi.aws.com/zone]
Events:  <none>
@stevehipwell I have no doubts about the limits that can be attached. My question is about why it happens and how to solve it.
I have opened a new issue (#1174) since the volume-attach-limit argument has not worked for me either.
Perhaps the issue with the ENIs stuck in the attaching state is due to another cause; what I meant is that after that error I begin to see the volume problems.
@sultanovich I think I've explained everything I know about this issue pretty well. Let me reiterate that there are two bugs here: the first, related to nodes not being picked up as Nitro or not having 25 free attachment slots, is being addressed by https://github.com/kubernetes-sigs/aws-ebs-csi-driver/pull/1075; the second, currently unexplained, is related to the speed at which requests for pods with PVs are sent to the scheduler. The second scenario is what this issue was opened for; with your new issue there are now a number of other issues relating to the first scenario.
Perhaps the issue with the ENIs stuck in the attaching state is due to another cause; what I meant is that after that error I begin to see the volume problems.
The current driver doesn't take any dynamic attachments into consideration: you get 25 if the node is detected as Nitro, or 39 if not. If you are getting failures on a Nitro instance that isn't a 5 series, or that has NVMe drives, or that is using more than 2 ENIs, you should be able to statically fix the problem by using the --volume-attach-limit argument. If you're using an m5 instance but requesting lots of PVs, it's likely that you're seeing this issue; you should be able to stop it happening by changing your deployment strategy and adding a wait between pods.
@ryanpxyz you are looking at the wrong place for the attachable limits of the CSI driver. The attach limit of the CSI driver is reported via CSINode objects. If we are not rebuilding CSINode objects during a redeploy of the driver, that sounds like a bug. So setting --volume-attach-limit and redeploying the driver should set the correct limits.
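For reference, a minimal sketch of doing this via the Helm chart (assuming your chart version exposes node.volumeAttachLimit; check the chart's values.yaml):
# values.yaml for the aws-ebs-csi-driver chart
node:
  volumeAttachLimit: 20   # rendered as --volume-attach-limit=20 on the node plugin
# then redeploy, e.g.:
# helm upgrade aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver -n kube-system -f values.yaml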
As for a bug in the scheduler, here is the code for counting the limits: https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/nodevolumelimits/csi.go#L210. It's been a while since I looked into the scheduler code, but if the scheduler is not respecting the limits reported by CSINode then that would be a k/k bug (and we would need to open one).
@bertinatto - are you aware of a bug where, if many pods are scheduled to a node at once, the scheduler may not correctly count the volume limits?
@gnufied it looks like it's the Filter function that does the work we're interested in. Unless only a single pod can be scheduled at a time, which is unlikely, this code doesn't appear to check for other in-flight requests and could easily result in over-provisioning volumes on a node.
I would expect to see something that locks a CSINode so only one calculation could run at a time, but I might be missing something here as I'm not really familiar with this part of the codebase.
As an aside, would supporting Storage Capacity Tracking help limit the blast radius of this issue?
@gnufied I tried setting the --volume-attach-limit argument in a test environment and it worked fine.
The only limitation I found is that it applies to the entire cluster; with nodes of different AWS instance types it could limit the number of volumes I can host, increasing infrastructure costs.
Do you have any idea how long it might take to modify this check so it uses the correct limits for all instance types?
@sultanovich this issue isn't the same as #1174, please don't confuse them.
@gnufied @bertinatto do you have any more thoughts on this? I doubt I've read the code correctly, so I would appreciate someone looking at the code I mentioned above to see if they spot the same potential issue.
On further testing, it looks like this has been fixed via an EKS platform version update (I suspect); I'd be interested if anyone knows exactly what was fixed.
@stevehipwell The EKS AMI changelog for the most recent v20220406 release had one interesting note that might be relevant:
The bootstrap script will auto-discover maxPods values when instanceType is missing in eni-max-pods.txt
@jrsdav thanks for looking out, but that functionality sets a kubelet arg (incorrectly in most cases) and isn't related to storage attachments. This issue was never about the correct max value being set for attachments (that's a separate issue with a fix coming in the next minor version); it was about a scheduling issue that didn't make much sense.
We're experiencing this issue as well, except that, while some of the discussion above suggests it's a scheduling race condition, in our case it seems like the volume attachments are never being properly counted. We have a node with 25 attachments, but the Allocated resources section of kubectl describe node shows zero attachments:
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests        Limits
  --------                    --------        ------
  cpu                         35170m (73%)    38 (79%)
  memory                      130316Mi (68%)  131564Mi (68%)
  ephemeral-storage           0 (0%)          0 (0%)
  hugepages-1Gi               0 (0%)          0 (0%)
  hugepages-2Mi               0 (0%)          0 (0%)
  attachable-volumes-aws-ebs  0               0
Any leads on what might be causing that to happen?
Again, it looks like you are looking at the wrong object. CSI volume limits are counted via CSINode objects, so please check what value it is reporting.
@gnufied Ah, okay. Thank you. It looks like the "allocatables" are indeed being properly counted, which I guess puts us in the race condition boat:
k describe csinode ip-172-20-60-87.ec2.internal
Name:               ip-172-20-60-87.ec2.internal
Labels:             <none>
Annotations:        <none>
CreationTimestamp:  Wed, 27 Apr 2022 05:12:27 -0400
Spec:
  Drivers:
    ebs.csi.aws.com:
      Node ID:  i-0f37978c6d1e25a52
      Allocatables:
        Count:        25
      Topology Keys:  [topology.ebs.csi.aws.com/zone topology.kubernetes.io/zone]
Events:  <none>
@gnufied, actually, is "Allocatables" just the total limit? How do I see what it thinks is currently allocated?
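One way to approximate what is currently attached through the CSI path is to count the VolumeAttachment objects bound to the node (node name taken from the output above):
# Each CSI attachment is represented by a VolumeAttachment object whose NODE column is the node name
kubectl get volumeattachments --no-headers | grep -c 'ip-172-20-60-87.ec2.internal'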
We are using CSI volumes and in-tree volumes at the same time and see similar errors. Even if CSI volumes are counted correctly, there are also non-CSI volumes attached to the nodes, which results in the underlying node limit being exceeded. Is this situation addressed by any of the linked issues?
@gnufied csinode reports total bullshit as well
Spec:
  Drivers:
    efs.csi.aws.com:
      Node ID:  i-0cf67b141b4d31d04
    ebs.csi.aws.com:
      Node ID:  i-0cf67b141b4d31d04
      Allocatables:
        Count:        39
      Topology Keys:  [topology.ebs.csi.aws.com/zone]
And it fails exactly after 25 volumes.
Okay, after browsing through the code it looks like it uses a lot of "heuristics" based on some AWS docs and not on the actual truth in the field. I've raised a support ticket with AWS to try to find out what's going on and why that information is not available through metadata, for example.
Experiencing similar EBS scheduling issues here, though I think multiple bugs/problems are mixed together in the discussion on this interesting ticket.
My situation (and 2 cents):
- AWS EKS version 1.23.9. Recent EKS optimized AMI v20220824.
- EBS CSI driver version: 1.11.2
- Using CSI in backward compat mode (not sure if it matters, probably not)
- This is a 'real life' situation; I'm not following the OP's example of a single StatefulSet, which could benefit from the workaround of an initialDelaySeconds parameter to stage the pod rollout. Separate pods are stuck together in my situation.
- I believe the occurrence is intermittent, depending on pod scheduling pressure on the node (?).
This happens when a new node (Nitro type) joins.
CSINode reports OK:
Name:               ip-10-18-56-105.eu-central-1.compute.internal
Labels:             <none>
Annotations:        storage.alpha.kubernetes.io/migrated-plugins: kubernetes.io/aws-ebs,kubernetes.io/azure-disk,kubernetes.io/cinder,kubernetes.io/gce-pd
CreationTimestamp:  Mon, 05 Sep 2022 11:06:32 +0200
Spec:
  Drivers:
    ebs.csi.aws.com:
      Node ID:  i-0680a3a4beff784e8
      Allocatables:
        Count:        25
      Topology Keys:  [topology.ebs.csi.aws.com/zone]
Events:  <none>
Pods are being scheduled, up to 24 working volumes. Then there are 2 additional volumes stuck in the Attaching state; technically only 1 of those could be attached given the Nitro limits.
I therefore suspect some sort of race condition. This might be identical to the more narrowly defined issue #1278, but not fully sure.
The manual workaround to get pods unstuck is to cordon the node and kill the stuck pods; the stuck PVCs will then be attached on another node. Subsequently you can uncordon.