
instance volume limits: workloads no longer attach ebs volumes

aydosman opened this issue 2 years ago • 60 comments

/kind bug

What happened? Workloads stop attaching EBS volumes after the instance volume limit is reached; the expected number of replicas for our requirement isn't met and pods are left in a Pending state.

Nodes have the appropriate limit of 25 set, but the scheduler sends more than 25 pods with volumes to a node.

kubelet Unable to attach or mount volumes: unmounted volumes=[test-volume], unattached volumes=[kube-api-access-redact test-volume]: timed out waiting for the condition

attachdetach-controller AttachVolume.Attach failed for volume "pvc-redact" : rpc error: code = Internal desc = Could not attach volume "vol-redact" to node "i-redact": attachment of disk "vol-redact" failed, expected device to be attached but was attaching

ebs-csi-controller driver.go:119] GRPC error: rpc error: code = Internal desc = Could not attach volume "vol-redact" to node "i-redact": attachment of disk "vol-redact" failed, expected device to be attached but was attaching

How to reproduce it (as minimally and precisely as possible)?

Deploying the test below should be sufficient to reproduce the problem.

apiVersion: v1
kind: Namespace
metadata:
  name: vols
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: vols
  name: vols-pv-test
spec:
  selector:
    matchLabels:
      app: vols-pv-test 
  serviceName: "vols-pv-tester"
  replicas: 60
  template:
    metadata:
      labels:
        app: vols-pv-test
    spec:
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        volumeMounts:
        - name: test-volume
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: test-volume
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "******" # something with a reclaim policy of delete
      resources:
        requests:
          storage: 1Gi

Update: adding a liveness probe with an initial delay of 60 seconds seems to work around the problem; our nodes scale, and the replica count is correct with volumes attached.

apiVersion: v1
kind: Namespace
metadata:
  name: vols
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: vols
  name: vols-pv-test
spec:
  selector:
    matchLabels:
      app: vols-pv-test 
  serviceName: "vols-pv-tester"
  replicas: 60
  template:
    metadata:
      labels:
        app: vols-pv-test
    spec:
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        livenessProbe:
          tcpSocket:
            port: 80
          initialDelaySeconds: 60
          periodSeconds: 10          
        volumeMounts:
        - name: test-volume
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: test-volume
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "******" # something with a reclaim policy of delete
      resources:
        requests:
          storage: 1Gi

Environment

  • Kubernetes version: Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.5-eks-bc4871b", GitCommit:"5236faf39f1b7a7dabea8df12726f25608131aa9", GitTreeState:"clean", BuildDate:"2021-10-29T23:32:16Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
  • Version: Helm Chart: v2.6.2 Driver v1.5.0

aydosman avatar Feb 04 '22 11:02 aydosman

Hello,

I believe we are seeing this too.

Warning  FailedAttachVolume  34s (x11 over 8m47s)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-61b4bf2c-541f-4ef1-9f21-redacted" : rpc error: code = Internal desc = Could not attach volume "vol-redacted" to node "i-redacted": attachment of disk "vol-redacted" failed, expected device to be attached but was attaching 

...with:

# /bin/kubelet --version
Kubernetes v1.20.11-eks-f17b81

and:

k8s.gcr.io/provider-aws/aws-ebs-csi-driver:v1.4.0

Thanks,

Phil.

ryanpxyz avatar Feb 16 '22 13:02 ryanpxyz

I suspect this is a race condition somewhere; my current thinking is that it's in the scheduler, but I haven't had a chance to look at it further.

stevehipwell avatar Feb 16 '22 16:02 stevehipwell

Is the CSI driver running with correctly defined limits? What does the CSINode object from the node report?

gnufied avatar Feb 16 '22 16:02 gnufied

@gnufied the CSI driver looks to be doing everything correctly; AFAIK the only thing it needs to do is report the maximum number of PV attachments it can make. As reported above, if you add latency between pod scheduling events, the pods are sent to nodes with space for the PV mounts, which is why I suspect it's a scheduler issue.

stevehipwell avatar Feb 16 '22 17:02 stevehipwell

@stevehipwell No, that shouldn't happen. We start counting volumes against the limit before pods are even started on the node. I am still waiting on the output of the CSINode object from the problematic node.

gnufied avatar Feb 16 '22 17:02 gnufied

@gnufied I agree that the CSI driver is reporting correctly, which, combined with the 60-second wait fixing the issue, makes me believe that this issue is actually happening elsewhere as a race condition.

stevehipwell avatar Feb 16 '22 17:02 stevehipwell

I am seeing the same problem in my environment:

Events:
  Type     Reason       Age   From                                   Message
  ----     ------       ----  ----                                   -------
  Normal   Scheduled    2m5s  default-scheduler                      Successfully assigned 2d3b9d81e0b0/master-0 to ip-10-3-109-222.ec2.internal
  Warning  FailedMount  2s    kubelet, ip-10-3-109-222.ec2.internal  Unable to attach or mount volumes: unmounted volumes=[master], unattached volumes=[master backups filebeat-configuration default-token-vjscf]: timed out waiting for the condition

I haven't been able to confirm whether they are related, but I see this happening on the same node that has ENIs stuck in the "attaching" state, and I see the following errors in /var/log/aws-routed-eni/ipamd.log:

{"level":"error","ts":"2022-02-01T19:45:34.410Z","caller":"ipamd/ipamd.go:805","msg":"Failed to increase pool size due to not able to allocate ENI AllocENI: error attaching ENI: attachENI: failed to attach ENI:AttachmentLimitExceeded: Interface count 9 exceeds the limit for c5.4xlarge\n\tstatus code: 400, request id: 836ce9b1-ec63-4935-a007-739e32f506cb"}

For reference, the c5.4xlarge instance type in AWS supports 8 ENIs.

We are testing whether we can limit it using the volume-attach-limit argument, which is not set yet, but I would first like to understand why this happens and whether there is a way to avoid hardcoding that value.

Environment

Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:14:22Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.11-eks-f17b81", GitCommit:"f17b810c9e5a82200d28b6210b458497ddfcf31b", GitTreeState:"clean", BuildDate:"2021-10-15T21:46:21Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
[user@admin [~] ~]$ kubectl -n kube-system get deployment ebs-csi-controller -o wide -o yaml | grep "image: k8s.gcr.io/provider-aws/aws-ebs-csi-driver"
        image: k8s.gcr.io/provider-aws/aws-ebs-csi-driver:v1.0.0
[user@admin [~] ~]$
[root@i-6666919f09cc78046 ~]# /usr/bin/kubelet --version
Kubernetes v1.19.15-eks-9c63c4
[root@i-6666919f09cc78046 ~]#

sultanovich avatar Feb 16 '22 18:02 sultanovich

Hello,

... update from our side:

Our first simple workaround, from when we first observed the problem yesterday (it might help others who are stuck and looking for a 'quick fix'):

  • Cordon the node that the pod is stuck in 'Init' on.
  • Delete the pod.
  • Verify that the pod starts successfully on an alternative node; if not, repeat the cordoning until the pod is successfully deployed.
  • Uncordon the node(s) upon successful deployment (see the command sketch below).
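
A minimal sketch of those steps as kubectl commands; the node, pod, and namespace names are placeholders based on the repro manifest above, so adjust them to your workload:

# 1. Cordon the node the pod is stuck on so nothing new schedules there
kubectl cordon ip-10-0-0-1.eu-west-1.compute.internal

# 2. Delete the stuck pod; its controller (here the StatefulSet) recreates it elsewhere
kubectl delete pod vols-pv-test-0 -n vols

# 3. Watch where the replacement lands; if it gets stuck again, repeat steps 1-2 on that node
kubectl get pod vols-pv-test-0 -n vols -o wide -w

# 4. Once everything is Running, uncordon the node(s)
kubectl uncordon ip-10-0-0-1.eu-west-1.compute.internal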

Then, following a dive into the EBS CSI driver code, we passed the option --volume-attach-limit=50 to the node driver. I haven't tested this explicitly yet, however.

The problem to me seems to be a missing feedback loop between the 'node driver' and the scheduler.

The scheduler says, "Hey, there's a node that satisfies my scheduling criteria ... I'll schedule the workload to run there ..." and the node driver says, "OK, I have a workload but I've reached this '25 attached volumes' limit so I'm done here ...".

This is just my perhaps primitive view of the situation.

Thanks,

Phil.

PS: following a redeployment of the EBS CSI node driver, we are still seeing the attribute attachable-volumes-aws-ebs set to 25 in the output of kubectl describe node:


... we weren't expecting this.

ryanpxyz avatar Feb 17 '22 05:02 ryanpxyz

@ryanpxyz looking at the code, I think the CSI driver just reports how many attachments it can make. Until the PR to make this dynamic is merged and released, this is a fixed value determined by instance type or by the argument. This means there are two related but distinct issues.

The first is the incorrect maximum value, which doesn't take into account all Nitro instances and their other attachments. For example, a Nitro instance (5 series only) with no argument will have a limit of 25, which is correct as long as you only have 3 other attachments. If you're using custom networking and prefixes, this means instances without an additional NVMe drive work, but ones with one get stuck.

The second problem, which is what this issue is tracking, is that even when the criteria for a correctly reported maximum are met, it is still possible for too many pods to be scheduled onto a node.

stevehipwell avatar Feb 17 '22 05:02 stevehipwell

@sultanovich see my reply above about the attachment limits. I think there is a separate issue and PR for resolving your problem.

stevehipwell avatar Feb 17 '22 05:02 stevehipwell

@gnufied output from the workers while running the original tests with limits

Name:               ip-**-**-**-**.eu-west-1.compute.internal
Labels:             <none>
Annotations:        storage.alpha.kubernetes.io/migrated-plugins: kubernetes.io/cinder
CreationTimestamp:  Thu, 17 Feb 2022 07:34:19 +0000
Spec:
  Drivers:
    ebs.csi.aws.com:
      Node ID:  i-redacted
      Allocatables:
        Count:        25
      Topology Keys:  [topology.ebs.csi.aws.com/zone]
Events:               <none>



Name:               ip-**-**-**-**.eu-west-1.compute.internal
Labels:             <none>
Annotations:        storage.alpha.kubernetes.io/migrated-plugins: kubernetes.io/cinder
CreationTimestamp:  Thu, 17 Feb 2022 07:44:10 +0000
Spec:
  Drivers:
    ebs.csi.aws.com:
      Node ID:  i-redacted
      Allocatables:
        Count:        25
      Topology Keys:  [topology.ebs.csi.aws.com/zone]
Events:               <none>

aydosman avatar Feb 17 '22 08:02 aydosman

@stevehipwell I have no doubts about the limits on how many volumes can be attached. My question is why this happens and how to solve it.

I have opened a new issue (#1174), since the volume-attach-limit argument has not worked for me either.

Perhaps the issue with the ENIs stuck in the attaching state has another cause; what I meant is that after that error I begin to see the volume problems.

sultanovich avatar Feb 21 '22 17:02 sultanovich

@sultanovich I think I've explained everything I know about this issue pretty well. Let me reiterate that there are two bugs here: the first, related to nodes not being detected as Nitro or not having 25 free attachment slots, is being addressed by https://github.com/kubernetes-sigs/aws-ebs-csi-driver/pull/1075; the second, currently unexplained, is related to the speed at which requests for pods with PVs are sent to the scheduler. The second scenario is what this issue was opened for; with your new issue, there are now a number of other issues relating to the first scenario.

Perhaps the issue with the ENIs stuck in the attaching state has another cause; what I meant is that after that error I begin to see the volume problems.

The current driver doesn't take any dynamic attachments into consideration: you get 25 if the node is detected as Nitro, or 39 if not. If you are getting failures on a Nitro instance that isn't a 5 series, has NVMe drives, or is using more than 2 ENIs, you should be able to statically fix the problem by using the --volume-attach-limit argument. If you're using an m5 instance but requesting lots of PVs, it's likely that you're seeing this issue; you should be able to stop it happening by changing your deployment strategy and adding a wait between pods.
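
For completeness, a sketch of how that static limit might be applied with the Helm chart. This assumes the driver was installed from the upstream kubernetes-sigs chart and that your chart version exposes node.volumeAttachLimit (which maps to --volume-attach-limit on the node plugin); verify the value name against your chart version:

# Cluster-wide cap on EBS attachments reported by every node plugin (value name assumed)
helm upgrade --install aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
  --namespace kube-system \
  --reuse-values \
  --set node.volumeAttachLimit=25

Note that the flag applies to every node in the cluster, so it trades unused attachment capacity on larger instance types for correctness on the smallest ones.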

stevehipwell avatar Feb 21 '22 17:02 stevehipwell

@ryanpxyz you are looking in the wrong place for the attachable limits of the CSI driver. The attach limit of the CSI driver is reported via CSINode objects. If we are not rebuilding CSINode objects during a redeploy of the driver, that sounds like a bug. So setting --volume-attach-limit and redeploying the driver should set the correct limits.

As for a bug in the scheduler, here is the code for counting the limits: https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/nodevolumelimits/csi.go#L210. It's been a while since I looked into the scheduler code, but if the scheduler is not respecting the limits reported by CSINode, then that would be a k/k bug (and we are going to need to open one).

gnufied avatar Feb 21 '22 17:02 gnufied

@bertinatto - are you aware of a bug where, if many pods are scheduled to a node at once, the scheduler may not correctly count the volume limits?

gnufied avatar Feb 21 '22 17:02 gnufied

@gnufied it looks like the Filter function is doing the work we're interested in. Unless only a single pod can be scheduled at a time, which is unlikely, this code doesn't appear to check for other in-flight requests and could easily result in over-provisioning volumes on a node.

I would expect to see something that locks a CSINode so that only one calculation can run at a time, but I might be missing something here, as I'm not really familiar with this part of the codebase.

As an aside, would supporting Storage Capacity Tracking help limit the blast radius of this issue?

stevehipwell avatar Feb 22 '22 09:02 stevehipwell

@gnufied I tried setting the --volume-attach-limit argument in a test environment and it worked fine. The only limitation I find is that it applies to the entire cluster; if I have nodes of other AWS instance types, it could limit the number of volumes they can host, increasing infrastructure costs.

Do you have any idea how long it might take to modify this check so that the correct limits are used on all instance types?

sultanovich avatar Mar 14 '22 13:03 sultanovich

@sultanovich this issue isn't the same as #1174, please don't confuse them.

stevehipwell avatar Mar 15 '22 16:03 stevehipwell

@gnufied @bertinatto do you have any more thoughts on this? I doubt I've read the code correctly, so I would appreciate someone looking at the code I mentioned above to see if they can spot the same potential issue.

stevehipwell avatar Mar 29 '22 09:03 stevehipwell

On further testing of this, it looks like it has been fixed via an EKS platform version update (I suspect); I'd be interested if anyone knows exactly what was fixed.

stevehipwell avatar Apr 12 '22 14:04 stevehipwell

@stevehipwell The EKS AMI changelog for the most recent v20220406 release had one interesting note that might be relevant:

The bootstrap script will auto-discover maxPods values when instanceType is missing in eni-max-pods.txt

jrsdav avatar Apr 20 '22 16:04 jrsdav

@jrsdav thanks for looking out, but that functionality sets a kubelet argument (incorrectly in most cases) and isn't related to storage attachments. This issue was never about the correct maximum value being set for attachments (that's a separate issue with a fix coming in the next minor version); it was about a scheduling issue that didn't make much sense.

stevehipwell avatar Apr 20 '22 16:04 stevehipwell

We're experiencing this issue as well, except that, from some of the discussion above, it sounds like people think it's some kind of scheduling race condition. In our case, it seems like the volume attachments are never being properly counted. We have a node with 25 attachments, but the Allocated resources section under kubectl describe node shows zero attachments:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests        Limits
  --------                    --------        ------
  cpu                         35170m (73%)    38 (79%)
  memory                      130316Mi (68%)  131564Mi (68%)
  ephemeral-storage           0 (0%)          0 (0%)
  hugepages-1Gi               0 (0%)          0 (0%)
  hugepages-2Mi               0 (0%)          0 (0%)
  attachable-volumes-aws-ebs  0               0

Any leads on what might be causing that to happen?

LeeHampton avatar Apr 29 '22 17:04 LeeHampton

Again, it looks like you are looking at the wrong object. CSI volume limits are counted via CSINode objects, so please check what value it is reporting.
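
For reference, a sketch of how to compare the reported limit with what is actually counted against it; the node name is a placeholder, and the second command assumes CSI attachments are tracked by the standard storage.k8s.io VolumeAttachment objects:

# Limit the EBS CSI driver reported for this node (CSINode allocatable count)
kubectl get csinode <node-name> \
  -o jsonpath='{.spec.drivers[?(@.name=="ebs.csi.aws.com")].allocatable.count}'

# Volumes currently attached to the node via CSI (one VolumeAttachment per attached volume)
kubectl get volumeattachments \
  -o custom-columns=PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached \
  | grep <node-name> | wc -l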

gnufied avatar Apr 29 '22 17:04 gnufied

@gnufied Ah, okay. Thank you. It looks like the "allocatables" are indeed being properly counted, which I guess puts us in the race condition boat:

k describe csinode  ip-172-20-60-87.ec2.internal


Name:               ip-172-20-60-87.ec2.internal
Labels:             <none>
Annotations:        <none>
CreationTimestamp:  Wed, 27 Apr 2022 05:12:27 -0400
Spec:
  Drivers:
    ebs.csi.aws.com:
      Node ID:  i-0f37978c6d1e25a52
      Allocatables:
        Count:        25
      Topology Keys:  [topology.ebs.csi.aws.com/zone topology.kubernetes.io/zone]
Events:               <none>

LeeHampton avatar Apr 29 '22 17:04 LeeHampton

@gnufied, actually, is "Allocatables" just the total limit? How do I see what it thinks is currently allocated?

LeeHampton avatar Apr 29 '22 17:04 LeeHampton

We are using CSI volumes and in-tree volumes at the same time and see similar errors. Even if CSI volumes are counted correctly, there are also non-CSI volumes attached to the nodes, which results in the underlying node limit being exceeded. Is this situation addressed by any of the linked issues?
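
As far as I understand, at least before CSI migration is enabled the two limits are tracked on different objects and are not reconciled against each other: in-tree (kubernetes.io/aws-ebs) volumes count against the Node's attachable-volumes-aws-ebs allocatable, while CSI volumes count against the CSINode allocatable, so the combined total can exceed what the instance really supports. A quick way to read the in-tree side (node name is a placeholder):

# In-tree (kubernetes.io/aws-ebs) limit, tracked on the Node object
kubectl get node <node-name> -o jsonpath='{.status.allocatable.attachable-volumes-aws-ebs}'
# Compare with the CSI-side count from the CSINode query shown earlier in the thread;
# as far as I can tell, each limit is enforced separately by its own scheduler plugin.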

Legion2 avatar May 07 '22 10:05 Legion2

@gnufied csinode reports total bullshit as well

Spec:
  Drivers:
    efs.csi.aws.com:
      Node ID:  i-0cf67b141b4d31d04
    ebs.csi.aws.com:
      Node ID:  i-0cf67b141b4d31d04
      Allocatables:
        Count:        39
      Topology Keys:  [topology.ebs.csi.aws.com/zone]

And it fails exactly after 25 volumes.

pkit avatar Aug 18 '22 14:08 pkit

Okay, after browsing through the code it looks like it uses a lot of "heuristics" based on some AWS docs and not on the actual truth in the field. I've raised a support ticket with AWS to try to find out what's going on and why that information is not available through instance metadata, for example.
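
Until then, a sketch of pulling the actual attachment counts from the EC2 API with the AWS CLI; the instance ID is the node ID from the CSINode output above, used here as a placeholder:

# EBS volumes actually attached to the instance right now (includes the root volume)
aws ec2 describe-volumes \
  --filters Name=attachment.instance-id,Values=i-0cf67b141b4d31d04 \
  --query 'length(Volumes[])'

# ENIs attached to the instance; on Nitro these share the same attachment budget as EBS volumes
aws ec2 describe-network-interfaces \
  --filters Name=attachment.instance-id,Values=i-0cf67b141b4d31d04 \
  --query 'length(NetworkInterfaces[])'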

pkit avatar Aug 18 '22 16:08 pkit

Experiencing similar EBS scheduling issues here, though I think multiple bugs/problems are mixed together in the discussion on this interesting ticket.

My situation (and 2 cents):

  • AWS EKS version 1.23.9. Recent EKS optimized AMI v20220824.
  • EBS CSI driver version: 1.11.2
  • Using CSI in backward compat mode (not sure if it matters, probably not)
  • This is a 'real life' situation; I'm not following the OP's example of a single StatefulSet, which could benefit from the workaround of an initialDelaySeconds parameter to stage the pod rollout. Separate pods are stuck together in my situation.
  • I believe the occurrence is intermittent, depending on pod scheduling pressure on the node (?).

This happens when a new node (Nitro type) joins.

CSINode reports OK:

Name:               ip-10-18-56-105.eu-central-1.compute.internal
Labels:             <none>
Annotations:        storage.alpha.kubernetes.io/migrated-plugins: kubernetes.io/aws-ebs,kubernetes.io/azure-disk,kubernetes.io/cinder,kubernetes.io/gce-pd
CreationTimestamp:  Mon, 05 Sep 2022 11:06:32 +0200
Spec:
  Drivers:
    ebs.csi.aws.com:
      Node ID:  i-0680a3a4beff784e8
      Allocatables:
        Count:        25
      Topology Keys:  [topology.ebs.csi.aws.com/zone]
Events:               <none>

Pods are being scheduled, with up to 24 working volumes. Then there are 2 additional volumes stuck in the Attaching state. Technically only 1 of those could be attached given the Nitro limits, so I suspect some sort of race condition. This might be identical to the more narrowly defined issue #1278, but I'm not fully sure.

The manual workaround to get pods unstuck is to cordon the node and kill the stuck pods; the stuck PVCs will then be moved elsewhere. Subsequently you can uncordon.

jortkoopmans avatar Sep 05 '22 14:09 jortkoopmans