aws-ebs-csi-driver

EBS CSI Driver attaches node backed volume instead of PVC

Open yznima opened this issue 3 years ago • 7 comments

/kind bug

We have

  • EKS Kubernetes Cluster
  • AWS EBS CSI driver installed through the AWS managed add-on (version v1.4.0-eksbuild.preview)
  • A StatefulSet for our service (1 replica); abridged manifest:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  # ...
spec:
  template:
    spec:
      containers:
        - # ...
          volumeMounts:
            - mountPath: /ipfscache
              name: ipfscache
  volumeClaimTemplates:
    - metadata:
        name: ipfscache
      spec:
        accessModes:
          - ReadWriteOnce
        storageClassName: io2
        resources:
          requests:
            storage: 10000Gi
  • IO2 StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: io2
provisioner: ebs.csi.aws.com 
parameters:
  type: io2
  fsType: ext4
  iopsPerGB: "10"
volumeBindingMode: WaitForFirstConsumer
  • A Pod (part of the StatefulSet)
  • An EBS volume of 10 TB with a storage class of io2 attached (provisioned dynamically by the AWS EBS CSI driver)
  • The workload is running on m5.8xlarge instances

What happened?

When a pod is initialized, the EBS volume is successfully attached and mounted into the pod. However, sometimes when the pod is deleted and recreated for any reason, such as a configuration change to the StatefulSet, the EBS volume doesn't attach properly to the pod. To make the problem worse, the pod actually starts with an invalid volume mount: the directory path that is expected to be the PVC-backed volume is in fact backed by the EC2 instance's root volume.
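A quick way to confirm which device backs the mount from inside the container is to compare the mount source of /ipfscache against the root filesystem. This is only a sketch and assumes util-linux's findmnt is available in the image (minimal/BusyBox images may not ship it; /proc/self/mountinfo has the same information):

$ findmnt -T /ipfscache -o TARGET,SOURCE,FSTYPE   # healthy: SOURCE is the EBS device, e.g. /dev/nvme1n1
$ findmnt -T / -o TARGET,SOURCE                   # bad state: /ipfscache reports the same source as /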

This is the result of running df -h when the pod state is correct.

$ df -h
overlay                 100.0G      5.3G     94.7G   5% /
tmpfs                    64.0M         0     64.0M   0% /dev
tmpfs                    61.5G         0     61.5G   0% /sys/fs/cgroup
/dev/nvme1n1              9.6T      3.8T      5.8T  39% /ipfscache
/dev/nvme0n1p1          100.0G      5.3G     94.7G   5% /dev/termination-log
/dev/nvme0n1p1          100.0G      5.3G     94.7G   5% /etc/resolv.conf
/dev/nvme0n1p1          100.0G      5.3G     94.7G   5% /etc/hostname
/dev/nvme0n1p1          100.0G      5.3G     94.7G   5% /etc/hosts
shm                      64.0M         0     64.0M   0% /dev/shm
tmpfs                    61.5G     12.0K     61.5G   0% /run/secrets/kubernetes.io/serviceaccount
tmpfs                    61.5G      4.0K     61.5G   0% /run/secrets/eks.amazonaws.com/serviceaccount
tmpfs                    61.5G         0     61.5G   0% /proc/acpi
tmpfs                    64.0M         0     64.0M   0% /proc/kcore
tmpfs                    64.0M         0     64.0M   0% /proc/keys
tmpfs                    64.0M         0     64.0M   0% /proc/latency_stats
tmpfs                    64.0M         0     64.0M   0% /proc/timer_list
tmpfs                    64.0M         0     64.0M   0% /proc/sched_debug
tmpfs                    61.5G         0     61.5G   0% /sys/firmware

And this is the result when the pod is in a bad state

$ df -h
overlay                 100.0G      5.3G     94.7G   5% /
tmpfs                    64.0M         0     64.0M   0% /dev
tmpfs                    61.5G         0     61.5G   0% /sys/fs/cgroup
/dev/nvme0n1p1          100.0G      5.3G     94.7G   5% /ipfscache
/dev/nvme0n1p1          100.0G      5.3G     94.7G   5% /dev/termination-log
/dev/nvme0n1p1          100.0G      5.3G     94.7G   5% /etc/resolv.conf
/dev/nvme0n1p1          100.0G      5.3G     94.7G   5% /etc/hostname
/dev/nvme0n1p1          100.0G      5.3G     94.7G   5% /etc/hosts
shm                      64.0M         0     64.0M   0% /dev/shm
tmpfs                    61.5G     12.0K     61.5G   0% /run/secrets/kubernetes.io/serviceaccount
tmpfs                    61.5G      4.0K     61.5G   0% /run/secrets/eks.amazonaws.com/serviceaccount
tmpfs                    61.5G         0     61.5G   0% /proc/acpi
tmpfs                    64.0M         0     64.0M   0% /proc/kcore
tmpfs                    64.0M         0     64.0M   0% /proc/keys
tmpfs                    64.0M         0     64.0M   0% /proc/latency_stats
tmpfs                    64.0M         0     64.0M   0% /proc/timer_list
tmpfs                    64.0M         0     64.0M   0% /proc/sched_debug
tmpfs                    61.5G         0     61.5G   0% /sys/firmware

Note the /dev/nvme0n1p1 100.0G 5.3G 94.7G 5% /ipfscache line.

When describing the Pod, we can see the following events

Events:
  Type     Reason       Age                    From               Message
  ----     ------       ----                   ----               -------
  Normal   Scheduled    5m6s                   default-scheduler  Successfully assigned XXXXX to ip-XXXX.us-east-2.compute.internal
  Warning  FailedMount  3m3s                   kubelet            Unable to attach or mount volumes: unmounted volumes=[ipfscache], unattached volumes=[kube-api-access-XXXX aws-iam-token ipfscache]: timed out waiting for the condition
  Warning  FailedMount  2m45s (x7 over 3m17s)  kubelet            MountVolume.SetUp failed for volume "pvc-XXXX" : rpc error: code = Aborted desc = An operation with the given volume="vol-XXXX" is already in progress
  Normal   Pulled       2m12s                  kubelet            Container image "XXXXXXXXXXXX" already present on machine
  Normal   Created      2m12s                  kubelet            Created container XXXXX
  Normal   Started      2m12s                  kubelet            Started container XXXXX
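When the mount is stuck like this, it can also help to cross-check the VolumeAttachment object that the CSI attacher manages for the PV (a sketch; the PV name is a placeholder and jq is assumed to be available):

$ kubectl get volumeattachment
$ kubectl get volumeattachment -o json | jq '.items[] | select(.spec.source.persistentVolumeName == "pvc-XXXX") | {node: .spec.nodeName, attached: .status.attached, detachError: .status.detachError}'

If the object still points at the old node, or is already gone while the pod is starting, that lines up with the "already in progress" and "not found" errors in the logs below.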

Tailing the EBS plugin pod logs, we can see a few different warnings. Note that the timestamps don't exactly match because the excerpts are not from the same occurrence of this issue.

...
W0204 19:19:55.218272       1 cloud.go:547] Ignoring error from describe volume for volume "vol-XXXX"; will retry: "RequestCanceled: request context canceled\ncaused by: context canceled"
...
E0204 19:19:55.218315       1 driver.go:119] GRPC error: rpc error: code = Internal desc = Could not detach volume "vol-XXXX" from node "i-XXXX": timed out waiting for the condition
....
W0210 21:17:24.562919       1 cloud.go:480] DetachDisk called on non-attached volume: vol-XXXX

Tailing the CSI attacher sidecar logs, we can see other warnings.

I0210 21:17:24.661529       1 csi_handler.go:281] Failed to save detach error to "csi-XXXX": volumeattachments.storage.k8s.io "csi-XXXX" not found
I0210 21:17:24.661547       1 csi_handler.go:226] Error processing "csi-XXXX": failed to detach: could not mark as detached: volumeattachments.storage.k8s.io "csi-XXXX" not found

What you expected to happen?

The pod shouldn't start with an invalid volume mount; this can cause data loss and corruption.

How to reproduce it (as minimally and precisely as possible)?

We can't reproduce it intentionally or pinpoint the exact sequence of events that puts the pod's volume into a bad state. However, it happens at least once or twice a day when there are new configuration changes to the StatefulSet, i.e. after the pod is deleted and recreated.

From the logs, we speculate that when the pod is deleted and another one is created with the new configuration, it takes a while for the volume to detach and re-attach. This could be due to the size of the volume we are using, but we are not sure.

Anything else we need to know?:

First, when the pod is in a bad state and I run kubectl describe on the pod, it still shows the directory as backed by the PVC; in other words, the pod description shows the PVC as mounted.

Second, the AWS console and CLI show the EBS volume as in-use.
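For reference, this is roughly the CLI check we run (the volume ID is a placeholder; it assumes the AWS CLI is configured for the right account and region):

$ aws ec2 describe-volumes --volume-ids vol-XXXX \
    --query 'Volumes[0].{State:State,Attachments:Attachments[].{Instance:InstanceId,State:State,Device:Device}}' \
    --output table

The volume reports in-use with an attachment to the node's instance even while the mount inside the pod is wrong.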

Third, after SSH-ing to the Kubernetes worker node, I can confirm by running df -h there that the volume mount is indeed backed by the EC2 instance's root volume.
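On Nitro instances such as m5.8xlarge the EBS volume ID is exposed as the NVMe serial number, so the device-to-volume mapping can be double-checked on the node itself (a sketch):

$ lsblk -o NAME,SERIAL,SIZE,MOUNTPOINT
# the SERIAL column shows the EBS volume ID (vol-... with the dash removed) for each nvme device;
# in the bad state you would expect the PVC's volume to either have no device entry or to not be
# mounted under /var/lib/kubelet, while /ipfscache in the pod resolves to the root device nvme0n1p1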

Environment

  • Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:31:21Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.5-eks-bc4871b", GitCommit:"5236faf39f1b7a7dabea8df12726f25608131aa9", GitTreeState:"clean", BuildDate:"2021-10-29T23:32:16Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}

  • Driver version:

v1.4.0-eksbuild.preview

yznima avatar Feb 10 '22 22:02 yznima

Any update here? Also seeing this.

grobbie avatar Mar 30 '22 07:03 grobbie

Also seeing something similar: a pod was recreated to give it more CPU and memory, and we're getting

 Warning  FailedMount  4m36s                 kubelet, ip-10-102-127-112.ec2.internal  Unable to attach or mount volumes: unmounted volumes=[platform-pulsar-bookkeeper-journal platform-pulsar-bookkeeper-ledgers], unattached volumes=[istio-podinfo platform-pulsar-bookkeeper-journal platform-pulsar-bookkeeper-ledgers platform-pulsar-bookkeeper-token-xczhl istiod-ca-cert istio-data istio-envoy istio-token]: timed out waiting for the condition
  Warning  FailedMount  117s (x3 over 6m1s)   kubelet, ip-10-102-127-112.ec2.internal  MountVolume.MountDevice failed for volume "pvc-4c7b87de-8e27-4faf-94be-7fe719c3b5c6" : rpc error: code = Aborted desc = An operation with the given volume="vol-08ecdf3c15bb04a5f" is already in progress
  Warning  FailedMount  101s (x3 over 5m45s)  kubelet, ip-10-102-127-112.ec2.internal  MountVolume.MountDevice failed for volume "pvc-bf7a407f-4a85-407f-9e14-ca756fe19f25" : rpc error: code = Aborted desc = An operation with the given volume="vol-00ac669e7644edc35" is already in progress
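The node-side driver logs usually show why MountDevice is stuck (a sketch, assuming the driver runs in kube-system with its default app=ebs-csi-node label):

$ kubectl logs -n kube-system -l app=ebs-csi-node -c ebs-plugin --tail=500 \
    | grep -E 'NodeStageVolume|NodeUnstageVolume|vol-08ecdf3c15bb04a5f'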

frankjkelly avatar Apr 27 '22 04:04 frankjkelly

We are seeing a similar issue on our end.

It already caused at least two production outages and we had no luck replicating it on our development environments.

We are using c5d.18xlarge nodes on EKS v1.21 with EBS CSI driver v1.6.1.

RicardsRikmanis avatar May 23 '22 12:05 RicardsRikmanis

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 21 '22 13:08 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Sep 20 '22 13:09 k8s-triage-robot

/remove-lifecycle-rotten

grobbie avatar Sep 20 '22 13:09 grobbie

Any updates on this issue? We are also facing the same problem, where the volume is mounted on the node volume instead of the PVC/EBS volume. Also, ebs-csi-node misses the NodeStageVolume call and directly calls NodePublishVolume, resulting in the node-volume mount issue when the unmount failed or timed out.
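If NodePublishVolume really does run without a successful NodeStageVolume, the global staging directory would just be an ordinary directory on the node's root volume, and the bind mount into the pod would be backed by that instead of the EBS device. This can be checked on the worker node (a sketch; the exact staging path layout varies by driver and Kubernetes version):

$ findmnt -rn -o TARGET,SOURCE | grep 'kubernetes.io/csi'
# each staged EBS volume should appear with a /dev/nvmeXnY source under a .../globalmount path;
# if the pod's volume directory exists but no matching globalmount entry does, staging never completed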

ankitjain28may avatar Sep 27 '22 19:09 ankitjain28may

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Oct 27 '22 20:10 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Oct 27 '22 20:10 k8s-ci-robot

Any update on this? Was anybody able to figure this out? I have been in a loop for about 2 days with the exact same issue (our entire application has been down).

warleysa avatar Feb 26 '23 03:02 warleysa

Same here. Application down because of this

parikshit223933 avatar Feb 27 '23 07:02 parikshit223933

This is hitting us as well. The ASG replaced the nodes after a new AMI was specified. The EBS CSI driver shows the PVCs as Bound while the AWS console shows the volumes as available. This bug is not solved.

herrbpl avatar Feb 27 '23 13:02 herrbpl

I'm facing the same issue. The deployment strategy is Recreate, but the volume fails to detach from the old node.

duclm2609 avatar Apr 28 '23 07:04 duclm2609

Was this ever addressed? Still seeing this on some volumes.

yuvalavidor avatar Mar 15 '24 11:03 yuvalavidor