
Random "Unable to attach or mount volumes"

Open kanor1306 opened this issue 2 years ago • 6 comments

/kind bug

What happened? Pods get stuck in ContainerCreating and nodes are rendered useless, seemingly at random.

We have been seeing this issue occasionally for a couple of months (possibly longer, perhaps since we started using Lustre, but it only seems to appear when there is a lot going on). It is pretty disruptive in production systems.

The timeline of the issue is:

  • A new pod, podX, gets created; it mounts multiple Lustre volumes, both statically and dynamically provisioned.
  • podX is scheduled to a node that has been working fine.
  • podX gets stuck in ContainerCreating status. The events in the pod are these (Lustre volume ID and IPs redacted):
 Type     Reason       Age                  From                                                   Message
  ----     ------       ----                 ----                                                   -------
  Warning  FailedMount  42m (x94 over 19h)   kubelet, ip-x-x-x-x.eu-west-1.compute.internal  Unable to attach or mount volumes: unmounted volumes=[vol-23713], unattached volumes=[vol-23714 vol-23715 workflow-token-csshf vol-23713]: timed out waiting for the condition
  Warning  FailedMount  17m (x105 over 19h)  kubelet, ip-x-x-x-x.eu-west-1.compute.internal  Unable to attach or mount volumes: unmounted volumes=[vol-23713], unattached volumes=[vol-23715 workflow-token-csshf vol-23713 vol-23714]: timed out waiting for the condition
  Warning  FailedMount  13m (x71 over 19h)   kubelet, ip-x-x-x-x.eu-west-1.compute.internal  MountVolume.SetUp failed for volume "pvc-abf3d9ca-ab58-4d82-a2a9-d79768cd038e" : rpc error: code = Internal desc = Could not mount "<lustre-fs-id>.fsx.eu-west-1.amazonaws.com@tcp:/rtv3bbmv" at "/var/lib/kubelet/pods/2d0d102e-e368-4b6a-aa6a-1c49a167cd97/volumes/kubernetes.io~csi/pvc-abf3d9ca-ab58-4d82-a2a9-d79768cd038e/mount": mount failed: exit status 17
Mounting command: mount
Mounting arguments: -t lustre -o flock <lustre-fs-id>.fsx.eu-west-1.amazonaws.com@tcp:/rtv3bbmv /var/lib/kubelet/pods/2d0d102e-e368-4b6a-aa6a-1c49a167cd97/volumes/kubernetes.io~csi/pvc-abf3d9ca-ab58-4d82-a2a9-d79768cd038e/mount
Output: mount.lustre: mount <lustre-fs-id>.fsx.eu-west-1.amazonaws.com@tcp:/rtv3bbmv at /var/lib/kubelet/pods/2d0d102e-e368-4b6a-aa6a-1c49a167cd97/volumes/kubernetes.io~csi/pvc-abf3d9ca-ab58-4d82-a2a9-d79768cd038e/mount failed: File exists
  Warning  FailedMount  8m1s (x85 over 19h)  kubelet, ip-x-x-x-x.eu-west-1.compute.internal  Unable to attach or mount volumes: unmounted volumes=[vol-23713], unattached volumes=[workflow-token-csshf vol-23713 vol-23714 vol-23715]: timed out waiting for the condition
  Warning  FailedMount  77s (x233 over 19h)  kubelet, ip-x-x-x-x.eu-west-1.compute.internal  Unable to attach or mount volumes: unmounted volumes=[vol-23713], unattached volumes=[vol-23713 vol-23714 vol-23715 workflow-token-csshf]: timed out waiting for the condition
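
For what it's worth, the "File exists" in the output corresponds to EEXIST (exit status 17), which seems to indicate that something is already mounted, or left half-mounted, at that kubelet target path. A minimal sketch of the kind of check that could be run on the affected node against that path, assuming k8s.io/mount-utils is available (the path argument is illustrative):

package main

import (
	"fmt"
	"os"

	mount "k8s.io/mount-utils"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: checkmount <kubelet-target-path>")
		os.Exit(2)
	}
	// e.g. /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount
	target := os.Args[1]

	mounter := mount.New("")
	notMnt, err := mounter.IsLikelyNotMountPoint(target)
	if err != nil {
		// Stat errors here often point at a stale or corrupted mount entry.
		fmt.Printf("stat %s failed: %v (corrupted mount: %v)\n", target, err, mount.IsCorruptedMnt(err))
		os.Exit(1)
	}
	fmt.Printf("%s already a mount point: %v\n", target, !notMnt)
}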

Things to note:

  • Once this happens on a node, multiple pods are affected, usually with similar creation times.
  • The same issue happens on other nodes, with different Lustre volumes involved (at different times).
  • The same Lustre volume is mounted perfectly fine on other nodes.
  • Not all Lustre volumes on the affected node have the issue; some are mounted properly.
  • The node does not recover from this situation; the affected Lustre volumes can no longer be mounted by any pod on the node.

What you expected to happen? I expect that at some point the pods stuck in ContainerCreating become able to mount the volume, or failing that, to have a way to identify that a node is not healthy "Lustre-wise".

How to reproduce it (as minimally and precisely as possible)? I don't have a reproduction recipe; it just happens randomly. There is https://github.com/kubernetes-sigs/aws-fsx-csi-driver/issues/213, which has a similar error message, but I am not sure it is the same issue, as in my case it does not seem reproducible.

Anything else we need to know? For context, here are the manifests of the StorageClass and the PVC, although as I said it happens with both statically and dynamically provisioned Lustre volumes. Also attached are the logs of the driver and the kubelet at the time of the issue: fsx-driver-kubelet.csv

StorageClass:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fsx-scratch-sc
mountOptions:
- flock
parameters:
  deploymentType: SCRATCH_2
  securityGroupIds: sg-0b7e22c1b83b54b18
  subnetId: subnet-0b2314c14de6a3ea8
provisioner: fsx.csi.aws.com
reclaimPolicy: Delete
volumeBindingMode: Immediate

PVC:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: fsx.csi.aws.com
  labels:
    app-name: executor
  name: executor-483c6798-pvc
  namespace: workflows
  resourceVersion: "444522977"
  uid: abf3d9ca-ab58-4d82-a2a9-d79768cd038e
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 120000Gi
  storageClassName: fsx-scratch-sc
  volumeMode: Filesystem
  volumeName: pvc-abf3d9ca-ab58-4d82-a2a9-d79768cd038e
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 120000Gi
  phase: Bound

Environment

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-19T08:38:20Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.11-eks-f17b81", GitCommit:"f17b810c9e5a82200d28b6210b458497ddfcf31b", GitTreeState:"clean", BuildDate:"2021-10-15T21:46:21Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
  • Driver version: v0.8.1 installed via Helm Chart v1.4.1
  • Sidecars of the driver:
  sidecars:
    livenessProbe:
      image:
        tag: v2.3.0-eks-1-20-11
    nodeDriverRegistrar:
      image:
        tag: v2.2.0-eks-1-20-11
    provisioner:
      image:
        tag: v2.2.2-eks-1-20-11
    resizer:
      image:
        tag: v1.2.0-eks-1-20-11

kanor1306 avatar Mar 15 '22 18:03 kanor1306

https://github.com/kubernetes-sigs/aws-fsx-csi-driver/pull/246 seems to address this, at least partially. Will update to 0.8.2 and see if this problem stops happening.

kanor1306 avatar May 24 '22 11:05 kanor1306

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 22 '22 11:08 k8s-triage-robot

/remove-lifecycle stale

kanor1306 avatar Aug 22 '22 12:08 kanor1306

We haven't seen it happen again, so I will close this and reopen it if it recurs.

kanor1306 avatar Aug 22 '22 12:08 kanor1306

Well, this was very unlucky. I closed the issue 3 days ago, and here we are again: it has just happened again. Reopening; I will add more details in another message.

kanor1306 avatar Aug 25 '22 10:08 kanor1306

This time, for about 4 hours a node had problems mounting one particular Lustre volume that was being used without issue on other nodes. The situation is the same as described in the original issue: pods stuck in the Init state due to problems mounting a volume. This time the error in the events of the affected pods was:

Mounting command: mount
Mounting arguments: -t lustre -o flock fs-04d34cd6cccadfcfc.fsx.eu-west-1.amazonaws.com@tcp:/7levdbmv /var/lib/kubelet/pods/012b41da-50ab-49f1-9a04-43709198d27b/volumes/kubernetes.io~csi/pv-workflow-service-mnt-cache/mount
Output: mount.lustre: mount fs-04d34cd6cccadfcfc.fsx.eu-west-1.amazonaws.com@tcp:/7levdbmv at /var/lib/kubelet/pods/012b41da-50ab-49f1-9a04-43709198d27b/volumes/kubernetes.io~csi/pv-workflow-service-mnt-cache/mount failed: File exists

Essentially the same as before.

Could someone with expertise in this driver comment on whether this mounting problem should be brought up in the main Kubernetes repository instead of here? It is not clear to me whether this is a general Kubernetes problem or something specific to the FSx driver.

kanor1306 avatar Aug 25 '22 10:08 kanor1306

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Sep 24 '22 11:09 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Oct 24 '22 11:10 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Oct 24 '22 11:10 k8s-ci-robot

/reopen /remove-lifecycle rotten

jacobwolfaws avatar Dec 05 '22 21:12 jacobwolfaws

@jacobwolfaws: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen /remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Dec 05 '22 21:12 k8s-ci-robot

/reopen /remove-lifecycle rotten

olemarkus avatar Dec 05 '22 21:12 olemarkus

@olemarkus: Reopened this issue.

In response to this:

/reopen /remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Dec 05 '22 21:12 k8s-ci-robot

Hi @kanor1306, are you still having this issue?

jacobwolfaws avatar Dec 09 '22 14:12 jacobwolfaws

Hi @jacobwolfaws, I still get the error message that I mentioned in https://github.com/kubernetes-sigs/aws-fsx-csi-driver/issues/244#issuecomment-1227090846 from time to time (around 20 times in the last week, in a cluster with a lot of scratch volume rotation).

That said, since I have implemented workarounds (retries and cordoning the nodes affected by this issue), I can more or less prevent the problems this was causing. Next week I will dive into this again and tell you for sure whether I still get pods stuck initializing due to this issue.

kanor1306 avatar Dec 09 '22 16:12 kanor1306

@kanor1306 sounds good, let me know!

jacobwolfaws avatar Dec 09 '22 21:12 jacobwolfaws

@jacobwolfaws I haven't had time to look into this seriously, and a first search through the logs does not show any hits for the mount failure, so you can close it if you want. I can reopen it if I ever see it happening again.

kanor1306 avatar Dec 22 '22 14:12 kanor1306

@kanor1306 Sounds good, it's possible one of our recent images corrected your issue. Closing

jacobwolfaws avatar Dec 22 '22 14:12 jacobwolfaws

/close

jacobwolfaws avatar Dec 22 '22 14:12 jacobwolfaws

@jacobwolfaws: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Dec 22 '22 14:12 k8s-ci-robot

/reopen

jacobwolfaws avatar Mar 24 '23 14:03 jacobwolfaws

@jacobwolfaws: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 24 '23 14:03 k8s-ci-robot

Seems to be related: https://github.com/kubernetes-sigs/aws-fsx-csi-driver/issues/180. I think this issue merits further exploration.

That said, since I have implemented workarounds (retries and cordoning the nodes affected by this issue), I can more or less prevent the problems this was causing.

This gives users a short-term potential mitigation, but I'd prefer to figure out the cause and deliver a fix

jacobwolfaws avatar Mar 24 '23 14:03 jacobwolfaws

I would love to hear more about the mitigations you put into place @kanor1306, if you are up for sharing! It seems that we are hitting this, or something very similar, as well.

colearendt avatar May 11 '23 02:05 colearendt

@colearendt it is a pretty primitive solution that may not suit every case. My use case is a lot of fault-tolerant, retryable jobs in a cluster with autoscaling enabled and a high rate of node rotation.

For our case, what seems to reduce the impact of this issue is a CronJob that runs every few minutes and checks for pods that have been in PodInitializing or ContainerCreating for more than x minutes (of course, x will depend on your workloads and on how long they may acceptably take before being flagged as problematic). The job then compiles a list of the nodes where these problematic pods are running. The idea is that those nodes are not really healthy, since they are having issues mounting Lustre volumes, so the CronJob cordons them and then deletes the pods stuck in those two states. Because of the cordoning, the nodes will not receive new workloads, and they eventually go away once the current, unaffected workloads finish. Roughly, the logic is sketched below.
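
Something along these lines (a minimal client-go sketch of the idea, not our actual job; the 15-minute threshold and the reliance on the Pending phase plus the ContainerCreating/PodInitializing waiting reasons are illustrative assumptions to tune for your own workloads):

// Sketch of the cordon-and-reap CronJob logic described above (illustrative only).
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

const stuckFor = 15 * time.Minute // the "x minutes" threshold: tune to your workloads

// stuckCreating reports whether a pod has been Pending with a container
// waiting in ContainerCreating or PodInitializing for longer than stuckFor.
func stuckCreating(p corev1.Pod) bool {
	if p.Status.Phase != corev1.PodPending || time.Since(p.CreationTimestamp.Time) < stuckFor {
		return false
	}
	statuses := append(p.Status.InitContainerStatuses, p.Status.ContainerStatuses...)
	for _, cs := range statuses {
		if w := cs.State.Waiting; w != nil && (w.Reason == "ContainerCreating" || w.Reason == "PodInitializing") {
			return true
		}
	}
	return false
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	badNodes := map[string]bool{}
	var stuck []corev1.Pod
	for _, p := range pods.Items {
		if stuckCreating(p) && p.Spec.NodeName != "" {
			badNodes[p.Spec.NodeName] = true
			stuck = append(stuck, p)
		}
	}

	// Cordon the suspect nodes so no new workloads land on them...
	for node := range badNodes {
		patch := []byte(`{"spec":{"unschedulable":true}}`)
		if _, err := client.CoreV1().Nodes().Patch(ctx, node, types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
			fmt.Printf("cordon %s: %v\n", node, err)
		}
	}
	// ...then delete the stuck pods so they get rescheduled elsewhere.
	for _, p := range stuck {
		if err := client.CoreV1().Pods(p.Namespace).Delete(ctx, p.Name, metav1.DeleteOptions{}); err != nil {
			fmt.Printf("delete %s/%s: %v\n", p.Namespace, p.Name, err)
		}
	}
}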

There is a second case, where pods get stuck in Terminating, which can have unexpected side effects and which I suspect is also caused by the same problem (but with less impact, since those workloads have already finished). In this case the CronJob compares the pod's deletionTimestamp with the current time, and if the deadline has passed and the pod is still Terminating, the pod gets force deleted. But this is only acceptable if your nodes are not very long lived; otherwise you risk unknowingly leaving things running.

So, as you can see, this is no magic bullet, just a sad workaround to handle these weird occurrences. It has worked for us, but only because our workloads tolerate this kind of solution (although it can add some delay, due to waiting before considering a pod/node faulty).

Hope that it helps, or at least gives some ideas.

kanor1306 avatar May 11 '23 10:05 kanor1306

Thanks @kanor1306! That is super helpful! Unfortunately, it is about as painful a workaround as I expected, haha. Our workloads are stateful and interactive (i.e. the user is often coupled to a pod via websockets) with not-super-great retry behavior, which makes this a pretty pernicious issue for us.

I'm glad to hear we are not the only ones with funky terminating behavior too. I left that out of my last message because I thought it might be something particular about our EKS cluster, but we never experienced this sort of thing until we added the FSx CSI driver. Our messages around cleanup / termination are usually something along the lines of:

May 11 02:29:51 ip-172-19-104-36 kubelet: time="2023-05-11T02:29:51Z" level=warning msg="Failed to remove cgroup (will retry)" error="rmdir /sys/fs/cgroup/systemd/kubepods/besteffort/podccf3e475-3d22-47c4-88be-3af8a8436257/25ca0331e442fb40f81b0fd64e8e22e4082c1a909f7e8a461c0f919831cca8da: device or resource busy"

Moreover, any attempts I have made at the host level to remedy the situation have failed as well.

It looks like we are going to need to mature our operations a bit to work around this issue. @jacobwolfaws A fix here would be super helpful for us! The available workarounds seem like a bit of a stretch IMO.

colearendt avatar May 11 '23 13:05 colearendt

@kanor1306 are you still seeing this error message:

Output: mount.lustre: mount fs-04d34cd6cccadfcfc.fsx.eu-west-1.amazonaws.com@tcp:/7levdbmv at /var/lib/kubelet/pods/012b41da-50ab-49f1-9a04-43709198d27b/volumes/kubernetes.io~csi/pv-workflow-service-mnt-cache/mount failed: File exists

This suggests an idempotency issue when mounting. We should have a check for that here: https://github.com/kubernetes-sigs/aws-fsx-csi-driver/blame/master/pkg/driver/node.go#L104, but if that's not working for you on the latest CSI driver we'll need to revisit it.
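
For anyone following along, the usual node-plugin idempotency pattern looks roughly like the sketch below, using k8s.io/mount-utils. This is not the driver's actual code; the helper name and the corrupted-mount cleanup are illustrative. The point is to treat "already mounted" as success and to clean up a stale mount so a retry can make progress instead of failing with EEXIST.

// Sketch of the common CSI node-plugin idempotency pattern around mounting.
package nodeutil // illustrative package name

import (
	"os"

	mount "k8s.io/mount-utils"
)

// ensureMounted is a hypothetical helper: if target is already a healthy mount
// point it returns success instead of mounting again (avoiding the EEXIST /
// "File exists" failure); a corrupted or stale mount is cleaned up first.
func ensureMounted(mounter mount.Interface, source, target string, opts []string) error {
	notMnt, err := mounter.IsLikelyNotMountPoint(target)
	if err != nil {
		switch {
		case os.IsNotExist(err):
			// Target directory missing: create it and proceed with the mount.
			if err := os.MkdirAll(target, 0750); err != nil {
				return err
			}
			notMnt = true
		case mount.IsCorruptedMnt(err):
			// Stale mount left behind: unmount and remove it so the retry can succeed.
			if err := mount.CleanupMountPoint(target, mounter, false); err != nil {
				return err
			}
			notMnt = true
		default:
			return err
		}
	}
	if !notMnt {
		return nil // already mounted: NodePublishVolume must be idempotent
	}
	return mounter.Mount(source, target, "lustre", opts)
}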

The terminating issue could be separate or related; if you have logs from that event that you're comfortable sharing, they might be helpful.

@colearendt Please cut a support case with the AWS team and share the case ID with me over email at [email protected]. I want to collect some additional logs to root-cause the issues you're seeing.

jacobwolfaws avatar May 12 '23 18:05 jacobwolfaws

Hi @jacobwolfaws, I no longer have access to the system that was having these issues, but I don't recall seeing them anymore while I did. My last comment about the issues and the workaround was based on the problems from the original post, but since the workaround was also hiding the errors, it is difficult to say.

kanor1306 avatar May 19 '23 11:05 kanor1306

@kanor1306 Ah I see. I'm going to close this issue since this is no longer an issue for you. @colearendt please feel free to open a new issue with logs/information regarding the specific issue you're seeing

jacobwolfaws avatar May 19 '23 13:05 jacobwolfaws

/close

jacobwolfaws avatar May 19 '23 13:05 jacobwolfaws