# csi-s3/kubelet error "umount: can't unmount /var/lib/kubelet/pods/.*/volumes/kubernetes.io~csi/.*/mount: Invalid argument"
### What happened?
I am using an S3 bucket as a volume for my app running in Kubernetes (Deployment, 1 replica, rolling update).
When I triggered the deployment of a new revision of my app, the new pod came up and the S3 bucket was attached to it. However, the old pod failed: it was terminated with exit code 137 (probably a SIGKILL caused by the short graceful-shutdown window rather than an OOM kill, because I don't see any memory-related issues right now).
The old pod is now stuck in the Terminating state, most likely because of a volume problem.
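For reference, these are the commands that show the state from the API side (pod name and namespace as in the pod description further below; the `volumeattachments` check is only a suggestion, I did not capture its output):

```console
$ kubectl -n test get pod survey-service-84cf8d9d49-xhbxq            # stuck in Terminating
$ kubectl get volumeattachments | grep pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648   # is the volume still attached to the node?
```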
Datashim cannot unmount the volume from the node where the old pod was running. Log of the csi-s3 container (csi-s3 DaemonSet pod):
2024-03-13T21:04:45.001866708Z stderr F I0313 21:04:45.001667 1 utils.go:98] GRPC request: {}
2024-03-13T21:04:45.001885266Z stderr F I0313 21:04:45.001715 1 utils.go:103] GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}}]}
2024-03-13T21:05:33.635813571Z stderr F I0313 21:05:33.635664 1 utils.go:97] GRPC call: /csi.v1.Node/NodeGetCapabilities
2024-03-13T21:05:33.635874907Z stderr F I0313 21:05:33.635690 1 utils.go:98] GRPC request: {}
2024-03-13T21:05:33.63589224Z stderr F I0313 21:05:33.635736 1 utils.go:103] GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}}]}
2024-03-13T21:05:37.030960679Z stderr F I0313 21:05:37.027972 1 utils.go:97] GRPC call: /csi.v1.Node/NodeUnpublishVolume
2024-03-13T21:05:37.030981533Z stderr F I0313 21:05:37.027993 1 utils.go:98] GRPC request: {"target_path":"/var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount","volume_id":"pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648"}
2024-03-13T21:05:37.035823492Z stderr F I0313 21:05:37.035232 1 util.go:75] Found matching pid 87 on path /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount
2024-03-13T21:05:37.035848546Z stderr F I0313 21:05:37.035254 1 mounter.go:80] Found fuse pid 87 of mount /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount, checking if it still runs
2024-03-13T21:05:37.035853313Z stderr F I0313 21:05:37.035273 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:37.135710883Z stderr F I0313 21:05:37.135582 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:37.336121403Z stderr F I0313 21:05:37.335983 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:37.636843237Z stderr F I0313 21:05:37.636538 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:38.037307936Z stderr F I0313 21:05:38.037173 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:38.538075749Z stderr F I0313 21:05:38.537942 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:39.13912303Z stderr F I0313 21:05:39.138960 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:39.839917198Z stderr F I0313 21:05:39.839762 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:40.640972203Z stderr F I0313 21:05:40.640842 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:41.542074541Z stderr F I0313 21:05:41.541932 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:42.542234381Z stderr F I0313 21:05:42.542094 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:43.64277079Z stderr F I0313 21:05:43.642634 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:44.843399965Z stderr F I0313 21:05:44.843259 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:46.14365466Z stderr F I0313 21:05:46.143473 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:47.543820527Z stderr F I0313 21:05:47.543692 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:49.044170561Z stderr F I0313 21:05:49.044067 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:50.645274452Z stderr F I0313 21:05:50.645124 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:52.345495246Z stderr F I0313 21:05:52.345336 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:54.145711584Z stderr F I0313 21:05:54.145589 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:56.046229119Z stderr F E0313 21:05:56.046082 1 utils.go:101] GRPC error: rpc error: code = Internal desc = Timeout waiting for PID 87 to end
2024-03-13T21:05:56.625507593Z stderr F I0313 21:05:56.625330 1 utils.go:97] GRPC call: /csi.v1.Node/NodeUnpublishVolume
2024-03-13T21:05:56.625532689Z stderr F I0313 21:05:56.625367 1 utils.go:98] GRPC request: {"target_path":"/var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount","volume_id":"pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648"}
2024-03-13T21:05:56.626775Z stderr F E0313 21:05:56.626655 1 utils.go:101] GRPC error: rpc error: code = Internal desc = unmount failed: exit status 1
2024-03-13T21:05:56.626785867Z stderr F Unmounting arguments: /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount
2024-03-13T21:05:56.626789088Z stderr F Output: umount: can't unmount /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount: Invalid argument
2024-03-13T21:05:57.632955973Z stderr F I0313 21:05:57.632811 1 utils.go:97] GRPC call: /csi.v1.Node/NodeUnpublishVolume
2024-03-13T21:05:57.632982344Z stderr F I0313 21:05:57.632830 1 utils.go:98] GRPC request: {"target_path":"/var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount","volume_id":"pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648"}
2024-03-13T21:05:57.634391298Z stderr F E0313 21:05:57.634256 1 utils.go:101] GRPC error: rpc error: code = Internal desc = unmount failed: exit status 1
2024-03-13T21:05:57.634405692Z stderr F Unmounting arguments: /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount
2024-03-13T21:05:57.634410456Z stderr F Output: umount: can't unmount /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount: Invalid argument
From then on, the message
Output: umount: can't unmount /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount: Invalid argument
repeats indefinitely.
The kubelet produces similar logs on the node where the failed pod is located:
-- Logs begin at Wed 2024-03-06 20:07:38 UTC, end at Thu 2024-03-14 13:33:03 UTC. --
Mar 14 13:33:03 ip-172-31-30-130.eu-central-1.compute.internal kubelet[3207]: Output: umount: can't unmount /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount: Invalid argument
Mar 14 13:33:03 ip-172-31-30-130.eu-central-1.compute.internal kubelet[3207]: Unmounting arguments: /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount
Mar 14 13:33:03 ip-172-31-30-130.eu-central-1.compute.internal kubelet[3207]: E0314 13:33:03.729406 3207 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/ch.ctrox.csi.s3-driver^pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648 podName:36180d66-5fa5-4393-a84d-df95afe5a369 nodeName:}" failed. No retries permitted until 2024-03-14 13:35:05.729378976 +0000 UTC m=+667629.606875374 (durationBeforeRetry 2m2s). Error: UnmountVolume.TearDown failed for volume "documents-storage-s3" (UniqueName: "kubernetes.io/csi/ch.ctrox.csi.s3-driver^pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648") pod "36180d66-5fa5-4393-a84d-df95afe5a369" (UID: "36180d66-5fa5-4393-a84d-df95afe5a369") : kubernetes.io/csi: Unmounter.TearDownAt failed: rpc error: code = Internal desc = unmount failed: exit status 1
Mar 14 13:33:03 ip-172-31-30-130.eu-central-1.compute.internal kubelet[3207]: I0314 13:33:03.651414 3207 reconciler_common.go:172] "operationExecutor.UnmountVolume started for volume \"documents-storage-s3\" (UniqueName: \"kubernetes.io/csi/ch.ctrox.csi.s3-driver^pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648\") pod \"36180d66-5fa5-4393-a84d-df95afe5a369\" (UID: \"36180d66-5fa5-4393-a84d-df95afe5a369\") "
Mar 14 13:31:01 ip-172-31-30-130.eu-central-1.compute.internal kubelet[3207]: Output: umount: can't unmount /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount: Invalid argument
Mar 14 13:31:01 ip-172-31-30-130.eu-central-1.compute.internal kubelet[3207]: Unmounting arguments: /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount
Mar 14 13:31:01 ip-172-31-30-130.eu-central-1.compute.internal kubelet[3207]: E0314 13:31:01.583136 3207 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/ch.ctrox.csi.s3-driver^pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648 podName:36180d66-5fa5-4393-a84d-df95afe5a369 nodeName:}" failed. No retries permitted until 2024-03-14 13:33:03.583103649 +0000 UTC m=+667507.460600342 (durationBeforeRetry 2m2s). Error: UnmountVolume.TearDown failed for volume "documents-storage-s3" (UniqueName: "kubernetes.io/csi/ch.ctrox.csi.s3-driver^pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648") pod "36180d66-5fa5-4393-a84d-df95afe5a369" (UID: "36180d66-5fa5-4393-a84d-df95afe5a369") : kubernetes.io/csi: Unmounter.TearDownAt failed: rpc error: code = Internal desc = unmount failed: exit status 1
Even after I forcefully deleted the failed pod, the errors did not disappear.
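The force delete was roughly (pod name and namespace as above):

```console
$ kubectl delete pod survey-service-84cf8d9d49-xhbxq -n test --grace-period=0 --force
```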
Pod description (`kubectl get pod survey-service-84cf8d9d49-xhbxq -o yaml`):
apiVersion: v1
kind: Pod
metadata:
annotations:
kubectl.kubernetes.io/restartedAt: "2024-03-06T20:55:56Z"
<censored>
creationTimestamp: "2024-03-12T11:19:39Z"
deletionGracePeriodSeconds: 40
deletionTimestamp: "2024-03-13T21:05:26Z"
generateName: survey-service-84cf8d9d49-
labels:
app: survey-service
pod-template-hash: 84cf8d9d49
name: survey-service-84cf8d9d49-xhbxq
namespace: test
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: survey-service-84cf8d9d49
uid: da3867cc-cea3-49bc-b405-4d9d018541ae
resourceVersion: "125333681"
uid: 36180d66-5fa5-4393-a84d-df95afe5a369
spec:
containers:
- env:
<censored>
image: <censored>
imagePullPolicy: Always
lifecycle:
preStop:
exec:
command:
- sh
- -c
- sleep 10
livenessProbe:
failureThreshold: 3
httpGet:
path: /api/survey/actuator/health/liveness
port: 8080
scheme: HTTP
periodSeconds: 5
successThreshold: 1
timeoutSeconds: 1
name: survey-service
ports:
- containerPort: 8080
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /api/survey/actuator/health/readiness
port: 8080
scheme: HTTP
periodSeconds: 5
successThreshold: 1
timeoutSeconds: 1
resources:
limits:
memory: 2148Mi
requests:
memory: 1630Mi
startupProbe:
failureThreshold: 40
httpGet:
path: /api/survey/actuator/health/liveness
port: 8080
scheme: HTTP
periodSeconds: 5
successThreshold: 1
timeoutSeconds: 1
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /proliance360-s3
name: documents-storage-s3
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: <censored>
readOnly: true
- mountPath: <censored>
name: <censored>
- args:
- echo ${VAULT_CONFIG?} | base64 -d > /home/vault/config.json && vault agent -config=/home/vault/config.json
command:
- /bin/sh
- -ec
env:
- name: VAULT_LOG_LEVEL
value: debug
- name: VAULT_LOG_FORMAT
value: standard
- name: VAULT_CONFIG
value: <censored>
image: hashicorp/vault:1.13.1
imagePullPolicy: IfNotPresent
lifecycle: {}
name: vault-agent
resources:
limits:
memory: 64Mi
requests:
cpu: 25m
memory: 16Mi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
readOnlyRootFilesystem: true
runAsGroup: 1000
runAsNonRoot: true
runAsUser: 100
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: <censored>
readOnly: true
- mountPath: /home/vault
name: home-sidecar
- mountPath: /vault/secrets
name: vault-secrets
dnsPolicy: ClusterFirst
enableServiceLinks: true
imagePullSecrets:
- name: <censored>
initContainers:
- args:
- echo ${VAULT_CONFIG?} | base64 -d > /home/vault/config.json && vault agent -config=/home/vault/config.json
command:
- /bin/sh
- -ec
env:
- name: VAULT_LOG_LEVEL
value: debug
- name: VAULT_LOG_FORMAT
value: standard
- name: VAULT_CONFIG
value: <censored>
image: hashicorp/vault:1.13.1
imagePullPolicy: IfNotPresent
name: vault-agent-init
resources:
limits:
memory: 64Mi
requests:
cpu: 25m
memory: 16Mi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
readOnlyRootFilesystem: true
runAsGroup: 1000
runAsNonRoot: true
runAsUser: 100
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /home/vault
name: home-init
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: <censored>
readOnly: true
- mountPath: /vault/secrets
name: vault-secrets
nodeName: <censored>
preemptionPolicy: Never
priority: 1000000
priorityClassName: default-priority-class
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: <censored>
serviceAccountName: <censored>
terminationGracePeriodSeconds: 40
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: documents-storage-s3
persistentVolumeClaim:
claimName: survey-service-s3-dataset
- name: <censored>
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
- emptyDir:
medium: Memory
name: home-init
- emptyDir:
medium: Memory
name: home-sidecar
- emptyDir:
medium: Memory
name: vault-secrets
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2024-03-12T11:19:43Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2024-03-13T21:05:36Z"
reason: PodFailed
status: "False"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2024-03-13T21:05:36Z"
reason: PodFailed
status: "False"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2024-03-12T11:19:39Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: containerd://54e3be5f2ff3c7e2f672044fcbc979b7d25fed5fadce924fb7917328dfc75713
image: <censored>
imageID: <censored>
lastState: {}
name: survey-service
ready: false
restartCount: 0
started: false
state:
terminated:
containerID: containerd://54e3be5f2ff3c7e2f672044fcbc979b7d25fed5fadce924fb7917328dfc75713
exitCode: 137
finishedAt: "2024-03-13T21:05:36Z"
reason: Error
startedAt: "2024-03-12T11:19:46Z"
- containerID: containerd://5b36804afe1c892c8a8c2b85394dbd790f1beb040e2cc0c3cc59806f85440a96
image: docker.io/hashicorp/vault:1.13.1
imageID: docker.io/hashicorp/vault@sha256:b888abc3fc0529550d4a6c87884419e86b8cb736fe556e3e717a6bc50888b3b8
lastState: {}
name: vault-agent
ready: false
restartCount: 0
started: false
state:
terminated:
containerID: containerd://5b36804afe1c892c8a8c2b85394dbd790f1beb040e2cc0c3cc59806f85440a96
exitCode: 0
finishedAt: "2024-03-13T21:04:46Z"
reason: Completed
startedAt: "2024-03-12T11:19:47Z"
hostIP: 172.31.30.130
initContainerStatuses:
- containerID: containerd://1534bbe8943624f5f30534a4fcaa440fc20ca8c5c4a3fea4510738078dbb29b6
image: docker.io/hashicorp/vault:1.13.1
imageID: docker.io/hashicorp/vault@sha256:b888abc3fc0529550d4a6c87884419e86b8cb736fe556e3e717a6bc50888b3b8
lastState: {}
name: vault-agent-init
ready: true
restartCount: 0
state:
terminated:
containerID: containerd://1534bbe8943624f5f30534a4fcaa440fc20ca8c5c4a3fea4510738078dbb29b6
exitCode: 0
finishedAt: "2024-03-12T11:19:42Z"
reason: Completed
startedAt: "2024-03-12T11:19:42Z"
phase: Failed
podIP: 172.31.18.155
podIPs:
- ip: 172.31.18.155
qosClass: Burstable
startTime: "2024-03-12T11:19:39Z"
On the node where the failed pod was running, there is no active fuse filesystem mounted under `/var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/*`:
[root@ip-172-31-30-130 csi-s3]# df -HT -t fuse
Filesystem Type Size Used Avail Use% Mounted on
proliance360-staging fuse 1.2P 0 1.2P 0% /var/lib/kubelet/pods/c6ae84a5-208e-4e45-8133-442694c9a91b/volumes/kubernetes.io~csi/pvc-96b824dc-b7af-4b78-b08b-9567c4e52942/mount
proliance360-prod fuse 1.2P 0 1.2P 0% /var/lib/kubelet/pods/366dcbaf-40f7-491a-aa84-801790cd12f4/volumes/kubernetes.io~csi/pvc-7d2b2c22-7134-421d-b2ad-c1a8b8faf376/mount
proliance360-test fuse 1.2P 0 1.2P 0% /var/lib/kubelet/pods/df5557ff-b00c-4519-9e59-b7497b3b1ddb/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount
In the csi-s3 container, I can still find a leftover goofys process for the /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/ volume:
PID USER TIME COMMAND
1 root 1:15 /s3driver --v=5 --endpoint=unix:///csi/csi.sock --nodeid=ip-172-31-30-130.eu-central-1.compute.internal
20 root 0:22 [goofys]
40 root 0:26 /bin/goofys --endpoint=https://s3.eu-central-1.amazonaws.com --type-cache-ttl 1s --stat-cache-ttl 1s --dir-mode 0777 --file-mode 0777 --ht
tp-timeout 5m -o allow_other --profile=pvc-96b824dc-b7af-4b78-b08b-9567c4e52942 --region eu-central-1 proliance360-staging /var/lib/kubelet/pods/c6ae84a5-208e-
4e45-8133-442694c9a91b/volumes/kubernetes.io~csi/pvc-96b824dc-b7af-4b78-b08b-9567c4e52942/mount
62 root 3:27 /bin/goofys --endpoint=https://s3.eu-central-1.amazonaws.com --type-cache-ttl 1s --stat-cache-ttl 1s --dir-mode 0777 --file-mode 0777 --ht
tp-timeout 5m -o allow_other --profile=pvc-7d2b2c22-7134-421d-b2ad-c1a8b8faf376 --region eu-central-1 proliance360-prod /var/lib/kubelet/pods/366dcbaf-40f7-491
a-aa84-801790cd12f4/volumes/kubernetes.io~csi/pvc-7d2b2c22-7134-421d-b2ad-c1a8b8faf376/mount
87 root 0:13 /bin/goofys --endpoint=https://s3.eu-central-1.amazonaws.com --type-cache-ttl 1s --stat-cache-ttl 1s --dir-mode 0777 --file-mode 0777 --ht
tp-timeout 5m -o allow_other --profile=pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648 --region eu-central-1 proliance360-test /var/lib/kubelet/pods/36180d66-5fa5-439
3-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount
564 root 0:00 /bin/goofys --endpoint=https://s3.eu-central-1.amazonaws.com --type-cache-ttl 1s --stat-cache-ttl 1s --dir-mode 0777 --file-mode 0777 --ht
tp-timeout 5m -o allow_other --profile=pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648 --region eu-central-1 proliance360-test /var/lib/kubelet/pods/df5557ff-b00c-451
9-9e59-b7497b3b1ddb/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount
607 root 0:00 sh -c clear; (bash || ash || sh)
614 root 0:00 ash
616 root 0:00 sh -c clear; (bash || ash || sh)
624 root 0:00 ash
652 root 0:00 ps aux
653 root 0:00 less
The pod's directory /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount actually exists on the node, but it is empty:
/var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount # ls -al
total 0
drwxr-x--- 2 root root 6 Mar 12 11:19 .
drwxr-x--- 3 root root 40 Mar 12 11:19 ..
It appears that the old volume's filesystem is no longer mounted, since it does not show up in the output of `df -HT -t fuse`.
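A way to double-check from the node that the kernel really has no mount entry left for the old pod's path (suggested verification only; I did not capture the output at the time):

```console
$ grep 36180d66-5fa5-4393-a84d-df95afe5a369 /proc/mounts
$ mountpoint /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount
```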
My guess is that the pod is stuck in the Terminating state because the kubelet cannot finish some of its cleanup tasks (maybe admission controllers or the garbage collector are involved) and therefore leaves the pod in this state. I want to fix that.
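The only manual workaround I can think of (untested, sketch only) is to kill the leftover goofys process for the old pod's target path inside the csi-s3 container on that node and let the kubelet retry NodeUnpublishVolume; the csi-s3 pod name and namespace below are placeholders for my installation:

```console
# PID 87 is the fuse/goofys process for the old pod's mount, as seen in the csi-s3 logs above
$ kubectl -n <datashim-namespace> exec <csi-s3-pod-on-that-node> -c csi-s3 -- kill 87
# if the empty target directory still blocks the retried unmount, remove it on the node itself
$ rmdir /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount
```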
Worth mentioning: if the pod finishes without an error (i.e. it does not fail), none of these S3 errors or problems occur. Thanks in advance.
### What did you expect to happen?
The kubelet completely terminates the failed pod, and volume management stays healthy.
### Kubernetes version
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.3", GitCommit:"9e644106593f3f4aa98f8a84b23db5fa378900bd", GitTreeState:"clean", BuildDate:"2023-03-15T13:33:11Z", GoVersion:"go1.19.7", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"27+", GitVersion:"v1.27.9-eks-5e0fdde", GitCommit:"3f8ed3d5017d988600f597734a4851930eda35a6", GitTreeState:"clean", BuildDate:"2024-01-02T20:34:38Z", GoVersion:"go1.20.12", Compiler:"gc", Platform:"linux/amd64"}
### Cloud provider
AWS EKS 1.27
### OS version
# On Linux:
$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2025-06-30"
$ uname -a
Linux 5.10.209-198.858.amzn2.x86_64 #1 SMP Tue Feb 13 18:46:41 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
### Install tools
<details>
</details>
### Container runtime (CRI) and version (if applicable)
<details>
```console
$ ctr --version
ctr github.com/containerd/containerd 1.7.11
```
</details>
### Related plugins (CNI, CSI, ...) and versions (if applicable)
<details>
DLF Helm chart:
```yaml
apiVersion: v2
name: dlf-chart
description: Dataset Lifecycle Framework chart
type: application
version: 0.1.0
appVersion: 0.1.0
dependencies:
- name: csi-sidecars-rbac
version: 0.1.0
condition: csi-sidecars-rbac.enabled
- name: csi-nfs-chart
version: 0.1.0
condition: csi-nfs-chart.enabled
- name: csi-s3-chart
version: 0.1.0
condition: csi-s3-chart.enabled
- name: csi-h3-chart
version: 0.1.0
condition: csi-h3-chart.enabled
- name: dataset-operator-chart
version: 0.1.0
condition: dataset-operator-chart.enabled
```

Component image versions:

- csi-attacher-s3: registry.k8s.io/sig-storage/csi-attacher:v3.3.0
- csi-provisioner-s3: registry.k8s.io/sig-storage/csi-provisioner:v2.2.2
- csi-s3: quay.io/datashim-io/csi-s3:0.3.0
- driver-registrar: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.3.0
- dataset-operator: quay.io/datashim-io/dataset-operator:0.3.0

</details>
@Artebomba thanks for the detailed report! It seems that you have run into the same problem as #335. What we have been able to find out is that this may be caused by a change in volume attachment introduced in K8s 1.27 and we need to update csi-s3 to reflect this.
We are working on this issue and hope to have an update soon.