# csi-s3/kubelet error "umount: can't unmount /var/lib/kubelet/pods/.*/volumes/kubernetes.io~csi/.*/mount: Invalid argument"
### What happened?
I am using an S3 bucket as a volume for my app running in Kubernetes (Deployment, 1 replica, rolling update).
When I triggered the deployment of a new revision of my app, the new pod came up and the S3 bucket was attached to it. However, the old pod failed: it was terminated with exit code 137 (probably a SIGKILL caused by the short graceful-shutdown window rather than an OOM kill, because I don't see any memory-related issues right now).
The old pod is now stuck in the Terminating state, most likely because of a volume problem.
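For reference, these are the commands that show the state from the API side (pod name and namespace as in the pod description further below; the `volumeattachments` check is only a suggestion, I did not capture its output):

```console
$ kubectl -n test get pod survey-service-84cf8d9d49-xhbxq            # stuck in Terminating
$ kubectl get volumeattachments | grep pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648   # is the volume still attached to the node?
```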
Datashim cannot unmount the volume from the node where the old pod was running. Log of the csi-s3 container (csi-s3 DaemonSet pod):
2024-03-13T21:04:45.001866708Z stderr F I0313 21:04:45.001667 1 utils.go:98] GRPC request: {}
2024-03-13T21:04:45.001885266Z stderr F I0313 21:04:45.001715 1 utils.go:103] GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}}]}
2024-03-13T21:05:33.635813571Z stderr F I0313 21:05:33.635664 1 utils.go:97] GRPC call: /csi.v1.Node/NodeGetCapabilities
2024-03-13T21:05:33.635874907Z stderr F I0313 21:05:33.635690 1 utils.go:98] GRPC request: {}
2024-03-13T21:05:33.63589224Z stderr F I0313 21:05:33.635736 1 utils.go:103] GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}}]}
2024-03-13T21:05:37.030960679Z stderr F I0313 21:05:37.027972 1 utils.go:97] GRPC call: /csi.v1.Node/NodeUnpublishVolume
2024-03-13T21:05:37.030981533Z stderr F I0313 21:05:37.027993 1 utils.go:98] GRPC request: {"target_path":"/var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount","volume_id":"pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648"}
2024-03-13T21:05:37.035823492Z stderr F I0313 21:05:37.035232 1 util.go:75] Found matching pid 87 on path /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount
2024-03-13T21:05:37.035848546Z stderr F I0313 21:05:37.035254 1 mounter.go:80] Found fuse pid 87 of mount /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount, checking if it still runs
2024-03-13T21:05:37.035853313Z stderr F I0313 21:05:37.035273 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:37.135710883Z stderr F I0313 21:05:37.135582 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:37.336121403Z stderr F I0313 21:05:37.335983 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:37.636843237Z stderr F I0313 21:05:37.636538 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:38.037307936Z stderr F I0313 21:05:38.037173 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:38.538075749Z stderr F I0313 21:05:38.537942 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:39.13912303Z stderr F I0313 21:05:39.138960 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:39.839917198Z stderr F I0313 21:05:39.839762 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:40.640972203Z stderr F I0313 21:05:40.640842 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:41.542074541Z stderr F I0313 21:05:41.541932 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:42.542234381Z stderr F I0313 21:05:42.542094 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:43.64277079Z stderr F I0313 21:05:43.642634 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:44.843399965Z stderr F I0313 21:05:44.843259 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:46.14365466Z stderr F I0313 21:05:46.143473 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:47.543820527Z stderr F I0313 21:05:47.543692 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:49.044170561Z stderr F I0313 21:05:49.044067 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:50.645274452Z stderr F I0313 21:05:50.645124 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:52.345495246Z stderr F I0313 21:05:52.345336 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:54.145711584Z stderr F I0313 21:05:54.145589 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:56.046229119Z stderr F E0313 21:05:56.046082 1 utils.go:101] GRPC error: rpc error: code = Internal desc = Timeout waiting for PID 87 to end
2024-03-13T21:05:56.625507593Z stderr F I0313 21:05:56.625330 1 utils.go:97] GRPC call: /csi.v1.Node/NodeUnpublishVolume
2024-03-13T21:05:56.625532689Z stderr F I0313 21:05:56.625367 1 utils.go:98] GRPC request: {"target_path":"/var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount","volume_id":"pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648"}
2024-03-13T21:05:56.626775Z stderr F E0313 21:05:56.626655 1 utils.go:101] GRPC error: rpc error: code = Internal desc = unmount failed: exit status 1
2024-03-13T21:05:56.626785867Z stderr F Unmounting arguments: /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount
2024-03-13T21:05:56.626789088Z stderr F Output: umount: can't unmount /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount: Invalid argument
2024-03-13T21:05:57.632955973Z stderr F I0313 21:05:57.632811 1 utils.go:97] GRPC call: /csi.v1.Node/NodeUnpublishVolume
2024-03-13T21:05:57.632982344Z stderr F I0313 21:05:57.632830 1 utils.go:98] GRPC request: {"target_path":"/var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount","volume_id":"pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648"}
2024-03-13T21:05:57.634391298Z stderr F E0313 21:05:57.634256 1 utils.go:101] GRPC error: rpc error: code = Internal desc = unmount failed: exit status 1
2024-03-13T21:05:57.634405692Z stderr F Unmounting arguments: /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount
2024-03-13T21:05:57.634410456Z stderr F Output: umount: can't unmount /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount: Invalid argument
From then on, the message
Output: umount: can't unmount /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount: Invalid argument
repeats indefinitely.
The kubelet produces similar logs on the node where the failed pod is located:
-- Logs begin at Wed 2024-03-06 20:07:38 UTC, end at Thu 2024-03-14 13:33:03 UTC. --
Mar 14 13:33:03 ip-172-31-30-130.eu-central-1.compute.internal kubelet[3207]: Output: umount: can't unmount /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount: Invalid argument
Mar 14 13:33:03 ip-172-31-30-130.eu-central-1.compute.internal kubelet[3207]: Unmounting arguments: /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount
Mar 14 13:33:03 ip-172-31-30-130.eu-central-1.compute.internal kubelet[3207]: E0314 13:33:03.729406 3207 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/ch.ctrox.csi.s3-driver^pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648 podName:36180d66-5fa5-4393-a84d-df95afe5a369 nodeName:}" failed. No retries permitted until 2024-03-14 13:35:05.729378976 +0000 UTC m=+667629.606875374 (durationBeforeRetry 2m2s). Error: UnmountVolume.TearDown failed for volume "documents-storage-s3" (UniqueName: "kubernetes.io/csi/ch.ctrox.csi.s3-driver^pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648") pod "36180d66-5fa5-4393-a84d-df95afe5a369" (UID: "36180d66-5fa5-4393-a84d-df95afe5a369") : kubernetes.io/csi: Unmounter.TearDownAt failed: rpc error: code = Internal desc = unmount failed: exit status 1
Mar 14 13:33:03 ip-172-31-30-130.eu-central-1.compute.internal kubelet[3207]: I0314 13:33:03.651414 3207 reconciler_common.go:172] "operationExecutor.UnmountVolume started for volume \"documents-storage-s3\" (UniqueName: \"kubernetes.io/csi/ch.ctrox.csi.s3-driver^pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648\") pod \"36180d66-5fa5-4393-a84d-df95afe5a369\" (UID: \"36180d66-5fa5-4393-a84d-df95afe5a369\") "
Mar 14 13:31:01 ip-172-31-30-130.eu-central-1.compute.internal kubelet[3207]: Output: umount: can't unmount /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount: Invalid argument
Mar 14 13:31:01 ip-172-31-30-130.eu-central-1.compute.internal kubelet[3207]: Unmounting arguments: /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount
Mar 14 13:31:01 ip-172-31-30-130.eu-central-1.compute.internal kubelet[3207]: E0314 13:31:01.583136 3207 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/ch.ctrox.csi.s3-driver^pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648 podName:36180d66-5fa5-4393-a84d-df95afe5a369 nodeName:}" failed. No retries permitted until 2024-03-14 13:33:03.583103649 +0000 UTC m=+667507.460600342 (durationBeforeRetry 2m2s). Error: UnmountVolume.TearDown failed for volume "documents-storage-s3" (UniqueName: "kubernetes.io/csi/ch.ctrox.csi.s3-driver^pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648") pod "36180d66-5fa5-4393-a84d-df95afe5a369" (UID: "36180d66-5fa5-4393-a84d-df95afe5a369") : kubernetes.io/csi: Unmounter.TearDownAt failed: rpc error: code = Internal desc = unmount failed: exit status 1
Even after I forcefully deleted the failed pod, the errors did not disappear.
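The force delete was roughly (pod name and namespace as above):

```console
$ kubectl delete pod survey-service-84cf8d9d49-xhbxq -n test --grace-period=0 --force
```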
Pod description (`kubectl get pod survey-service-84cf8d9d49-xhbxq -o yaml`):
apiVersion: v1
kind: Pod
metadata:
annotations:
kubectl.kubernetes.io/restartedAt: "2024-03-06T20:55:56Z"
<censored>
creationTimestamp: "2024-03-12T11:19:39Z"
deletionGracePeriodSeconds: 40
deletionTimestamp: "2024-03-13T21:05:26Z"
generateName: survey-service-84cf8d9d49-
labels:
app: survey-service
pod-template-hash: 84cf8d9d49
name: survey-service-84cf8d9d49-xhbxq
namespace: test
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: survey-service-84cf8d9d49
uid: da3867cc-cea3-49bc-b405-4d9d018541ae
resourceVersion: "125333681"
uid: 36180d66-5fa5-4393-a84d-df95afe5a369
spec:
containers:
- env:
<censored>
image: <censored>
imagePullPolicy: Always
lifecycle:
preStop:
exec:
command:
- sh
- -c
- sleep 10
livenessProbe:
failureThreshold: 3
httpGet:
path: /api/survey/actuator/health/liveness
port: 8080
scheme: HTTP
periodSeconds: 5
successThreshold: 1
timeoutSeconds: 1
name: survey-service
ports:
- containerPort: 8080
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /api/survey/actuator/health/readiness
port: 8080
scheme: HTTP
periodSeconds: 5
successThreshold: 1
timeoutSeconds: 1
resources:
limits:
memory: 2148Mi
requests:
memory: 1630Mi
startupProbe:
failureThreshold: 40
httpGet:
path: /api/survey/actuator/health/liveness
port: 8080
scheme: HTTP
periodSeconds: 5
successThreshold: 1
timeoutSeconds: 1
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /proliance360-s3
name: documents-storage-s3
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: <censored>
readOnly: true
- mountPath: <censored>
name: <censored>
- args:
- echo ${VAULT_CONFIG?} | base64 -d > /home/vault/config.json && vault agent -config=/home/vault/config.json
command:
- /bin/sh
- -ec
env:
- name: VAULT_LOG_LEVEL
value: debug
- name: VAULT_LOG_FORMAT
value: standard
- name: VAULT_CONFIG
value: <censored>
image: hashicorp/vault:1.13.1
imagePullPolicy: IfNotPresent
lifecycle: {}
name: vault-agent
resources:
limits:
memory: 64Mi
requests:
cpu: 25m
memory: 16Mi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
readOnlyRootFilesystem: true
runAsGroup: 1000
runAsNonRoot: true
runAsUser: 100
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: <censored>
readOnly: true
- mountPath: /home/vault
name: home-sidecar
- mountPath: /vault/secrets
name: vault-secrets
dnsPolicy: ClusterFirst
enableServiceLinks: true
imagePullSecrets:
- name: <censored>
initContainers:
- args:
- echo ${VAULT_CONFIG?} | base64 -d > /home/vault/config.json && vault agent -config=/home/vault/config.json
command:
- /bin/sh
- -ec
env:
- name: VAULT_LOG_LEVEL
value: debug
- name: VAULT_LOG_FORMAT
value: standard
- name: VAULT_CONFIG
value: <censored>
image: hashicorp/vault:1.13.1
imagePullPolicy: IfNotPresent
name: vault-agent-init
resources:
limits:
memory: 64Mi
requests:
cpu: 25m
memory: 16Mi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
readOnlyRootFilesystem: true
runAsGroup: 1000
runAsNonRoot: true
runAsUser: 100
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /home/vault
name: home-init
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: <censored>
readOnly: true
- mountPath: /vault/secrets
name: vault-secrets
nodeName: <censored>
preemptionPolicy: Never
priority: 1000000
priorityClassName: default-priority-class
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: <censored>
serviceAccountName: <censored>
terminationGracePeriodSeconds: 40
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: documents-storage-s3
persistentVolumeClaim:
claimName: survey-service-s3-dataset
- name: <censored>
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
- emptyDir:
medium: Memory
name: home-init
- emptyDir:
medium: Memory
name: home-sidecar
- emptyDir:
medium: Memory
name: vault-secrets
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2024-03-12T11:19:43Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2024-03-13T21:05:36Z"
reason: PodFailed
status: "False"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2024-03-13T21:05:36Z"
reason: PodFailed
status: "False"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2024-03-12T11:19:39Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: containerd://54e3be5f2ff3c7e2f672044fcbc979b7d25fed5fadce924fb7917328dfc75713
image: <censored>
imageID: <censored>
lastState: {}
name: survey-service
ready: false
restartCount: 0
started: false
state:
terminated:
containerID: containerd://54e3be5f2ff3c7e2f672044fcbc979b7d25fed5fadce924fb7917328dfc75713
exitCode: 137
finishedAt: "2024-03-13T21:05:36Z"
reason: Error
startedAt: "2024-03-12T11:19:46Z"
- containerID: containerd://5b36804afe1c892c8a8c2b85394dbd790f1beb040e2cc0c3cc59806f85440a96
image: docker.io/hashicorp/vault:1.13.1
imageID: docker.io/hashicorp/vault@sha256:b888abc3fc0529550d4a6c87884419e86b8cb736fe556e3e717a6bc50888b3b8
lastState: {}
name: vault-agent
ready: false
restartCount: 0
started: false
state:
terminated:
containerID: containerd://5b36804afe1c892c8a8c2b85394dbd790f1beb040e2cc0c3cc59806f85440a96
exitCode: 0
finishedAt: "2024-03-13T21:04:46Z"
reason: Completed
startedAt: "2024-03-12T11:19:47Z"
hostIP: 172.31.30.130
initContainerStatuses:
- containerID: containerd://1534bbe8943624f5f30534a4fcaa440fc20ca8c5c4a3fea4510738078dbb29b6
image: docker.io/hashicorp/vault:1.13.1
imageID: docker.io/hashicorp/vault@sha256:b888abc3fc0529550d4a6c87884419e86b8cb736fe556e3e717a6bc50888b3b8
lastState: {}
name: vault-agent-init
ready: true
restartCount: 0
state:
terminated:
containerID: containerd://1534bbe8943624f5f30534a4fcaa440fc20ca8c5c4a3fea4510738078dbb29b6
exitCode: 0
finishedAt: "2024-03-12T11:19:42Z"
reason: Completed
startedAt: "2024-03-12T11:19:42Z"
phase: Failed
podIP: 172.31.18.155
podIPs:
- ip: 172.31.18.155
qosClass: Burstable
startTime: "2024-03-12T11:19:39Z"
On the node where the failed pod was running, there is no active fuse filesystem mounted under `/var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/*`:
[root@ip-172-31-30-130 csi-s3]# df -HT -t fuse
Filesystem Type Size Used Avail Use% Mounted on
proliance360-staging fuse 1.2P 0 1.2P 0% /var/lib/kubelet/pods/c6ae84a5-208e-4e45-8133-442694c9a91b/volumes/kubernetes.io~csi/pvc-96b824dc-b7af-4b78-b08b-9567c4e52942/mount
proliance360-prod fuse 1.2P 0 1.2P 0% /var/lib/kubelet/pods/366dcbaf-40f7-491a-aa84-801790cd12f4/volumes/kubernetes.io~csi/pvc-7d2b2c22-7134-421d-b2ad-c1a8b8faf376/mount
proliance360-test fuse 1.2P 0 1.2P 0% /var/lib/kubelet/pods/df5557ff-b00c-4519-9e59-b7497b3b1ddb/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount
In the csi-s3 container, I can still find a leftover goofys process for the /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/ volume:
PID USER TIME COMMAND
1 root 1:15 /s3driver --v=5 --endpoint=unix:///csi/csi.sock --nodeid=ip-172-31-30-130.eu-central-1.compute.internal
20 root 0:22 [goofys]
40 root 0:26 /bin/goofys --endpoint=https://s3.eu-central-1.amazonaws.com --type-cache-ttl 1s --stat-cache-ttl 1s --dir-mode 0777 --file-mode 0777 --ht
tp-timeout 5m -o allow_other --profile=pvc-96b824dc-b7af-4b78-b08b-9567c4e52942 --region eu-central-1 proliance360-staging /var/lib/kubelet/pods/c6ae84a5-208e-
4e45-8133-442694c9a91b/volumes/kubernetes.io~csi/pvc-96b824dc-b7af-4b78-b08b-9567c4e52942/mount
62 root 3:27 /bin/goofys --endpoint=https://s3.eu-central-1.amazonaws.com --type-cache-ttl 1s --stat-cache-ttl 1s --dir-mode 0777 --file-mode 0777 --ht
tp-timeout 5m -o allow_other --profile=pvc-7d2b2c22-7134-421d-b2ad-c1a8b8faf376 --region eu-central-1 proliance360-prod /var/lib/kubelet/pods/366dcbaf-40f7-491
a-aa84-801790cd12f4/volumes/kubernetes.io~csi/pvc-7d2b2c22-7134-421d-b2ad-c1a8b8faf376/mount
87 root 0:13 /bin/goofys --endpoint=https://s3.eu-central-1.amazonaws.com --type-cache-ttl 1s --stat-cache-ttl 1s --dir-mode 0777 --file-mode 0777 --ht
tp-timeout 5m -o allow_other --profile=pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648 --region eu-central-1 proliance360-test /var/lib/kubelet/pods/36180d66-5fa5-439
3-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount
564 root 0:00 /bin/goofys --endpoint=https://s3.eu-central-1.amazonaws.com --type-cache-ttl 1s --stat-cache-ttl 1s --dir-mode 0777 --file-mode 0777 --ht
tp-timeout 5m -o allow_other --profile=pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648 --region eu-central-1 proliance360-test /var/lib/kubelet/pods/df5557ff-b00c-451
9-9e59-b7497b3b1ddb/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount
607 root 0:00 sh -c clear; (bash || ash || sh)
614 root 0:00 ash
616 root 0:00 sh -c clear; (bash || ash || sh)
624 root 0:00 ash
652 root 0:00 ps aux
653 root 0:00 less
The pod's directory /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount actually exists on the node, but it is empty:
/var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount # ls -al
total 0
drwxr-x--- 2 root root 6 Mar 12 11:19 .
drwxr-x--- 3 root root 40 Mar 12 11:19 ..
It appears that the old volume's filesystem is no longer mounted, since it does not show up in the output of `df -HT -t fuse`.
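A way to double-check from the node that the kernel really has no mount entry left for the old pod's path (suggested verification only; I did not capture the output at the time):

```console
$ grep 36180d66-5fa5-4393-a84d-df95afe5a369 /proc/mounts
$ mountpoint /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount
```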
My guess is that the pod is stuck in the Terminating state because the kubelet cannot finish some of its cleanup tasks (maybe admission controllers or the garbage collector are involved) and therefore leaves the pod in this state. I want to fix that.
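The only manual workaround I can think of (untested, sketch only) is to kill the leftover goofys process for the old pod's target path inside the csi-s3 container on that node and let the kubelet retry NodeUnpublishVolume; the csi-s3 pod name and namespace below are placeholders for my installation:

```console
# PID 87 is the fuse/goofys process for the old pod's mount, as seen in the csi-s3 logs above
$ kubectl -n <datashim-namespace> exec <csi-s3-pod-on-that-node> -c csi-s3 -- kill 87
# if the empty target directory still blocks the retried unmount, remove it on the node itself
$ rmdir /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount
```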
Worth mentioning: if the pod finishes without an error (i.e. it does not fail), none of these S3 errors or problems occur. Thanks in advance.
### What did you expect to happen?
The kubelet completely terminates the failed pod, and volume management stays healthy.
### Kubernetes version
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.3", GitCommit:"9e644106593f3f4aa98f8a84b23db5fa378900bd", GitTreeState:"clean", BuildDate:"2023-03-15T13:33:11Z", GoVersion:"go1.19.7", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"27+", GitVersion:"v1.27.9-eks-5e0fdde", GitCommit:"3f8ed3d5017d988600f597734a4851930eda35a6", GitTreeState:"clean", BuildDate:"2024-01-02T20:34:38Z", GoVersion:"go1.20.12", Compiler:"gc", Platform:"linux/amd64"}
### Cloud provider
AWS EKS 1.27
### OS version
# On Linux:
$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2025-06-30"
$ uname -a
Linux 5.10.209-198.858.amzn2.x86_64 #1 SMP Tue Feb 13 18:46:41 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
### Install tools
<details>
</details>
### Container runtime (CRI) and version (if applicable)
<details>
```console
$ ctr --version
ctr github.com/containerd/containerd 1.7.11
```
</details>
### Related plugins (CNI, CSI, ...) and versions (if applicable)
<details>
DLF Helm chart:
```yaml
apiVersion: v2
name: dlf-chart
description: Dataset Lifecycle Framework chart
type: application
version: 0.1.0
appVersion: 0.1.0
dependencies:
- name: csi-sidecars-rbac
version: 0.1.0
condition: csi-sidecars-rbac.enabled
- name: csi-nfs-chart
version: 0.1.0
condition: csi-nfs-chart.enabled
- name: csi-s3-chart
version: 0.1.0
condition: csi-s3-chart.enabled
- name: csi-h3-chart
version: 0.1.0
condition: csi-h3-chart.enabled
- name: dataset-operator-chart
version: 0.1.0
condition: dataset-operator-chart.enabled
```

Component image versions:

- csi-attacher-s3: registry.k8s.io/sig-storage/csi-attacher:v3.3.0
- csi-provisioner-s3: registry.k8s.io/sig-storage/csi-provisioner:v2.2.2
- csi-s3: quay.io/datashim-io/csi-s3:0.3.0
- driver-registrar: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.3.0
- dataset-operator: quay.io/datashim-io/dataset-operator:0.3.0

</details>
@Artebomba thanks for the detailed report! It seems that you have run into the same problem as #335. What we have been able to find out is that this may be caused by a change in volume attachment introduced in K8s 1.27 and we need to update csi-s3 to reflect this.
We are working on this issue and hope to have an update soon.