
Mount point deletion issues causing CreateContainerError on pod

Open seif-raafat opened this issue 3 years ago • 9 comments

What happened:

Mount issues causing CreateContainerError on pod:

Failed to generate container 'xxyy' spec: failed to generate spec: failed to mkdir "/var/lib/kubelet/pods/ea38ea2f-03af-459e-8038-24b46e4aadc1/volumes/kubernetes.io~csi/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/mount": mkdir "/var/lib/kubelet/pods/ea38ea2f-03af-459e-8038-24b46e4aadc1/volumes/kubernetes.io~csi/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/mount" file exists

Looking at the driver logs, there is already an existing mount; the driver tries to delete it but the deletion apparently fails, and the cycle keeps repeating.

I0528 19:11:34.336620       1 nodeserver.go:85] NodePublishVolume: mount /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/globalmount at /var/lib/kubelet/pods/34a51e04-81fb-412e-85a8-9a343cedc6bb/volumes/kubernetes.io~csi/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/mount volumeID(corp.example.com/dfs#pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554) successfully
I0528 19:11:34.336644       1 utils.go:83] GRPC response: {}
I0528 19:11:40.734494       1 utils.go:76] GRPC call: /csi.v1.Node/NodeUnpublishVolume
I0528 19:11:40.734527       1 utils.go:77] GRPC request: {"target_path":"/var/lib/kubelet/pods/97094c31-59ad-4690-9359-53d9dea547d4/volumes/kubernetes.io~csi/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/mount","volume_id":"corp.example.com/dfs#pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554"}
I0528 19:11:40.734617       1 nodeserver.go:100] NodeUnpublishVolume: unmounting volume corp.example.com/dfs#pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554 on /var/lib/kubelet/pods/97094c31-59ad-4690-9359-53d9dea547d4/volumes/kubernetes.io~csi/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/mount
I0528 19:11:40.734638       1 mount_linux.go:294] Unmounting /var/lib/kubelet/pods/97094c31-59ad-4690-9359-53d9dea547d4/volumes/kubernetes.io~csi/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/mount
W0528 19:11:40.804126       1 mount_helper_common.go:133] Warning: "/var/lib/kubelet/pods/97094c31-59ad-4690-9359-53d9dea547d4/volumes/kubernetes.io~csi/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/mount" is not a mountpoint, deleting
I0528 19:11:40.804244       1 nodeserver.go:105] NodeUnpublishVolume: unmount volume corp.example.com/dfs#pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554 on /var/lib/kubelet/pods/97094c31-59ad-4690-9359-53d9dea547d4/volumes/kubernetes.io~csi/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/mount successfully
I0528 19:11:40.804262       1 utils.go:83] GRPC response: {}
I0528 20:08:14.208478       1 utils.go:76] GRPC call: /csi.v1.Node/NodeUnpublishVolume
I0528 20:08:14.208506       1 utils.go:77] GRPC request: {"target_path":"/var/lib/kubelet/pods/34a51e04-81fb-412e-85a8-9a343cedc6bb/volumes/kubernetes.io~csi/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/mount","volume_id":"corp.example.com/dfs#pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554"}
I0528 20:08:14.208591       1 nodeserver.go:100] NodeUnpublishVolume: unmounting volume corp.example.com/dfs#pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554 on /var/lib/kubelet/pods/34a51e04-81fb-412e-85a8-9a343cedc6bb/volumes/kubernetes.io~csi/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/mount
I0528 20:08:14.208612       1 mount_linux.go:294] Unmounting /var/lib/kubelet/pods/34a51e04-81fb-412e-85a8-9a343cedc6bb/volumes/kubernetes.io~csi/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/mount
W0528 20:08:14.265403       1 mount_helper_common.go:133] Warning: "/var/lib/kubelet/pods/34a51e04-81fb-412e-85a8-9a343cedc6bb/volumes/kubernetes.io~csi/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/mount" is not a mountpoint, deleting
I0528 20:08:14.266071       1 nodeserver.go:105] NodeUnpublishVolume: unmount volume corp.example.com/dfs#pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554 on /var/lib/kubelet/pods/34a51e04-81fb-412e-85a8-9a343cedc6bb/volumes/kubernetes.io~csi/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/mount successfully
I0528 20:08:14.266094       1 utils.go:83] GRPC response: {}
I0528 20:08:20.908469       1 utils.go:76] GRPC call: /csi.v1.Node/NodePublishVolume
I0528 20:08:20.908495       1 utils.go:77] GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/globalmount","target_path":"/var/lib/kubelet/pods/ea38ea2f-03af-459e-8038-24b46e4aadc1/volumes/kubernetes.io~csi/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/mount","volume_capability":{"AccessType":{"Mount":{"mount_flags":["vers=2.1","file_mode=0775","dir_mode=0775","uid=1000","gid=1000","domain=corp"]}},"access_mode":{"mode":5}},"volume_context":{"csi.storage.k8s.io/ephemeral":"false","csi.storage.k8s.io/pod.name":"demo-cc98b5bf8-x86xb","csi.storage.k8s.io/pod.namespace":"team-demo-prod","csi.storage.k8s.io/pod.uid":"ea38ea2f-03af-459e-8038-24b46e4aadc1","csi.storage.k8s.io/serviceAccount.name":"default","source":"//corp.example.com/dfs","storage.kubernetes.io/csiProvisionerIdentity":"1646688438914-8081-smb.csi.k8s.io"},"volume_id":"corp.example.com/dfs#pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554"}
I0528 20:08:20.939141       1 nodeserver.go:78] NodePublishVolume: mounting /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/globalmount at /var/lib/kubelet/pods/ea38ea2f-03af-459e-8038-24b46e4aadc1/volumes/kubernetes.io~csi/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/mount with mountOptions: [bind] volumeID(corp.example.com/dfs#pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554)
I0528 20:08:20.939199       1 mount_linux.go:183] Mounting cmd (mount) with arguments ( -o bind /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/globalmount /var/lib/kubelet/pods/ea38ea2f-03af-459e-8038-24b46e4aadc1/volumes/kubernetes.io~csi/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/mount)
I0528 20:08:20.946339       1 mount_linux.go:183] Mounting cmd (mount) with arguments ( -o bind,remount /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/globalmount /var/lib/kubelet/pods/ea38ea2f-03af-459e-8038-24b46e4aadc1/volumes/kubernetes.io~csi/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/mount)
I0528 20:08:20.950624       1 nodeserver.go:85] NodePublishVolume: mount /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/globalmount at /var/lib/kubelet/pods/ea38ea2f-03af-459e-8038-24b46e4aadc1/volumes/kubernetes.io~csi/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/mount volumeID(corp.example.com/dfs#pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554) successfully
I0528 20:08:20.950657       1 utils.go:83] GRPC response: {}
I0528 20:32:43.584571       1 utils.go:76] GRPC call: /csi.v1.Node/NodeUnpublishVolume
I0528 20:32:43.584595       1 utils.go:77] GRPC request: {"target_path":"/var/lib/kubelet/pods/ea38ea2f-03af-459e-8038-24b46e4aadc1/volumes/kubernetes.io~csi/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/mount","volume_id":"corp.example.com/dfs#pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554"}
I0528 20:32:43.584950       1 nodeserver.go:100] NodeUnpublishVolume: unmounting volume corp.example.com/dfs#pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554 on /var/lib/kubelet/pods/ea38ea2f-03af-459e-8038-24b46e4aadc1/volumes/kubernetes.io~csi/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/mount
I0528 20:32:43.585037       1 mount_linux.go:294] Unmounting /var/lib/kubelet/pods/ea38ea2f-03af-459e-8038-24b46e4aadc1/volumes/kubernetes.io~csi/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/mount
W0528 20:32:43.660202       1 mount_helper_common.go:133] Warning: "/var/lib/kubelet/pods/ea38ea2f-03af-459e-8038-24b46e4aadc1/volumes/kubernetes.io~csi/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/mount" is not a mountpoint, deleting
I0528 20:32:43.660321       1 nodeserver.go:105] NodeUnpublishVolume: unmount volume corp.example.com/dfs#pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554 on /var/lib/kubelet/pods/ea38ea2f-03af-459e-8038-24b46e4aadc1/volumes/kubernetes.io~csi/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/mount successfully

What you expected to happen:

  • Provide a log message explaining why the mount could not be deleted and recreated
  • Possibly create a separate mount after several failed attempts

How to reproduce it:

Not able to reproduce so far, but the issue did not go away after recreating the pod or even deleting the deployment. It successfully mounted when the pod was scheduled on a different node.

Anything else we need to know?:

Environment:

  • CSI Driver version: v1.6.0
  • Kubernetes version (use kubectl version): v1.21.5
  • OS (e.g. from /etc/os-release): Ubuntu 18.04

seif-raafat avatar Jun 02 '22 21:06 seif-raafat

I think os.Remove(mountPath) returns success even though that dir is not deleted. Could you check why removing /var/lib/kubelet/pods/97094c31-59ad-4690-9359-53d9dea547d4/volumes/kubernetes.io~csi/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/mount failed silently on that node?

andyzhangx avatar Jun 04 '22 09:06 andyzhangx

Hi @andyzhangx, thanks for the quick response! I don't have access to that node anymore since it got replaced, but I have set an alert to notify me when this happens again so I can troubleshoot it further. What steps do you suggest I take to check why it failed silently if it happens again?
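
Would capturing something like this on the affected node be a good start? (Rough sketch; the pod UID and PVC below are taken from the logs above and would need to be adjusted to whichever instance fails next.)

# is the target path still listed in the kernel mount table?
grep pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554 /proc/mounts
# can the directory itself still be stat'ed, and does the parent listing look sane?
stat /var/lib/kubelet/pods/97094c31-59ad-4690-9359-53d9dea547d4/volumes/kubernetes.io~csi/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554/mount
ls -la /var/lib/kubelet/pods/97094c31-59ad-4690-9359-53d9dea547d4/volumes/kubernetes.io~csi/pvc-aa0842c8-daf2-4924-8fbd-d32f627ca554
# any CIFS errors from the kernel around that time?
dmesg -T | grep -i cifs | tail -n 20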

seif-raafat avatar Jun 06 '22 15:06 seif-raafat

Hi @andyzhangx! The issue has come up again, with a similar though not identical error message to the one posted previously. This time I do not see any error messages in the CSI driver logs. I am still getting the CreateContainerError on the pod and can verify it from the kubelet logs. Any ideas on this?

csi smb driver daemon set pod logs

I0617 20:35:50.014716       1 main.go:90] set up prometheus server on [::]:29645
I0617 20:35:50.015356       1 smb.go:80] 
DRIVER INFORMATION:
-------------------
Build Date: "2022-05-02T08:19:59Z"
Compiler: gc
Driver Name: smb.csi.k8s.io
Driver Version: v1.6.0
Git Commit: ""
Go Version: go1.17.3
Platform: linux/amd64

Streaming logs below:
I0617 20:35:50.015451       1 mount_linux.go:208] Detected OS without systemd
I0617 20:35:50.015458       1 driver.go:93] Enabling controller service capability: CREATE_DELETE_VOLUME
I0617 20:35:50.015462       1 driver.go:93] Enabling controller service capability: SINGLE_NODE_MULTI_WRITER
I0617 20:35:50.015465       1 driver.go:112] Enabling volume access mode: SINGLE_NODE_WRITER
I0617 20:35:50.015468       1 driver.go:112] Enabling volume access mode: SINGLE_NODE_READER_ONLY
I0617 20:35:50.015471       1 driver.go:112] Enabling volume access mode: SINGLE_NODE_SINGLE_WRITER
I0617 20:35:50.015473       1 driver.go:112] Enabling volume access mode: SINGLE_NODE_MULTI_WRITER
I0617 20:35:50.015475       1 driver.go:112] Enabling volume access mode: MULTI_NODE_READER_ONLY
I0617 20:35:50.015478       1 driver.go:112] Enabling volume access mode: MULTI_NODE_SINGLE_WRITER
I0617 20:35:50.015480       1 driver.go:112] Enabling volume access mode: MULTI_NODE_MULTI_WRITER
I0617 20:35:50.015482       1 driver.go:103] Enabling node service capability: STAGE_UNSTAGE_VOLUME
I0617 20:35:50.015485       1 driver.go:103] Enabling node service capability: SINGLE_NODE_MULTI_WRITER
I0617 20:35:50.015488       1 driver.go:103] Enabling node service capability: VOLUME_MOUNT_GROUP
I0617 20:35:50.015611       1 server.go:118] Listening for connections on address: &net.UnixAddr{Name:"//csi/csi.sock", Net:"unix"}
I0617 20:35:50.220860       1 utils.go:76] GRPC call: /csi.v1.Identity/GetPluginInfo
I0617 20:35:50.220881       1 utils.go:77] GRPC request: {}
I0617 20:35:50.222195       1 utils.go:83] GRPC response: {"name":"smb.csi.k8s.io","vendor_version":"v1.6.0"}
I0617 20:35:50.547281       1 utils.go:76] GRPC call: /csi.v1.Identity/GetPluginInfo
I0617 20:35:50.547297       1 utils.go:77] GRPC request: {}
I0617 20:35:50.547370       1 utils.go:83] GRPC response: {"name":"smb.csi.k8s.io","vendor_version":"v1.6.0"}
I0617 20:35:50.940214       1 utils.go:76] GRPC call: /csi.v1.Node/NodeGetInfo
I0617 20:35:50.940231       1 utils.go:77] GRPC request: {}
I0617 20:35:50.940275       1 utils.go:83] GRPC response: {"node_id":"kon-shared-w2-core-nonprod-a2-u1-p2-658d885695-2jjw9"}
I0617 21:11:07.889950       1 utils.go:76] GRPC call: /csi.v1.Node/NodeStageVolume
I0617 21:11:07.889971       1 utils.go:77] GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/globalmount","volume_capability":{"AccessType":{"Mount":{"mount_flags":["vers=3","file_mode=0775","dir_mode=0775","uid=1000","gid=1000","domain=corp"]}},"access_mode":{"mode":5}},"volume_context":{"source":"//corp.demo.com/dfs","storage.kubernetes.io/csiProvisionerIdentity":"1654548771536-8081-smb.csi.k8s.io"},"volume_id":"corp.demo.com/dfs#pvc-91da69c7-a5e0-406e-93ed-1575abf5565f#"}
I0617 21:11:07.890261       1 nodeserver.go:187] NodeStageVolume: targetPath(/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/globalmount) volumeID(corp.demo.com/dfs#pvc-91da69c7-a5e0-406e-93ed-1575abf5565f#) context(map[source://corp.demo.com/dfs storage.kubernetes.io/csiProvisionerIdentity:1654548771536-8081-smb.csi.k8s.io]) mountflags([vers=3 file_mode=0775 dir_mode=0775 uid=1000 gid=1000 domain=corp]) mountOptions([vers=3 file_mode=0775 dir_mode=0775 uid=1000 gid=1000 domain=corp])
I0617 21:11:07.891369       1 mount_linux.go:183] Mounting cmd (mount) with arguments (-t cifs -o vers=3,file_mode=0775,dir_mode=0775,uid=1000,gid=1000,domain=corp,<masked> //corp.demo.com/dfs /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/globalmount)
I0617 21:11:08.053410       1 nodeserver.go:216] volume(corp.demo.com/dfs#pvc-91da69c7-a5e0-406e-93ed-1575abf5565f#) mount "//corp.demo.com/dfs" on "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/globalmount" succeeded
I0617 21:11:08.053439       1 utils.go:83] GRPC response: {}
I0617 21:11:08.297000       1 utils.go:76] GRPC call: /csi.v1.Node/NodePublishVolume
I0617 21:11:08.297035       1 utils.go:77] GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/globalmount","target_path":"/var/lib/kubelet/pods/f3b83910-f1ed-48f7-8024-058be1efdea6/volumes/kubernetes.io~csi/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/mount","volume_capability":{"AccessType":{"Mount":{"mount_flags":["vers=3","file_mode=0775","dir_mode=0775","uid=1000","gid=1000","domain=corp"]}},"access_mode":{"mode":5}},"volume_context":{"csi.storage.k8s.io/ephemeral":"false","csi.storage.k8s.io/pod.name":"demo-8f6698cdb-rkj9r","csi.storage.k8s.io/pod.namespace":"team-demo-uat","csi.storage.k8s.io/pod.uid":"f3b83910-f1ed-48f7-8024-058be1efdea6","csi.storage.k8s.io/serviceAccount.name":"default","source":"//corp.demo.com/dfs","storage.kubernetes.io/csiProvisionerIdentity":"1654548771536-8081-smb.csi.k8s.io"},"volume_id":"corp.demo.com/dfs#pvc-91da69c7-a5e0-406e-93ed-1575abf5565f#"}
I0617 21:11:08.300370       1 nodeserver.go:78] NodePublishVolume: mounting /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/globalmount at /var/lib/kubelet/pods/f3b83910-f1ed-48f7-8024-058be1efdea6/volumes/kubernetes.io~csi/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/mount with mountOptions: [bind] volumeID(corp.demo.com/dfs#pvc-91da69c7-a5e0-406e-93ed-1575abf5565f#)
I0617 21:11:08.300399       1 mount_linux.go:183] Mounting cmd (mount) with arguments ( -o bind /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/globalmount /var/lib/kubelet/pods/f3b83910-f1ed-48f7-8024-058be1efdea6/volumes/kubernetes.io~csi/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/mount)
I0617 21:11:08.301843       1 mount_linux.go:183] Mounting cmd (mount) with arguments ( -o bind,remount /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/globalmount /var/lib/kubelet/pods/f3b83910-f1ed-48f7-8024-058be1efdea6/volumes/kubernetes.io~csi/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/mount)
I0617 21:11:08.303692       1 nodeserver.go:85] NodePublishVolume: mount /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/globalmount at /var/lib/kubelet/pods/f3b83910-f1ed-48f7-8024-058be1efdea6/volumes/kubernetes.io~csi/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/mount volumeID(corp.demo.com/dfs#pvc-91da69c7-a5e0-406e-93ed-1575abf5565f#) successfully
I0617 21:11:08.303711       1 utils.go:83] GRPC response: {}

kubelet error logs

Jul 03 18:02:21 node-p2-658d885695-2jjw9 kubelet[2760]: E0703 18:02:21.161869    2760 remote_runtime.go:228] "CreateContainer in sandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to generate container \"2636dac4304fad5b43b30b4ac012a0ad0553362e90b07f6763ff554a19ca3e5f\" spec: failed to generate spec: failed to mkdir \"/var/lib/kubelet/pods/f3b83910-f1ed-48f7-8024-058be1efdea6/volumes/kubernetes.io~csi/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/mount\": mkdir /var/lib/kubelet/pods/f3b83910-f1ed-48f7-8024-058be1efdea6/volumes/kubernetes.io~csi/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/mount: file exists" podSandboxID="333430c68e9db4c5032ac0d497e72d89f2a125767538bd05be09eea44145e17a"
Jul 03 18:02:21 node-p2-658d885695-2jjw9 kubelet[2760]: E0703 18:02:21.162163    2760 kuberuntime_manager.go:870] container &Container{...} start failed in pod demo-8f6698cdb-rkj9r_team-demo-uat(f3b83910-f1ed-48f7-8024-058be1efdea6): CreateContainerError: failed to generate container "2636dac4304fad5b43b30b4ac012a0ad0553362e90b07f6763ff554a19ca3e5f" spec: failed to generate spec: failed to mkdir "/var/lib/kubelet/pods/f3b83910-f1ed-48f7-8024-058be1efdea6/volumes/kubernetes.io~csi/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/mount": mkdir /var/lib/kubelet/pods/f3b83910-f1ed-48f7-8024-058be1efdea6/volumes/kubernetes.io~csi/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/mount: file exists
Jul 03 18:02:21 node-p2-658d885695-2jjw9 kubelet[2760]: E0703 18:02:21.162235    2760 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"demo\" with CreateContainerError: \"failed to generate container \\\"2636dac4304fad5b43b30b4ac012a0ad0553362e90b07f6763ff554a19ca3e5f\\\" spec: failed to generate spec: failed to mkdir \\\"/var/lib/kubelet/pods/f3b83910-f1ed-48f7-8024-058be1efdea6/volumes/kubernetes.io~csi/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/mount\\\": mkdir /var/lib/kubelet/pods/f3b83910-f1ed-48f7-8024-058be1efdea6/volumes/kubernetes.io~csi/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/mount: file exists\"" pod="team-demo-uat/demo-8f6698cdb-rkj9r" podUID=f3b83910-f1ed-48f7-8024-058be1efdea6

seif-raafat avatar Jul 05 '22 19:07 seif-raafat

The mount path /var/lib/kubelet/pods/f3b83910-f1ed-48f7-8024-058be1efdea6/volumes/kubernetes.io~csi/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/mount was mounted successfully on 6/17, so why would kubelet try to mount that path again on 7/3?

andyzhangx avatar Jul 06 '22 07:07 andyzhangx

Confused about that as well. Here's some info from the node:

root@node-p2-658d885695-2jjw9:~# ls /var/lib/kubelet/pods/f3b83910-f1ed-48f7-8024-058be1efdea6/volumes/kubernetes.io~csi/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/mount
ls: cannot access '/var/lib/kubelet/pods/f3b83910-f1ed-48f7-8024-058be1efdea6/volumes/kubernetes.io~csi/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/mount': No such file or directory
root@node-p2-658d885695-2jjw9:~# ls -lah /var/lib/kubelet/pods/f3b83910-f1ed-48f7-8024-058be1efdea6/volumes/kubernetes.io~csi/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f
ls: cannot access '/var/lib/kubelet/pods/f3b83910-f1ed-48f7-8024-058be1efdea6/volumes/kubernetes.io~csi/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/mount': No such file or directory
total 12K
drwxr-x--- 3 root root 4.0K Jun 17 21:11 .
drwxr-x--- 3 root root 4.0K Jun 17 21:11 ..
d????????? ? ?    ?       ?            ? mount
-rw-r--r-- 1 root root  350 Jun 17 21:11 vol_data.json
root@node-p2-658d885695-2jjw9:~# stat /var/lib/kubelet/pods/f3b83910-f1ed-48f7-8024-058be1efdea6/volumes/kubernetes.io~csi/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/mount
stat: cannot stat '/var/lib/kubelet/pods/f3b83910-f1ed-48f7-8024-058be1efdea6/volumes/kubernetes.io~csi/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/mount': No such file or directory
root@node-p2-658d885695-2jjw9:~# mount | grep cifs
//corp.demo.com/dfs on /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/globalmount type cifs (rw,relatime,vers=3,cache=strict,username=***stripped***,domain=corp,uid=1000,forceuid,gid=1000,forcegid,addr=***stripped***,file_mode=0775,dir_mode=0775,soft,nounix,mapposix,rsize=4194304,wsize=4194304,bsize=1048576,echo_interval=60,actimeo=1)
//corp.demo.com/dfs on /var/lib/kubelet/pods/f3b83910-f1ed-48f7-8024-058be1efdea6/volumes/kubernetes.io~csi/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/mount type cifs (rw,relatime,vers=3,cache=strict,username=***stripped***,domain=corp,uid=1000,forceuid,gid=1000,forcegid,addr=***stripped***,file_mode=0775,dir_mode=0775,soft,nounix,mapposix,rsize=4194304,wsize=4194304,bsize=1048576,echo_interval=60,actimeo=1)

Could this be a stale file handle, @andyzhangx?
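
If it helps, I can also grab the kernel-side CIFS state next time; roughly something like this (sketch, assuming the cifs kernel module's proc files are present on the node):

# kernel-side CIFS errors around the time it broke
dmesg -T | grep -i cifs | tail -n 50
# CIFS sessions and tree connects as the kernel currently sees them
cat /proc/fs/cifs/DebugData
cat /proc/fs/cifs/Stats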

seif-raafat avatar Jul 06 '22 14:07 seif-raafat

@seif-raafat that means the SMB connection is broken. Does a new mount work?

andyzhangx avatar Jul 07 '22 02:07 andyzhangx

Hi @andyzhangx! Yes, I was able to create a new mount on the same node successfully. How can we mitigate this issue, since it does not resolve on its own? Right now we have to manually recreate the pod to force a new mount (a possible alternative is sketched after the output below).

root@node-p2-658d885695-2jjw9:~# ls -lah /var/lib/kubelet/pods/67f04126-4f32-41e2-b308-a51a0b7b80da/volumes/kubernetes.io~csi/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/
total 20K
drwxr-x--- 3 root   root   4.0K Jul  7 14:36 .
drwxr-x--- 3 root   root   4.0K Jul  7 14:36 ..
drwxrwxr-x 2 anthos anthos 8.0K Apr 29 04:19 mount
-rw-r--r-- 1 root   root    350 Jul  7 14:36 vol_data.json
root@node-p2-658d885695-2jjw9:~# stat /var/lib/kubelet/pods/67f04126-4f32-41e2-b308-a51a0b7b80da/volumes/kubernetes.io~csi/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/
  File: /var/lib/kubelet/pods/67f04126-4f32-41e2-b308-a51a0b7b80da/volumes/kubernetes.io~csi/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/
  Size: 4096      	Blocks: 8          IO Block: 4096   directory
Device: 801h/2049d	Inode: 4144384     Links: 3
Access: (0750/drwxr-x---)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-07-07 14:39:09.180301559 +0000
Modify: 2022-07-07 14:36:35.856362809 +0000
Change: 2022-07-07 14:36:35.856362809 +0000
 Birth: -
root@node-p2-658d885695-2jjw9:~# mount | grep cifs
//corp.demo.com/dfs on /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/globalmount type cifs (rw,relatime,vers=3,cache=strict,username=***stripped***,domain=corp,uid=1000,forceuid,gid=1000,forcegid,addr=***stripped***,file_mode=0775,dir_mode=0775,soft,nounix,mapposix,rsize=4194304,wsize=4194304,bsize=1048576,echo_interval=60,actimeo=1)
//corp.demo.com/dfs on /var/lib/kubelet/pods/f3b83910-f1ed-48f7-8024-058be1efdea6/volumes/kubernetes.io~csi/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/mount type cifs (rw,relatime,vers=3,cache=strict,username=***stripped***,domain=corp,uid=1000,forceuid,gid=1000,forcegid,addr=***stripped***,file_mode=0775,dir_mode=0775,soft,nounix,mapposix,rsize=4194304,wsize=4194304,bsize=1048576,echo_interval=60,actimeo=1)
//corp.demo.com/dfs on /var/lib/kubelet/pods/67f04126-4f32-41e2-b308-a51a0b7b80da/volumes/kubernetes.io~csi/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/mount type cifs (rw,relatime,vers=3,cache=strict,username=***stripped***,domain=corp,uid=1000,forceuid,gid=1000,forcegid,addr=***stripped***,file_mode=0775,dir_mode=0775,soft,nounix,mapposix,rsize=4194304,wsize=4194304,bsize=1048576,echo_interval=60,actimeo=1)
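
One less disruptive workaround might be to lazily detach just the broken per-pod bind mount and let kubelet publish the volume again, since the new pod's bind mount from the same globalmount works (untested sketch, use with care):

# detach the stale bind mount left behind by the old pod
umount -l /var/lib/kubelet/pods/f3b83910-f1ed-48f7-8024-058be1efdea6/volumes/kubernetes.io~csi/pvc-91da69c7-a5e0-406e-93ed-1575abf5565f/mount
# kubelet / the driver should then be able to clean up the target path and re-create the bind mount on the next sync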

seif-raafat avatar Jul 07 '22 14:07 seif-raafat

Hi @andyzhangx, following up on this: is there a way to periodically monitor the health of the mounts and log when a faulty one is found? That would help us detect issues early. Taking it a step further, could the broken SMB connection eventually be recovered from automatically?
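
As a stopgap on our side, something along these lines could run from a privileged DaemonSet or a node cron job (illustrative sketch only, not a driver feature):

#!/bin/sh
# flag CIFS mounts whose target path can no longer be stat'ed (the stale-mount signature above)
findmnt -rn -t cifs -o TARGET | while read -r target; do
  if ! stat "$target" >/dev/null 2>&1; then
    echo "stale cifs mount: $target"
  fi
done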

seif-raafat avatar Jul 11 '22 14:07 seif-raafat

Hey guys.

This is what we've been facing for some time as well. I think it started when we switched from on-prem SMB shares to DFS shares.

testcronjob              0/1   ContainerCannotRun   0   12h     10.180.64.18     ip-xx-xxx-xxx-xx.ec2.internal
test123-27626160-j8g2v   0/1   ContainerCannotRun   0   11h     10.180.97.153    ip-xx-xxx-xxx-xx.ec2.internal
xxx-27626310-2r5t5       0/1   ContainerCannotRun   0   8h      10.180.123.85    ip-xx-xxx-xxx-xx.ec2.internal
xxx-27626310-5nhmn       0/1   ContainerCannotRun   0   8h      10.180.108.148   ip-xx-xxx-xxx-xx.ec2.internal
xxx-27626310-fwb4z       0/1   ContainerCannotRun   0   8h      10.180.108.148   ip-xx-xxx-xxx-xx.ec2.internal
xxx-27626310-jwrk9       0/1   ContainerCannotRun   0   8h      10.180.64.18     ip-xx-xxx-xxx-xx.ec2.internal
xxx-27626310-ss2tg       0/1   ContainerCannotRun   0   8h      10.180.98.45     ip-xx-xxx-xxx-xx.ec2.internal
xxx-27626310-xgll4       0/1   ContainerCannotRun   0   8h      10.180.97.153    ip-xx-xxx-xxx-xx.ec2.internal
xxx-27626370-hzg84       0/1   ContainerCannotRun   0   7h43m   10.180.87.164    ip-xx-xxx-xxx-xx.ec2.internal
xxx-27626370-jldfm       0/1   ContainerCannotRun   0   7h38m   10.180.112.32    ip-xx-xxx-xxx-xx.ec2.internal
xxx-27626370-n6vv8       0/1   ContainerCannotRun   0   7h47m   10.180.115.225   ip-xx-xxx-xxx-xx.ec2.internal

On affected nodes we have the same return for globalmounts:

[ec2-user@ip-xx-xxx-xxx-xx ~]$ sudo ls -l /var/lib/kubelet/plugins/kubernetes.io/csi/pv/smb-pv
ls: cannot access /var/lib/kubelet/plugins/kubernetes.io/csi/pv/smb-pv/globalmount: No such file or directory
total 4
d????????? ? ?    ?    ?            ? globalmount
-rw-r--r-- 1 root root 63 Jul 11 08:35 vol_data.json

It happens at random and we don't know the reason, but our cron jobs (we have ~300 of them) keep getting blocked, and we have to drain the nodes.

In the kernel logs, presumably from around the time it happens, we find: [Tue Jul 12 21:30:29 2022] CIFS VFS: \share_server\share_name BAD_NETWORK_NAME: \share_server\share_name\path

In the stats file we only see: TreeConnects: 2223 total 2199 failed

Since the controller is stuck in this infinite loop, manually unmounting and remounting doesn't help for newly created pods. We have to drain the whole node and move our workloads to a newly created one.
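
In case it helps others hitting this, a quick scan for globalmount directories in that broken state (rough sketch, run on the node):

# list CSI globalmount dirs that can no longer be stat'ed (the "d?????????" state shown above)
for d in /var/lib/kubelet/plugins/kubernetes.io/csi/pv/*/globalmount; do
  stat "$d" >/dev/null 2>&1 || echo "broken: $d"
done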

Skoczek135 avatar Jul 12 '22 07:07 Skoczek135

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 11 '22 13:10 k8s-triage-robot

I think this is caused by Linux not playing well with DFS - I've seen it too and just switched to targeting the downstream server directly.

junkiebev avatar Oct 18 '22 20:10 junkiebev

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Nov 17 '22 20:11 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Dec 17 '22 21:12 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to the triage bot's /close not-planned comment above:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Dec 17 '22 21:12 k8s-ci-robot