
SMB volume mapping disconnecting on Windows nodes.

Open rreaplli opened this issue 1 year ago • 11 comments

What happened: The SMB volume intermittently fails to mount into the container. The pod events show the error below, the csi-smb-node-win driver pod logs show "access denied", and the volume shows as disconnected on the Windows worker node. We are applying the workaround below to recover, but we are looking for a permanent fix. Please share the resolution if you have come across this issue.

Error: MountVolume.MountDevice failed for volume "ntfs-logs" : kubernetes.io/csi: attacher.MountDevice failed to create dir "\var\lib\kubelet\plugins\kubernetes.io\csi\smb.csi.k8s.io\4e5012244d1604e40fc127a03220a74836a874f6d38386cf183428b777f34f64\globalmount": mkdir \var\lib\kubelet\plugins\kubernetes.io\csi\smb.csi.k8s.io\4e5012244d1604e40fc127a03220a74836a874f6d38386cf183428b777f34f64\globalmount: Cannot create a file when that file already

Workaround:

  1. Identify the broken SMB connection (it should show as Disconnected). Run: net use
  2. Remove the broken share: Remove-SmbGlobalMapping \\path
  3. Enter the credentials for the SMB share: $creds = Get-Credential
  4. Re-create the mapping: New-SmbGlobalMapping -RemotePath \\fs1.contoso.com\public -Credential $creds
  5. Pods should then be able to connect to the SMB share
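
Not part of the original report, but the manual steps above can be folded into a single node-side PowerShell sketch. This is only an illustration: the remote path and credential handling are placeholders, the Status values reported by Get-SmbGlobalMapping may differ by OS build, and the commands must run with administrative rights on the Windows worker node.

```powershell
# Hypothetical values - replace with your own share and credential source.
$remotePath = '\\fs1.contoso.com\public'
$creds      = Get-Credential   # in automation, load credentials from a secret store instead

# Find global mappings that are no longer healthy (Status other than 'OK').
$broken = Get-SmbGlobalMapping | Where-Object { $_.Status -ne 'OK' }

foreach ($mapping in $broken) {
    # Drop the broken mapping, then re-create it with fresh credentials.
    Remove-SmbGlobalMapping -RemotePath $mapping.RemotePath -Force
    New-SmbGlobalMapping -RemotePath $mapping.RemotePath -Credential $creds
}

# If the target share ended up with no mapping at all, create it again.
if (-not (Get-SmbGlobalMapping -RemotePath $remotePath -ErrorAction SilentlyContinue)) {
    New-SmbGlobalMapping -RemotePath $remotePath -Credential $creds
}
```

The same Status check is also a reasonable thing for a node-level monitoring job to alert on, which is the detection question raised later in this thread.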

What you expected to happen:

The volume should stay mapped into the container with no disconnection of the SMBGlobalMapping. Instead, the pod suddenly crashes because the volume mapping is disconnected and the volume can no longer be reached.

How to reproduce it: It is an intermittent issue; everything works fine after we reboot the node, so we are not able to reproduce it on demand.

Anything else we need to know?: We are using csi-provisioner:v3.5.0 and csi-node-driver-registrar:v2.10.0 in our environment.

Environment: dev environment

  • CSI Driver version: v2.10.0
  • Kubernetes version (use kubectl version): v1.26.9+c7606e7
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a): Windows 2022
  • Install tools:
  • Others:

rreaplli avatar Jan 18 '24 13:01 rreaplli

Questions (I don't have the solution): which version are you on, v1.14? How many SMB CSI controllers do you have? What is your backend storage?

We had similar problems with v1.12-v1.13, when we had 2 controllers set in values.yaml (deployed via Helm), with an Azure NetApp Files SMB share.

After changing to 1 controller on v1.14 we have had no drops, but it has only been a day.

deedubb avatar Jan 19 '24 07:01 deedubb

Thanks for the prompt response @deedubb. The Kubernetes version is v1.26, we have one SMB controller, and NetApp is the backend storage.

We have tried csi-node-driver-registrar v2.10.0 but had no luck; the issue reoccurs after some time.

rreaplli avatar Jan 19 '24 17:01 rreaplli

We are using v1.12 in our environment; we will try v1.14.

rreaplli avatar Jan 19 '24 17:01 rreaplli

I don't think this is related to the CSI driver version; the SMBGlobalMapping is broken on the Windows node, and that is the root cause.

andyzhangx avatar Jan 20 '24 02:01 andyzhangx

@andyzhangx could you be more verbose? Can you suggest some configuration, or how to troubleshoot and mitigate such problems? What I find annoying is that the share might go offline temporarily or connectivity might break, but the share would not re-establish on its own for me. I had to evict the node and provision a new one. I also did not have a good node readiness check to confirm the share was actually accessible. If we can detect that the share is unavailable or has gone offline, maybe an example readiness or liveness probe could be provided?

deedubb avatar Jan 20 '24 05:01 deedubb

Thanks for the reply @andyzhangx and @deedubb. Yes, exactly: the SMBGlobalMapping is broken on the node. After we applied the workaround below, it works again without a node reboot. Do we have any permanent fix for this problem?

  1. Identify the broken SMB connection (it should show as Disconnected). Run: net use
  2. Remove the broken share: Remove-SmbGlobalMapping \\path
  3. Enter the credentials for the SMB share: $creds = Get-Credential
  4. Re-create the mapping: New-SmbGlobalMapping -RemotePath \\fs1.contoso.com\public -Credential $creds
  5. Pods should then be able to connect to the SMB share

rreaplli avatar Jan 21 '24 07:01 rreaplli

When the SMB mount is broken on the node, the only Kubernetes-native way to recover is to remove that mount path by deleting the pod, after which the SMB volume mount happens again. There is no suitable fix on the CSI driver side: from the CSI driver's perspective, once the mount is complete it does not monitor whether the underlying mount stays healthy; that is out of the CSI driver's scope.

I think you need to check why the SMBGlobalMapping mount breaks so frequently on the node. Is there any way for the SMBGlobalMapping to recover by itself?

andyzhangx avatar Feb 04 '24 03:02 andyzhangx
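
For completeness, the recovery described above is an ordinary pod deletion; kubelet re-runs the SMB volume mount when the replacement pod is scheduled. A minimal example with hypothetical pod and namespace names:

```powershell
# Hypothetical names - substitute the pod that has the broken SMB mount.
kubectl delete pod my-app-0 -n my-namespace
```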

@andyzhangx I appreciate that it might not be in the CSI driver's scope; but in the "I use this software package to build and maintain my SMB/CIFS share connectivity" mindset, would you have any suggestions for how to monitor the share, repair the share, and/or mark node liveness as unhealthy?

deedubb avatar Feb 04 '24 15:02 deedubb

@andyzhangx I appreciate that it might not be in the CSI driver's scope; but in the "I use this software package to build and maintain my SMB/CIFS share connectivity" mindset, would you have any suggestions for how to monitor the share, repair the share, and/or mark node liveness as unhealthy?

@deedubb if the SMB volume is invalid on the node, the pod could have a liveness probe that checks the mount path and fails (crashing the pod) when the SMB volume is invalid; your operator could then start a new pod on the node and delete the crashing one.

andyzhangx avatar Feb 05 '24 08:02 andyzhangx
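
Not from the thread itself, just a rough sketch of that suggestion. Assuming the container image ships PowerShell and the SMB volume is mounted at a hypothetical path such as C:\mnt\smb, the check below could be run from an exec livenessProbe in the pod spec (the probe command, path, and timings all need to match your own deployment):

```powershell
# Hypothetical mount path inside the container - adjust to your volumeMounts.
$mountPath = 'C:\mnt\smb'

# Exit non-zero (failing the probe) if the path is missing or cannot be listed,
# which is the usual symptom when the SMB global mapping on the node has broken.
try {
    Get-ChildItem -LiteralPath $mountPath -ErrorAction Stop | Out-Null
    exit 0
} catch {
    exit 1
}
```

With something like this wired into livenessProbe.exec.command, kubelet restarts the container when the share stops responding; whether a restart alone is enough to re-establish the SMBGlobalMapping is exactly the open question in this thread.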

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 05 '24 09:05 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jun 04 '24 09:06 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Jul 04 '24 10:07 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jul 04 '24 10:07 k8s-ci-robot