
Old share from old server mounted after deleting and recreating volume/deployment with the same name

Open · erSitzt opened this issue 3 years ago

What happened:

Our deployments are generated from a git repo; volumes are generated from a config file and get generated, numbered names like this:

[screenshot: the generated, numbered volume names]

After a user changed the target server for an SMB mount, the volume was updated and I can see the new config in the PV:

❯ k describe pv pv-cifs-artikelvortopf2web-live-only-live-1
Name:            pv-cifs-artikelvortopf2web-live-only-live-1
Labels:          kapp.k14s.io/app=1663849042183958382
                 kapp.k14s.io/association=v1.6cf92940af1aa5db892a65e20305842d
Annotations:     kapp.k14s.io/identity: v1;//PersistentVolume/pv-cifs-artikelvortopf2web-live-only-live-1;v1
                 kapp.k14s.io/original:
                   {"apiVersion":"v1","kind":"PersistentVolume","metadata":{"labels":{"kapp.k14s.io/app":"1663849042183958382","kapp.k14s.io/association":"v1...
                 kapp.k14s.io/original-diff-md5: d896c768fb4b1cfee46f39dc38dffbbc
                 pv.kubernetes.io/bound-by-controller: yes
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:
Status:          Bound
Claim:           portal/pv-cifs-artikelvortopf2web-live-only-live-1
Reclaim Policy:  Retain
Access Modes:    RWX
VolumeMode:      Filesystem
Capacity:        5Gi
Node Affinity:   <none>
Message:
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            smb.csi.k8s.io
    FSType:
    VolumeHandle:      pv-cifs-artikelvortopf2web-live-only-live-1
    ReadOnly:          false
    VolumeAttributes:      source=//srv-sql-be-live.mydomain.com/BulkFiles/artikelvortopf2
Events:                <none>

But the deployment still mounts the old server:

root@artikelvortopf2web-live-84bbf97b5d-8zthw:/var/www# mount | grep cifs
//srv-mssql-be.mydomain.com/BulkFiles on /var/www/bulk type cifs (rw,relatime,vers=3.0,cache=strict,username=user,domain=DOMAIN,uid=0,forceuid,gid=0,forcegid,addr=172.16.1.7,file_mode=0777,dir_mode=0777,soft,nounix,serverino,mapposix,rsize=1048576,wsize=1048576,echo_interval=60,actimeo=1)
root@artikelvortopf2web-live-84bbf97b5d-8zthw:/var/www# 

I'm not yet sure what the problem is; my first guess was some kind of caching, because the name of the volume does not change...

What you expected to happen:

The mount to be updated to the new server.

How to reproduce it:

Create a volume, use it, delete it, and recreate it with the same name but a different server... I will try to verify this.
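A minimal sketch of that repro (all names, the share paths, and the secret are hypothetical; this assumes the driver's static-provisioning PV format):

# Sketch of the suspected repro; all names below are hypothetical.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-smb-repro
spec:
  capacity:
    storage: 1Gi
  accessModes: ["ReadWriteMany"]
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: smb.csi.k8s.io
    volumeHandle: pv-smb-repro
    volumeAttributes:
      source: //old-server.example.com/share
    nodeStageSecretRef:
      name: smbcreds
      namespace: default
EOF
# Bind it to a PVC, mount it in a pod on some node, then:
kubectl delete pv pv-smb-repro
# Recreate the PV with the SAME name/volumeHandle but
#   source: //new-server.example.com/share
# and schedule a pod using it onto the same node. If the bug reproduces,
# `mount` inside the pod still shows //old-server.example.com/share.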

Anything else we need to know?:

Environment:

  • CSI Driver version: 1.5.0 (Helm Chart version ?)
  • Kubernetes version (use kubectl version): v1.18.10
  • OS (e.g. from /etc/os-release): 18.04.4 LTS (GNU/Linux 4.15.0-191-generic x86_64)
  • Kernel (e.g. uname -a): GNU/Linux 4.15.0-191-generic x86_64
  • Install tools:
  • Others:

erSitzt · Sep 22 '22 13:09

Same problem with the current version, 1.9.0.

Scaling up my deployment, I even get different results in each instance...

These are the mounts in three instances of the same deployment; everything is totally mixed up:

root@artikelvortopf2web-live-687c6f8f9b-sjkbj:/var/www# mount | grep cifs
//srv-mssql-be.mydomain.com/BulkFiles on /var/www/storage type cifs
//10.10.10.195/k8s-cifs/artikelvortopf/live/media on /var/www/bulk type cifs
//10.10.10.195/k8s-cifs/artikelvortopf/live/storage on /var/www/public/product_images type cifs

root@artikelvortopf2web-live-687c6f8f9b-hcl7v:/var/www# mount | grep cifs
//10.10.10.195/k8s-cifs/artikelvortopf/live/storage on /var/www/storage type cifs
//10.10.10.195/k8s-cifs/artikelvortopf/live/media on /var/www/bulk type cifs
//10.10.10.195/k8s-cifs/artikelvortopf/live/storage on /var/www/public/product_images type cifs

root@artikelvortopf2web-live-687c6f8f9b-hcl7v:/var/www# mount | grep cifs
//10.10.10.195/k8s-cifs/artikelvortopf/live/storage on /var/www/storage type cifs
//10.10.10.195/k8s-cifs/artikelvortopf/live/media on /var/www/bulk type cifs
//10.10.10.195/k8s-cifs/artikelvortopf/live/storage on /var/www/public/product_images type cifs

And these are the mounts and the paths where they should be mounted:

//10.10.10.195/k8s-cifs/artikelvortopf/live/storage => /var/www/storage
//10.10.10.195/k8s-cifs/artikelvortopf/live/media => /var/www/public/product_images
//srv-sql-be-live.mydomain.com/BulkFiles/artikelvortopf2 => /var/www/bulk

erSitzt · Sep 22 '22 15:09

So after some more checking, I found old mounts on some of the workers:

//srv-mssql-be.mydomain.com/BulkFiles on /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pv-cifs-artikelvortopf2web-live-only-live-1/globalmount

As the mountpoint uses the name of the PV, it seems the existing mount is not validated but simply reused, instead of a new mount being created with the new, correct destination.

After manually unmounting these old mounts or restarting the node, everything is mounted correctly again.

What does the lifecycle of a mount look like? When does the globalmount get unmounted? I saw them on multiple nodes where no workload/pod was running anymore.
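For anyone debugging this, a quick way to compare what the PVs claim against what a node actually has mounted (a sketch; assumes the default kubelet directory):

# On a worker node: list all CSI globalmounts.
mount | grep '/var/lib/kubelet/plugins/kubernetes.io/csi/pv/.*/globalmount'
# Against the cluster: what each PV *should* point at (non-CSI PVs show <none>).
kubectl get pv -o custom-columns='NAME:.metadata.name,SOURCE:.spec.csi.volumeAttributes.source'
# Any PV whose mounted //server/share differs from SOURCE is a stale globalmount.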

erSitzt · Sep 26 '22 08:09


I'm seeing this behaviour too, but I'm unable to find the globalmount.

I've torn everything in the cluster down and removed and reinstalled the CSI driver, but I'm still seeing this:

kubectl exec -it csi-smb-node-lc9pd -n kube-system -c smb -- mount | grep cifs

//10.20.0.10/proxmox-zfs on /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pv-cfg/globalmount type cifs
//10.20.0.10/proxmox-zfs on /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pv-smb/globalmount type cifs

The two PVs no longer exist, but I'm unable to find /var/lib/kubelet/plugins/kubernetes.io on the hosts.

Edit: nvm, found it on one host; same issue. How were you able to remove the files?
Edit 2: umount the CIFS share on the host, then delete the PV files.
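For reference, that manual cleanup amounts to something like this on the affected host (a sketch; the path assumes the default kubelet directory, and pv-smb is a placeholder for the stale PV's name):

# On the affected node; pv-smb is a placeholder for the stale PV's name.
umount /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pv-smb/globalmount
rm -rf /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pv-smb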

Thanks. I thought I was going mental.

In my case, I accidentally mounted the wrong directory in my PV, and it seems like it has persisted after that as a result.

Should this not get unmounted from all hosts when the PV is removed, or is that not possible to do?

xlanor · Sep 26 '22 21:09

@xlanor We were moving shares from older servers to new ones. Only the hostname/IP changed; paths and PV names remained unchanged, and we had mounts mixed up all over the place... not ideal :( But it is somewhat similar to your accidental wrong/corrected mount.

There should definitely be a full check on PV name / host / IP and all mount properties (which I did not test, e.g. changing the SMB version and such), and some kind of cleanup for orphaned/incorrect mounts.
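A per-node check along those lines could look like this (a sketch, not driver code; it only validates the //server/share source, not the other mount options, and assumes the default kubelet directory):

#!/usr/bin/env bash
# Flag globalmounts whose mounted source disagrees with what the PV declares.
kubectl get pv -o json \
  | jq -r '.items[] | select(.spec.csi) | "\(.metadata.name) \(.spec.csi.volumeAttributes.source)"' \
  | while read -r pv want; do
      # mount output is "SOURCE on MOUNTPOINT type cifs (...)": $1=source, $3=mountpoint
      got=$(mount | awk -v p="/var/lib/kubelet/plugins/kubernetes.io/csi/pv/$pv/globalmount" '$3 == p {print $1}')
      if [ -n "$got" ] && [ "$got" != "$want" ]; then
        echo "MISMATCH $pv: mounted $got, PV says $want"
      fi
    done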

erSitzt · Sep 27 '22 09:09

I'm more than open to contributing to this, but I would need clarification from the maintainer on what exactly the expected behaviour should be, like you said, w.r.t. the lifecycle.

I imagine this might not be a problem in the cloud, where you tear down hosts regularly and respawn them, but I'm running this on-prem in my homelab, where it is slightly more problematic.

xlanor · Sep 27 '22 20:09

How did you change the server name in the PV?

andyzhangx · Oct 10 '22 14:10

PVs are deleted and recreated, as far as I know.

erSitzt · Oct 10 '22 15:10

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Jan 08 '23 15:01

Not stale, still an issue

erSitzt · Jan 09 '23 15:01

/remove-lifecycle stale

erSitzt · Jan 16 '23 17:01

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Apr 16 '23 18:04

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · May 16 '23 18:05

/remove-lifecycle rotten

erSitzt · May 16 '23 18:05

Is there any workaround for this issue once I have encountered it? I.e., how do I force it to use the correct parameters?

OronDF343 · Oct 24 '23 11:10

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Jan 30 '24 23:01

/remove-lifecycle stale

OronDF343 · Jan 30 '24 23:01

If you change the setting in the PV, the change only takes effect when you schedule the pod with that volume mount onto another node that does not already have this volume mounted; that's by design.

Since this driver uses a globalmount, the unmount for a specific PV only happens when the last pod with that volume mount is deleted on the node.

Closing this issue since it's by design.
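In practice, that means the only clean way to pick up a changed PV without touching the node directly is to get every pod using that volume off the node first, e.g. (a sketch; <node> is a placeholder):

# Drain so the last pod using the volume leaves and the globalmount is released.
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
kubectl uncordon <node>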

andyzhangx · Feb 04 '24 03:02

I had to write a script to umount all related mounts in globalmount, even after all workloads were removed. Anyhow... as long as the behavior is not predictable for the user and changing the mount is allowed, this is shitty by design.

There should only be two options:

  • Change mount => error not allowed
  • Change mount => new mount active

Now we have a nice mix of nodes pointing to different shares... just roll the dice to see where you end up 🤷

erSitzt · Feb 05 '24 08:02


Any chance you can share this script? Still running into this issue sadly, and I suspect it just caused an outage on my end...

xlanor · Apr 01 '24 08:04

@xlanor There's not really a finished script to share... I had a list of old server names that were used in mounts before we migrated to the newer servers. The script connected to all workers, checked for old mounts using those server names, and unmounted them.

Get a list of all PVs with the UNC path of the share:

kubectl get pv -o json | jq -r '.items[] | select(.spec.csi) | "\(.metadata.name) \(.spec.csi.volumeAttributes.source)"'

pv-cifs-47111a8841df1f56cff925721f851c16e0f9a25f-only-1 //srv-mssql1.mydomain.com/BulkFiles
pv-cifs-62bdeee83e02f2194d860fc264076dc039d45a6d-only-1 //srv-sql-be1.mydomain.com/BulkFiles/artikelvortopf2
pv-cifs-6f675830524d0ba158cb5bb4d1cedda4ec19ddd5-only-1 //srv-nav-sql.mydomain.com/BULKFiles

Grep that for the old server names, connect to all workers, and run:

umount /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pv-cifs-6f675830524d0ba158cb5bb4d1cedda4ec19ddd5-only-1/globalmount

I think that was it, overall.
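Pieced together, the cleanup could look roughly like this (a sketch of the approach described above, not the original script; OLD_SERVERS and WORKERS are placeholders, and it assumes passwordless SSH to the nodes):

#!/usr/bin/env bash
# Unmount stale globalmounts that still point at the old servers.
OLD_SERVERS='srv-mssql-be|srv-nav-sql'   # placeholder pattern of old server names
WORKERS='worker1 worker2 worker3'        # placeholder node list
kubectl get pv -o json \
  | jq -r '.items[] | select(.spec.csi) | "\(.metadata.name) \(.spec.csi.volumeAttributes.source)"' \
  | grep -E "$OLD_SERVERS" \
  | while read -r pv _src; do
      for w in $WORKERS; do
        # -n keeps ssh from consuming the loop's stdin
        ssh -n "$w" "umount /var/lib/kubelet/plugins/kubernetes.io/csi/pv/$pv/globalmount" 2>/dev/null \
          && echo "unmounted $pv on $w"
      done
    done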

erSitzt · Apr 09 '24 11:04