
Old share from old server mounted after deleting and recreating volume/deployment with the same name

Open · erSitzt opened this issue 3 years ago

What happened:

Our deployments are generated from a git repo; volumes are generated from a config file and get generated, numbered names like this:

[screenshot: the generated, numbered volume names]

After a user changed the target server for an SMB mount, the volume was updated and I can see the new config in the PV:

❯ k describe pv pv-cifs-artikelvortopf2web-live-only-live-1
Name:            pv-cifs-artikelvortopf2web-live-only-live-1
Labels:          kapp.k14s.io/app=1663849042183958382
                 kapp.k14s.io/association=v1.6cf92940af1aa5db892a65e20305842d
Annotations:     kapp.k14s.io/identity: v1;//PersistentVolume/pv-cifs-artikelvortopf2web-live-only-live-1;v1
                 kapp.k14s.io/original:
                   {"apiVersion":"v1","kind":"PersistentVolume","metadata":{"labels":{"kapp.k14s.io/app":"1663849042183958382","kapp.k14s.io/association":"v1...
                 kapp.k14s.io/original-diff-md5: d896c768fb4b1cfee46f39dc38dffbbc
                 pv.kubernetes.io/bound-by-controller: yes
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:
Status:          Bound
Claim:           portal/pv-cifs-artikelvortopf2web-live-only-live-1
Reclaim Policy:  Retain
Access Modes:    RWX
VolumeMode:      Filesystem
Capacity:        5Gi
Node Affinity:   <none>
Message:
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            smb.csi.k8s.io
    FSType:
    VolumeHandle:      pv-cifs-artikelvortopf2web-live-only-live-1
    ReadOnly:          false
    VolumeAttributes:      source=//srv-sql-be-live.mydomain.com/BulkFiles/artikelvortopf2
Events:                <none>

But the deployment still mounts the old server:

root@artikelvortopf2web-live-84bbf97b5d-8zthw:/var/www# mount | grep cifs
//srv-mssql-be.mydomain.com/BulkFiles on /var/www/bulk type cifs (rw,relatime,vers=3.0,cache=strict,username=user,domain=DOMAIN,uid=0,forceuid,gid=0,forcegid,addr=172.16.1.7,file_mode=0777,dir_mode=0777,soft,nounix,serverino,mapposix,rsize=1048576,wsize=1048576,echo_interval=60,actimeo=1)
root@artikelvortopf2web-live-84bbf97b5d-8zthw:/var/www# 

I'm not yet sure what the problem is; my first guess was some kind of caching, because the name of the volume does not change...

What you expected to happen:

The mount to be updated to the new server.

How to reproduce it:

Create a volume, use it, delete it, and recreate it with the same name but a different server... I will try to verify this.
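A minimal sketch of that repro (all names, the share paths, and the secret are hypothetical; this assumes the driver's static-provisioning PV format):

# Sketch of the suspected repro; all names below are hypothetical.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-smb-repro
spec:
  capacity:
    storage: 1Gi
  accessModes: ["ReadWriteMany"]
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: smb.csi.k8s.io
    volumeHandle: pv-smb-repro
    volumeAttributes:
      source: //old-server.example.com/share
    nodeStageSecretRef:
      name: smbcreds
      namespace: default
EOF
# Bind it to a PVC, mount it in a pod on some node, then:
kubectl delete pv pv-smb-repro
# Recreate the PV with the SAME name/volumeHandle but
#   source: //new-server.example.com/share
# and schedule a pod using it onto the same node. If the bug reproduces,
# `mount` inside the pod still shows //old-server.example.com/share.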

Anything else we need to know?:

Environment:

  • CSI Driver version: 1.5.0 (Helm Chart version ?)
  • Kubernetes version (use kubectl version): v1.18.10
  • OS (e.g. from /etc/os-release): 18.04.4 LTS (GNU/Linux 4.15.0-191-generic x86_64)
  • Kernel (e.g. uname -a): GNU/Linux 4.15.0-191-generic x86_64
  • Install tools:
  • Others:

erSitzt · Sep 22 '22 13:09

Same problem with the current version, 1.9.0.

Scaling up my deployment, I even get different results in each instance...

These are the mounts in three instances of the same deployment; everything is totally mixed up:

root@artikelvortopf2web-live-687c6f8f9b-sjkbj:/var/www# mount | grep cifs
//srv-mssql-be.mydomain.com/BulkFiles on /var/www/storage type cifs
//10.10.10.195/k8s-cifs/artikelvortopf/live/media on /var/www/bulk type cifs
//10.10.10.195/k8s-cifs/artikelvortopf/live/storage on /var/www/public/product_images type cifs

root@artikelvortopf2web-live-687c6f8f9b-hcl7v:/var/www# mount | grep cifs
//10.10.10.195/k8s-cifs/artikelvortopf/live/storage on /var/www/storage type cifs
//10.10.10.195/k8s-cifs/artikelvortopf/live/media on /var/www/bulk type cifs
//10.10.10.195/k8s-cifs/artikelvortopf/live/storage on /var/www/public/product_images type cifs

root@artikelvortopf2web-live-687c6f8f9b-hcl7v:/var/www# mount | grep cifs
//10.10.10.195/k8s-cifs/artikelvortopf/live/storage on /var/www/storage type cifs
//10.10.10.195/k8s-cifs/artikelvortopf/live/media on /var/www/bulk type cifs
//10.10.10.195/k8s-cifs/artikelvortopf/live/storage on /var/www/public/product_images type cifs

And these are the mounts and the paths where they should be mounted:

//10.10.10.195/k8s-cifs/artikelvortopf/live/storage => /var/www/storage
//10.10.10.195/k8s-cifs/artikelvortopf/live/media => /var/www/public/product_images
//srv-sql-be-live.mydomain.com/BulkFiles/artikelvortopf2 => /var/www/bulk

erSitzt · Sep 22 '22 15:09

So after some more checking, I found old mounts on some of the workers:

//srv-mssql-be.mydomain.com/BulkFiles on /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pv-cifs-artikelvortopf2web-live-only-live-1/globalmount

As the mountpoint uses the name of the PV, it seems the existing mount is not validated but simply reused, instead of a new mount being created with the new, correct destination.

After manually unmounting these old mounts or restarting the node, everything is mounted correctly again.

What does the lifecycle of a mount look like? When does the globalmount get unmounted? I saw them on multiple nodes where no workload/pod was running anymore.
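For anyone debugging this, a quick way to compare what the PVs claim against what a node actually has mounted (a sketch; assumes the default kubelet directory):

# On a worker node: list all CSI globalmounts.
mount | grep '/var/lib/kubelet/plugins/kubernetes.io/csi/pv/.*/globalmount'
# Against the cluster: what each PV *should* point at (non-CSI PVs show <none>).
kubectl get pv -o custom-columns='NAME:.metadata.name,SOURCE:.spec.csi.volumeAttributes.source'
# Any PV whose mounted //server/share differs from SOURCE is a stale globalmount.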

erSitzt · Sep 26 '22 08:09


I'm seeing this behaviour too, but I'm unable to find the globalmount.

I've torn everything in the cluster down and removed and reinstalled the CSI driver, but I'm still seeing this:

kubectl exec -it csi-smb-node-lc9pd -n kube-system -c smb -- mount | grep cifs

//10.20.0.10/proxmox-zfs on /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pv-cfg/globalmount type cifs
//10.20.0.10/proxmox-zfs on /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pv-smb/globalmount type cifs

The two PVs no longer exist, but I'm unable to find /var/lib/kubelet/plugins/kubernetes.io on the hosts.

Edit: nvm, found it on one host; same issue. How were you able to remove the files?
Edit 2: umount the CIFS share on the host, then delete the PV files.
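For reference, that manual cleanup amounts to something like this on the affected host (a sketch; the path assumes the default kubelet directory, and pv-smb is a placeholder for the stale PV's name):

# On the affected node; pv-smb is a placeholder for the stale PV's name.
umount /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pv-smb/globalmount
rm -rf /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pv-smb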

Thanks. I thought I was going mental.

In my case, I accidentally mounted the wrong directory in my PV, and it seems like it has persisted after that as a result.

Should this not get unmounted from all hosts when the PV is removed, or is that not possible to do?

xlanor · Sep 26 '22 21:09

@xlanor We were moving shares from older servers to new ones. Only the hostname/IP changed; paths and PV names remained unchanged, and we had mounts mixed up all over the place... not ideal :( But it is somewhat similar to your accidental wrong/corrected mount.

There should definitely be a full check on PV name / host / IP and all mount properties (which I did not test, e.g. changing the SMB version and such), and some kind of cleanup for orphaned/incorrect mounts.
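A per-node check along those lines could look like this (a sketch, not driver code; it only validates the //server/share source, not the other mount options, and assumes the default kubelet directory):

#!/usr/bin/env bash
# Flag globalmounts whose mounted source disagrees with what the PV declares.
kubectl get pv -o json \
  | jq -r '.items[] | select(.spec.csi) | "\(.metadata.name) \(.spec.csi.volumeAttributes.source)"' \
  | while read -r pv want; do
      # mount output is "SOURCE on MOUNTPOINT type cifs (...)": $1=source, $3=mountpoint
      got=$(mount | awk -v p="/var/lib/kubelet/plugins/kubernetes.io/csi/pv/$pv/globalmount" '$3 == p {print $1}')
      if [ -n "$got" ] && [ "$got" != "$want" ]; then
        echo "MISMATCH $pv: mounted $got, PV says $want"
      fi
    done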

erSitzt · Sep 27 '22 09:09

I'm more than open to contributing to this, but I would need clarification from the maintainer on what exactly the expected behaviour should be, like you said, w.r.t. the lifecycle.

I imagine this might not be a problem in the cloud, where you tear down hosts regularly and respawn them, but I'm running this on-prem in my homelab, where it is slightly more problematic.

xlanor · Sep 27 '22 20:09

How did you change the server name in the PV?

andyzhangx · Oct 10 '22 14:10

PVs are deleted and recreated, as far as I know.

erSitzt · Oct 10 '22 15:10

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Jan 08 '23 15:01

Not stale, still an issue

erSitzt · Jan 09 '23 15:01

/remove-lifecycle stale

erSitzt · Jan 16 '23 17:01

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Apr 16 '23 18:04

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · May 16 '23 18:05

/remove-lifecycle rotten

erSitzt · May 16 '23 18:05

Is there any workaround for this issue once I have encountered it? I.e., how do I force it to use the correct parameters?

OronDF343 · Oct 24 '23 11:10

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Jan 30 '24 23:01

/remove-lifecycle stale

OronDF343 · Jan 30 '24 23:01

If you change the setting in the PV, the change only takes effect when you schedule the pod with that volume mount onto another node that does not already have this volume mounted; that's by design.

Since this driver uses a globalmount, the unmount for a specific PV only happens when the last pod with that volume mount is deleted on the node.

Closing this issue since it's by design.
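In practice, that means the only clean way to pick up a changed PV without touching the node directly is to get every pod using that volume off the node first, e.g. (a sketch; <node> is a placeholder):

# Drain so the last pod using the volume leaves and the globalmount is released.
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
kubectl uncordon <node>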

andyzhangx · Feb 04 '24 03:02

I had to write a script to umount all related mounts in globalmount, even after all workloads were removed. Anyhow... as long as the behavior is not predictable for the user and changing the mount is allowed, this is shitty by design.

There should only be two options:

  • Change mount => error not allowed
  • Change mount => new mount active

Now we have a nice mix of nodes pointing to different shares... just roll the dice to see where you end up 🤷

erSitzt · Feb 05 '24 08:02


Any chance you can share this script? Still running into this issue sadly, and I suspect it just caused an outage on my end...

xlanor · Apr 01 '24 08:04

@xlanor There's not really a finished script to share... I had a list of old server names that were used in mounts before we migrated to the newer servers. The script connected to all workers, checked for old mounts using those server names, and unmounted them.

Get a list of all PVs with the UNC path of the share:

kubectl get pv -o json | jq -r '.items[] | select(.spec.csi) | "\(.metadata.name) \(.spec.csi.volumeAttributes.source)"'

pv-cifs-47111a8841df1f56cff925721f851c16e0f9a25f-only-1 //srv-mssql1.mydomain.com/BulkFiles
pv-cifs-62bdeee83e02f2194d860fc264076dc039d45a6d-only-1 //srv-sql-be1.mydomain.com/BulkFiles/artikelvortopf2
pv-cifs-6f675830524d0ba158cb5bb4d1cedda4ec19ddd5-only-1 //srv-nav-sql.mydomain.com/BULKFiles

Grep that for the old server names, connect to all workers, and run:

umount /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pv-cifs-6f675830524d0ba158cb5bb4d1cedda4ec19ddd5-only-1/globalmount

I think that was it, overall.
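Pieced together, the cleanup could look roughly like this (a sketch of the approach described above, not the original script; OLD_SERVERS and WORKERS are placeholders, and it assumes passwordless SSH to the nodes):

#!/usr/bin/env bash
# Unmount stale globalmounts that still point at the old servers.
OLD_SERVERS='srv-mssql-be|srv-nav-sql'   # placeholder pattern of old server names
WORKERS='worker1 worker2 worker3'        # placeholder node list
kubectl get pv -o json \
  | jq -r '.items[] | select(.spec.csi) | "\(.metadata.name) \(.spec.csi.volumeAttributes.source)"' \
  | grep -E "$OLD_SERVERS" \
  | while read -r pv _src; do
      for w in $WORKERS; do
        # -n keeps ssh from consuming the loop's stdin
        ssh -n "$w" "umount /var/lib/kubelet/plugins/kubernetes.io/csi/pv/$pv/globalmount" 2>/dev/null \
          && echo "unmounted $pv on $w"
      done
    done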

erSitzt · Apr 09 '24 11:04