
Manual Volume detach case not handled

Open CoreyCook8 opened this issue 1 year ago • 3 comments

What happened:

After a volume was manually detached from a VM, two pods on the same node ended up unintentionally using the same underlying volume as their mounted volume.

What you expected to happen:

In an AWS cluster, the same scenario produces this error message:

  Warning  FailedMount  14s (x6 over 30s)  kubelet            MountVolume.MountDevice failed for volume "pvc-XXXXX" : rpc error: code = Internal desc = Failed to find device path /dev/xvdaa. refusing to mount /dev/nvme3n1 because it claims to be volX but should be volY

I would expect this to be handled in a similar manner.
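
The EBS message above is essentially a pre-mount identity check: before reusing a cached device path, confirm the device actually belongs to the requested volume and refuse to mount otherwise. Below is a minimal Go sketch of that idea for Azure, assuming the attach-time LUN is known and that the standard Azure udev rules create /dev/disk/azure/scsi1/lun<N> symlinks; the helper name and paths are illustrative, not the driver's actual code.

  package main

  import (
      "fmt"
      "os"
      "path/filepath"
  )

  // verifyDeviceForLUN checks that the block device currently behind the
  // Azure LUN symlink matches the device path cached for this volume.
  // If the disk was detached out of band, the symlink is either gone or
  // points at a different device, and we refuse to proceed.
  // The symlink convention and this helper are assumptions for illustration.
  func verifyDeviceForLUN(lun int, cachedDevice string) error {
      link := fmt.Sprintf("/dev/disk/azure/scsi1/lun%d", lun)

      actual, err := filepath.EvalSymlinks(link)
      if os.IsNotExist(err) {
          return fmt.Errorf("no device found for LUN %d: disk appears to be detached", lun)
      }
      if err != nil {
          return fmt.Errorf("resolving %s: %w", link, err)
      }

      if actual != cachedDevice {
          return fmt.Errorf("device mismatch for LUN %d: cached %s, currently %s; refusing to mount", lun, cachedDevice, actual)
      }
      return nil
  }

  func main() {
      // Example: the driver believes the volume is attached at LUN 0 as /dev/sdc.
      if err := verifyDeviceForLUN(0, "/dev/sdc"); err != nil {
          fmt.Fprintln(os.Stderr, err)
          os.Exit(1)
      }
      fmt.Println("device identity matches; safe to mount")
  }

In the scenario described in this issue, the LUN symlink for disk1 would either be missing or resolve to disk2's device, so a check like this would fail loudly instead of silently reusing /dev/sdc.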

How to reproduce it:

  1. Create a pod that mounts a PVC.
  2. After the pod is running, manually detach the disk using the azure portal. (The pod will still show as running)
  3. Create another pod that mounts a PVC and assign it to the same node.
  4. Both pods should be running at this point.
  5. Delete and recreate the first pod.
  6. The pod should go into a Running state even though the volume never re-attaches.
  7. At this point, both pods will be using the same volume.
  8. To verify, exec into both pods, create a file in the mounted directory of one, and confirm it appears in the other (see the sketch after this list for another way to check).
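
Another way to do the check in step 8, without writing files, is to compare the device IDs backing the two mount points from the node (or from inside the pods). A small Go sketch, assuming the two mount paths are passed as arguments; the paths and the tool itself are illustrative and not part of the driver:

  package main

  import (
      "fmt"
      "os"
      "syscall"
  )

  // Compares the device IDs (st_dev) backing two mount points. If the
  // manual-detach bug reproduced, both PVC mounts on the node end up on
  // the same block device and the two values are equal, even though the
  // pods were created from different PVCs.
  func main() {
      if len(os.Args) != 3 {
          fmt.Fprintf(os.Stderr, "usage: %s <mountpath1> <mountpath2>\n", os.Args[0])
          os.Exit(2)
      }

      dev := func(path string) uint64 {
          fi, err := os.Stat(path)
          if err != nil {
              fmt.Fprintf(os.Stderr, "stat %s: %v\n", path, err)
              os.Exit(1)
          }
          return uint64(fi.Sys().(*syscall.Stat_t).Dev)
      }

      d1, d2 := dev(os.Args[1]), dev(os.Args[2])
      fmt.Printf("%s -> device %d\n%s -> device %d\n", os.Args[1], d1, os.Args[2], d2)
      if d1 == d2 {
          fmt.Println("both mount points are backed by the same block device")
      } else {
          fmt.Println("the mount points are backed by different devices")
      }
  }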

Anything else we need to know?:

Environment:

  • CSI Driver version: 1.30.4
  • Kubernetes version (use kubectl version): 1.28.9
  • OS (e.g. from /etc/os-release): Ubuntu 20.04.6 LTS
  • Kernel (e.g. uname -a): Linux 5.4.0-1138-azure # 145-Ubuntu SMP Fri Aug 30 16:04:18 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Others:

CoreyCook8 avatar Sep 28 '24 00:09 CoreyCook8

This is expected, since on an Azure VM the device name is not bound to the disk name. For example, disk1 is mounted as /dev/sdc; when disk1 is manually detached and disk2 is attached to the VM, disk2 is mounted as /dev/sdc. If you then delete and recreate the first pod with the disk1 volume, disk1 would still use /dev/sdc, because at that point the CSI driver thinks disk1 is still attached to the VM and simply reuses the previous device name (/dev/sdc).

BTW, manual volume detach is not a supported CSI driver scenario; it's out of the CSI driver's control.
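
To make the device-name reuse concrete: on an Azure VM, the stable identity of a data disk is its LUN, while /dev/sdX names are handed out by the kernel and can be reassigned. The following Go sketch prints which device each LUN currently resolves to (it assumes the usual /dev/disk/azure/scsi1 udev symlinks exist on the VM image; it is not part of the CSI driver). Running it before and after the manual detach/attach shows the same /dev/sdc name moving to a different disk.

  package main

  import (
      "fmt"
      "os"
      "path/filepath"
  )

  // Prints the LUN -> device-name mapping on an Azure VM by resolving the
  // udev-managed symlinks under /dev/disk/azure/scsi1. The directory layout
  // is an assumption about the VM image, not something the CSI driver owns.
  func main() {
      const lunDir = "/dev/disk/azure/scsi1"

      entries, err := os.ReadDir(lunDir)
      if err != nil {
          fmt.Fprintf(os.Stderr, "cannot read %s: %v\n", lunDir, err)
          os.Exit(1)
      }

      for _, e := range entries {
          link := filepath.Join(lunDir, e.Name())
          dev, err := filepath.EvalSymlinks(link)
          if err != nil {
              fmt.Printf("%-6s -> (unresolved: %v)\n", e.Name(), err)
              continue
          }
          fmt.Printf("%-6s -> %s\n", e.Name(), dev)
      }
  }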

andyzhangx avatar Oct 08 '24 13:10 andyzhangx

I understand that manual detach is out of the CSI driver's control. But I would expect the CSI driver to ensure that a new pod is using the volume it requested and not another pod's volume. If the pod is deleted and the new pod lands on the same VM, I would expect the CSI driver to check the device and make sure the expected volume == the actual volume.

Or, when attaching the second disk at the same device name as the first disk, it would realize that a disk should already be there, or that the first disk is no longer attached.

CoreyCook8 avatar Oct 08 '24 14:10 CoreyCook8

Due to the manual detach, the kubelet thinks that disk1 is still attached to the node, so the CSI driver won't be called (there is no NodeStageVolume call) to verify the device.

When attaching disk2 to the VM, using the same device name (/dev/sdc) is actually OK; that is also out of the CSI driver's control and is handled by the Linux kernel's disk driver. I think the main point is that after a manual detach you should reschedule the first pod to another node; that would work. Otherwise we don't have a solution for making this work, since it's out of the CSI driver's control.

andyzhangx avatar Oct 08 '24 14:10 andyzhangx

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 06 '25 14:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 05 '25 15:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Mar 07 '25 15:03 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Mar 07 '25 15:03 k8s-ci-robot