
GetDevMounts: eval symlinks found in mount table


Each Device object has a RealDev field showing the underlying device path after resolving symlinks. GetDevMounts checks each device in the mount table against the RealDev field and only returns mounts for that device. However, the mount table may have symlinks, such as /dev/mapper devices, and it should resolve those before comparing to RealDev.

What this PR does / why we need it:

This issue was reported on RHEL 8 / RHEL 9 nodes with multipath enabled.

touch /etc/multipath.conf && systemctl restart multipathd

Mounting CSI volumes then fails with:

Warning FailedMount 3s (x4 over 7s) kubelet MountVolume.SetUp failed for volume "pvc-77d05006-3912-4ba1-b6ae-f19c5e766d37" : rpc error: code = FailedPrecondition desc = volume ID: "ff8a685e-d1fc-4f05-bea6-46ba15133a4f" does not appear staged to "/var/lib/kubelet/plugins/kubernetes.io/csi/csi.vsphere.vmware.com/219f1da44c39eb24d8c03677ee0907cedd00e9a8924a4dd288720a89704e19f7/globalmount"

PublishMountVol claims that the volume is not staged because GetDevMounts did not find the device in the mount table.

2023-10-06T19:47:58.862Z DEBUG osutils/linux_os_utils.go:418 publishMountVol: device {FullPath:/dev/disk/by-id/wwn-0x6000c295f967d947c5528549554637c8 Name:wwn-0x6000c295f967d947c5528549554637c8 RealDev:/dev/dm-0}, device mounts [] {"TraceId": "e4cb48f6-dc34-4163-bc31-071b6db5c57d"}

RealDev shows as /dev/dm-0 here, but the mount table shows /dev/mapper/36000c295f967d947c5528549554637c8:

sh-5.1# mount | grep csi
/dev/mapper/36000c295f967d947c5528549554637c8 on /var/lib/kubelet/plugins/kubernetes.io/csi/csi.vsphere.vmware.com/219f1da44c39eb24d8c03677ee0907cedd00e9a8924a4dd288720a89704e19f7/globalmount type ext4 (rw,relatime,seclabel)

sh-5.1# ls -l /dev/mapper/36000c295f967d947c5528549554637c8
lrwxrwxrwx. 1 root root 7 Oct  6 19:39 /dev/mapper/36000c295f967d947c5528549554637c8 -> ../dm-0
sh-5.1# ls -l /dev/disk/by-id/wwn-0x6000c295f967d947c5528549554637c8
lrwxrwxrwx. 1 root root 10 Oct  6 19:39 /dev/disk/by-id/wwn-0x6000c295f967d947c5528549554637c8 -> ../../dm-0

It points to the same underlying device (/dev/dm-0) though.

RealDev is set in GetDevice after evaluating symlinks, but GetDevMounts compares RealDev to the device in the mount table. The mount table may reference a symlink instead of the real device.

This PR allows GetDevMounts to resolve symlinks found in the mount table before comparing them to RealDev.
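
For illustration, a minimal sketch of the idea (the helper and variable names other than GetDevMounts and RealDev are hypothetical, not the driver's actual code):

import "path/filepath"

// resolveMountSource follows any symlinks in a mount-table source path
// (for example /dev/mapper/<wwid>) so it can be compared against the
// already-resolved RealDev path; it falls back to the raw entry if
// resolution fails.
func resolveMountSource(source string) string {
	resolved, err := filepath.EvalSymlinks(source)
	if err != nil {
		return source
	}
	return resolved
}

// GetDevMounts would then match mounts by the resolved source rather
// than the raw mount-table entry, e.g.:
//   if resolveMountSource(mountInfo.Device) == dev.RealDev { ... }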

Which issue this PR fixes:

Testing done:

  1. Reproduced this issue by enabling multipath on RHEL 9.2 hosts and creating a PVC & Pod using vSphere CSI driver
  2. Updated CSI driver node daemonset to a build with this proposed fix
  3. Recreated the failed pod; it attached and mounted the volume successfully

With the fix:

2023-10-06T20:52:18.859Z DEBUG osutils/linux_os_utils.go:418 publishMountVol: device {FullPath:/dev/disk/by-id/wwn-0x6000c295f967d947c5528549554637c8 Name:wwn-0x6000c295f967d947c5528549554637c8 RealDev:/dev/dm-0}, device mounts [{"/dev/mapper/36000c295f967d947c5528549554637c8" "/var/lib/kubelet/plugins/kubernetes.io/csi/csi.vsphere.vmware.com/219f1da44c39eb24d8c03677ee0907cedd00e9a8924a4dd288720a89704e19f7/globalmount" "/dev/mapper/36000c295f967d947c5528549554637c8" "ext4" ["rw" "relatime"]}] {"TraceId": "8252d8a9-d77c-4b1f-87dc-a11122ff78d5"}

Special notes for your reviewer:

/cc @divyenpatel

Release note:

Fix GetDevMounts when device in mount table is a symlink

dobsonj avatar Oct 06 '23 22:10 dobsonj

/hold I'm considering other approaches after experimenting with other CSI drivers

dobsonj avatar Oct 11 '23 23:10 dobsonj

cc: @shalini-b

divyenpatel avatar Oct 17 '23 00:10 divyenpatel

@dobsonj what are the other approaches you are experimenting with?

Do we have common code available in https://github.com/kubernetes/mount-utils that CSI drivers can consume to take care of multipath configuration?

divyenpatel avatar Oct 17 '23 15:10 divyenpatel

/ok-to-test

divyenpatel avatar Oct 17 '23 15:10 divyenpatel

@dobsonj what is the use case you are targeting for enabling device mapper on the node?

Since logical grouping of volumes attached to the node happens for all volumes attached to the node VM (volumes from all pods on the node) rather than per pod, is there any benefit to enabling device mapper on the node?

divyenpatel avatar Oct 24 '23 18:10 divyenpatel

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 22 '24 19:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 21 '24 19:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot avatar Mar 22 '24 20:03 k8s-triage-robot

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 22 '24 20:03 k8s-ci-robot

@dobsonj @divyenpatel I'm wondering if there has been any progress on this front? I have a scenario where we may not be able to disable multipathd to get the vSphere CSI driver to work and this fix seems helpful. I'd be happy to pick up where you left off here if this hasn't already been fixed in some other way.

jrife avatar Mar 29 '24 17:03 jrife

/reopen

jrife avatar Mar 29 '24 17:03 jrife

@jrife: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 29 '24 17:03 k8s-ci-robot

@xing-yang @divyenpatel Could you help reopen this issue? @jrife could help follow up on this.

jingxu97 avatar Apr 06 '24 00:04 jingxu97

/remove-lifecycle rotten

xing-yang avatar Apr 08 '24 17:04 xing-yang

@dobsonj @divyenpatel I'm wondering if there has been any progress on this front? I have a scenario where we may not be able to disable multipathd to get the vSphere CSI driver to work and this fix seems helpful. I'd be happy to pick up where you left off here if this hasn't already been fixed in some other way.

Apologies, I've been meaning to get back to this for quite some time now.

When I first looked into this it appeared to be simply an issue with symlink evaluation in vsphere-csi-driver, but this is a broader issue for other CSI drivers as well. So while this PR does allow the mount to succeed on vsphere when multipath is enabled, I'm not sure it's the right approach.

The same problem exists for AWS EBS when multipath is enabled:

Warning FailedMount 0s kubelet MountVolume.MountDevice failed for volume "pvc-38ee8d9e-f4e9-4f52-a73b-3176e36d8f8c" : rpc error: code = Internal desc = could not format "/dev/nvme1n1" and mount it at "/var/lib/kubelet/plugins/kubernetes.io/csi/ebs.csi.aws.com/6e5575fe544292c1e04ad5e99b185211594db1a051d6e73134b26f6ae91d0723/globalmount": format of disk "/dev/nvme1n1" failed: type:("ext4") target:("/var/lib/kubelet/plugins/kubernetes.io/csi/ebs.csi.aws.com/6e5575fe544292c1e04ad5e99b185211594db1a051d6e73134b26f6ae91d0723/globalmount") options:("defaults") errcode:(exit status 1) output:(mke2fs 1.45.6 (20-Mar-2020)

And for GCP PD:

Warning FailedMount 4s (x8 over 72s) kubelet MountVolume.MountDevice failed for volume "pvc-79fd4955-c783-429b-ba55-c419e9859def" : rpc error: code = Internal desc = Failed to format and mount device from ("/dev/disk/by-id/google-pvc-79fd4955-c783-429b-ba55-c419e9859def") to ("/var/lib/kubelet/plugins/kubernetes.io/csi/pd.csi.storage.gke.io/57c27c6d933c3e4cf2de2d1e66d73045f9da9209fb79f880de7d7712b0ae03f7/globalmount") with fstype ("ext4") and options ([]): format of disk "/dev/disk/by-id/google-pvc-79fd4955-c783-429b-ba55-c419e9859def" failed: type:("ext4") target:("/var/lib/kubelet/plugins/kubernetes.io/csi/pd.csi.storage.gke.io/57c27c6d933c3e4cf2de2d1e66d73045f9da9209fb79f880de7d7712b0ae03f7/globalmount") options:("defaults") errcode:(exit status 1) output:(mke2fs 1.45.6 (20-Mar-2020)

So aws-ebs and gcp-pd both fail early, during NodeStageVolume, because the disk is already in use (by multipathd).

err = d.mounter.FormatAndMountSensitiveWithFormatOptions(source, target, fsType, mountOptions, nil, formatOptions)

However, vsphere uses a different library (akutz/gofsutil) that does not fail in NodeStageVolume:

err := gofsutil.FormatAndMount(ctx, dev.FullPath, params.StagingTarget, params.FsType, params.MntFlags...)

So the vsphere driver proceeds to format the device and then fails later in NodePublishVolume because the symlink check did not find the device in the mount table, and assumes the device was not staged after all.
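
As a rough sketch of what the mount-utils path looks like (illustrative only; the wrapper function name is made up, the dev and params names mirror the gofsutil call above, and this is not the PR's actual diff):

import (
	mountutils "k8s.io/mount-utils"
	utilexec "k8s.io/utils/exec"
)

// formatAndMountWithMountUtils stages the device the way aws-ebs and
// gcp-pd do: SafeFormatAndMount only runs mkfs on an unformatted disk
// and surfaces mkfs errors (e.g. "apparently in use by the system")
// from NodeStageVolume instead of deferring the failure.
func formatAndMountWithMountUtils(devPath, stagingTarget, fsType string, mntFlags []string) error {
	mounter := &mountutils.SafeFormatAndMount{
		Interface: mountutils.New(""),
		Exec:      utilexec.New(),
	}
	return mounter.FormatAndMount(devPath, stagingTarget, fsType, mntFlags)
}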

We should have a consistent solution for all CSI drivers. I worry about the consequences of formatting a device that multipathd has claimed... which implies that the correct solution might instead be to fail in NodeStageVolume as we have in aws-ebs and gcp-pd today. But that leaves us with no solution when multipathd is enabled with an empty config (which is required by some third-party drivers).

dobsonj avatar Apr 10 '24 16:04 dobsonj

the correct solution might instead be to fail in NodeStageVolume as we have in aws-ebs and gcp-pd today

We discussed this in the CSI meeting yesterday and agreed I should rework my PR so that the vsphere driver fails to format devices owned by multipathd in NodeStageVolume, same as aws-ebs and gcp-pd. It can be dangerous and should be prevented.

In essence, multipathd must be configured correctly on the node. If it's not, and it takes over block devices used by the CSI driver, then the CSI driver can't touch those devices.
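
Purely as an illustration of how such a guard might look (the helper name and the sysfs-based approach are assumptions, not necessarily what the reworked PR would do), a driver could check whether device-mapper already holds the whole-disk block device before formatting it:

import (
	"os"
	"path/filepath"
)

// isHeldByDeviceMapper reports whether another kernel device (for
// example a multipath dm-* map) currently holds the whole-disk block
// device, by looking at its sysfs holders directory.
func isHeldByDeviceMapper(devicePath string) (bool, error) {
	real, err := filepath.EvalSymlinks(devicePath)
	if err != nil {
		return false, err
	}
	holders, err := os.ReadDir(filepath.Join("/sys/block", filepath.Base(real), "holders"))
	if err != nil {
		return false, err
	}
	return len(holders) > 0, nil
}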

dobsonj avatar Apr 11 '24 14:04 dobsonj

@dobsonj Thanks for looking into this.

the correct solution might instead be to fail in NodeStageVolume as we have in aws-ebs and gcp-pd today

We discussed this in the CSI meeting yesterday and agreed I should rework my PR so that the vsphere driver fails to format devices owned by multipathd in NodeStageVolume, same as aws-ebs and gcp-pd. It can be dangerous and should be prevented.

For my scenario I'm afraid this would be a step in the wrong direction.

In essence, multipathd must be configured correctly on the node.

This may be easier said than done. For some context, the scenario I am trying to solve for involves having the vSphere CSI driver installed alongside the NetApp CSI Driver. NetApp Trident's iSCSI driver requires a certain multipathd configuration for each of your nodes:

defaults {
    user_friendly_names yes
    find_multipaths no
}

This will by default cause multipathd to "claim" any block device you attach to your node, including those provisioned by the vSphere CSI driver. In essence, NetApp's recommended configuration is in direct conflict with proper operation of the vSphere CSI driver. You can blacklist certain devices so that multipathd doesn't touch them,

blacklist {
       device {
               vendor  "IBM"
               product "3S42"       #DS4200 Product 10
       }
       device {
               vendor  "HP"
               product "*"
       }
       devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
       devnode "^(td|ha)d[a-z]"
}

but this is a bit fiddly and something I'd like to avoid having to ask our users to do. It's not yet clear how easy it is to craft a one-size-fits-all config here that ignores any vSphere CSI volumes and selects any volumes NetApp is interested in.
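
For what it's worth, one possible starting point would be a device blacklist entry alongside Trident's required defaults (the vendor/product strings below assume vSphere virtual disks report the usual "VMware" / "Virtual disk" SCSI inquiry data and would need to be verified on the actual nodes):

blacklist {
       device {
               vendor  "VMware"
               product "Virtual disk"
       }
}

Whether an entry like this generalizes well enough to ship as a default is exactly the open question above.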

If it's not, and it takes over block devices used by the CSI driver, then the CSI driver can't touch those devices.

I'm skeptical of the actual risk here. I agree in principle, but practically speaking, in the case of the vSphere CSI driver we're formatting, attaching, and mounting devices we know to be under the control of the vSphere CSI driver, not random devices on the system.

My hope would be that we could make this behavior configurable. By default, maybe it's safer to refuse to touch any devices claimed by multipathd, but in scenarios like mine where I'm confident there is no risk, I should be able to opt into it (e.g. --resolve-device-symlinks=(true|false)).

jrife avatar Apr 11 '24 16:04 jrife

Bump @xing-yang @dobsonj

jrife avatar Apr 30 '24 16:04 jrife

Hey @jrife, I rebased and tested a new commit to use the FormatAndMount function from mountutils, same as the other drivers, and it did not fail in NodeStageVolume. On gcp-pd, for example, the mkfs step itself fails:

/dev/disk/by-id/google-pvc-79fd4955-c783-429b-ba55-c419e9859def is apparently in use by the system; will not make a filesystem here!

This difference between platforms is puzzling to me, but we don't necessarily need to solve that here. As far as I can tell, the symlink evaluation in GetDevMounts is safe and solves the core issue. If the second commit (to use mountutils) is controversial, I can drop it, as it has no functional impact.

dobsonj avatar Apr 30 '24 22:04 dobsonj

This difference between platforms is puzzling to me, but we don't necessarily need to solve that here. As far as I can tell, the symlink evaluation in GetDevMounts is safe and solves the core issue. If the second commit (to use mountutils) is controversial, I can drop it, as it has no functional impact.

Hi @dobsonj. Thanks for following up. This is interesting. Do you mean that FormatAndMount works fine on vSphere disks even when the devices are claimed by multipathd?

jrife avatar Apr 30 '24 23:04 jrife

Hi @dobsonj. Thanks for following up. This is interesting. Do you mean that FormatAndMount works fine on vSphere disks even when the devices are claimed by multipathd?

Yes, that's correct. From the logs with this change:

I0430 23:08:06.718501       1 mount_linux.go:517] Disk "/dev/disk/by-id/wwn-0x6000c29fd7f46ef9daf1ee03307dc9d6" appears to be unformatted, attempting to format as type: "ext4" with options: [-F -m0 /dev/disk/by-id/wwn-0x6000c29fd7f46ef9daf1ee03307dc9d6]
I0430 23:08:07.806132       1 mount_linux.go:528] Disk successfully formatted (mkfs): ext4 - /dev/disk/by-id/wwn-0x6000c29fd7f46ef9daf1ee03307dc9d6 /var/lib/kubelet/plugins/kubernetes.io/csi/csi.vsphere.vmware.com/76bd659ef256a6255f2ec1f4728dba12451ab11d022e24398d94c2c8948a270a/globalmount

dobsonj avatar Apr 30 '24 23:04 dobsonj

One notable difference from that log output is that the vsphere driver formats and mounts /dev/disk/by-id/wwn-0x6000c29fd7f46ef9daf1ee03307dc9d6, which is symlinked to /dev/dm-0, and it succeeds. In contrast, aws-ebs uses /dev/nvme1n1 and gcp-pd uses /dev/disk/by-id/google-pvc-....

dobsonj avatar May 01 '24 00:05 dobsonj

/unhold

dobsonj avatar May 01 '24 15:05 dobsonj

@dobsonj OK, that works for me then. I have no strong opinion on the second commit here. I'll let @xing-yang and @divyenpatel chime in on the rest.

jrife avatar May 01 '24 15:05 jrife

I went ahead and dropped the second commit ( https://github.com/kubernetes-sigs/vsphere-csi-driver/commit/57f8773014c5947dcce1c1040bec90965ee2791a ) since it doesn't affect this issue; it just changes how the driver calls mkfs.

IMO, we can move forward with the GetDevMounts fix. /cc @xing-yang @divyenpatel

dobsonj avatar May 01 '24 17:05 dobsonj

cc @divyenpatel

xing-yang avatar May 01 '24 18:05 xing-yang

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: divyenpatel, dobsonj

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar May 10 '24 20:05 k8s-ci-robot

Running pre-checkin pipelines before merging this PR.

divyenpatel avatar May 10 '24 21:05 divyenpatel

Started Vanilla file pre-checkin pipeline... Build Number: 829

svcbot-qecnsdp avatar May 10 '24 21:05 svcbot-qecnsdp

Build ID: 829
File vanilla build status: FAILURE 
Stage before exit: e2e-tests 
Jenkins E2E Test Results: 
FAIL! -- 26 Passed | 6 Failed | 0 Pending | 816 Skipped
--- FAIL: TestE2E (4977.77s)
FAIL

Ginkgo ran 1 suite in 1h24m17.564314192s

Test Suite Failed
make: Leaving directory `/home/worker/workspace/csi-file-vanilla-precheckin/829/vsphere-csi-driver`

svcbot-qecnsdp avatar May 10 '24 23:05 svcbot-qecnsdp