vsphere-csi-driver
GetDevMounts: eval symlinks found in mount table
Each Device object has a RealDev field showing the underlying device path after resolving symlinks. GetDevMounts checks each device in the mount table against the RealDev field and only returns mounts for that device. However, the mount table may have symlinks, such as /dev/mapper devices, and it should resolve those before comparing to RealDev.
What this PR does / why we need it:
This issue was reported on RHEL 8 / RHEL 9 nodes with multipath enabled.
touch /etc/multipath.conf && systemctl restart multipathd
Mounting CSI volumes then fails with:
Warning FailedMount 3s (x4 over 7s) kubelet MountVolume.SetUp failed for volume "pvc-77d05006-3912-4ba1-b6ae-f19c5e766d37" : rpc error: code = FailedPrecondition desc = volume ID: "ff8a685e-d1fc-4f05-bea6-46ba15133a4f" does not appear staged to "/var/lib/kubelet/plugins/kubernetes.io/csi/csi.vsphere.vmware.com/219f1da44c39eb24d8c03677ee0907cedd00e9a8924a4dd288720a89704e19f7/globalmount"
PublishMountVol claims that the volume is not staged because GetDevMounts did not find the device in the mount table.
2023-10-06T19:47:58.862Z DEBUG osutils/linux_os_utils.go:418 publishMountVol: device {FullPath:/dev/disk/by-id/wwn-0x6000c295f967d947c5528549554637c8 Name:wwn-0x6000c295f967d947c5528549554637c8 RealDev:/dev/dm-0}, device mounts [] {"TraceId": "e4cb48f6-dc34-4163-bc31-071b6db5c57d"}
RealDev shows as /dev/dm-0 here, but the mount table shows /dev/mapper/36000c295f967d947c5528549554637c8:
sh-5.1# mount | grep csi
/dev/mapper/36000c295f967d947c5528549554637c8 on /var/lib/kubelet/plugins/kubernetes.io/csi/csi.vsphere.vmware.com/219f1da44c39eb24d8c03677ee0907cedd00e9a8924a4dd288720a89704e19f7/globalmount type ext4 (rw,relatime,seclabel)
sh-5.1# ls -l /dev/mapper/36000c295f967d947c5528549554637c8
lrwxrwxrwx. 1 root root 7 Oct 6 19:39 /dev/mapper/36000c295f967d947c5528549554637c8 -> ../dm-0
sh-5.1# ls -l /dev/disk/by-id/wwn-0x6000c295f967d947c5528549554637c8
lrwxrwxrwx. 1 root root 10 Oct 6 19:39 /dev/disk/by-id/wwn-0x6000c295f967d947c5528549554637c8 -> ../../dm-0
Both symlinks point to the same underlying device (/dev/dm-0), though.
RealDev is set in GetDevice after evaluating symlinks, but GetDevMounts compares RealDev to the device in the mount table. The mount table may reference a symlink instead of the real device.
This PR makes GetDevMounts resolve any symlinks found in the mount table before comparing them to RealDev.
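Roughly, the change works like this (a simplified sketch; the package and helper names resolveDevice and matchesRealDev are illustrative, not the exact diff in this PR):

package osutils

import "path/filepath"

// resolveDevice follows symlinks such as /dev/mapper/<wwid> so that a
// mount-table entry can be compared against RealDev, which GetDevice has
// already resolved (e.g. to /dev/dm-0). If resolution fails, the raw path
// is used as before.
func resolveDevice(path string) string {
	resolved, err := filepath.EvalSymlinks(path)
	if err != nil {
		return path
	}
	return resolved
}

// matchesRealDev reports whether a device from the mount table refers to
// the same underlying device as realDev, either directly or via a symlink.
func matchesRealDev(mountDevice, realDev string) bool {
	return mountDevice == realDev || resolveDevice(mountDevice) == realDev
}

With this, a mount of /dev/mapper/36000c295f967d947c5528549554637c8 is recognized as a mount of /dev/dm-0, so GetDevMounts returns it and PublishMountVol no longer reports the volume as unstaged.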
Which issue this PR fixes:
Testing done:
- Reproduced this issue by enabling multipath on RHEL 9.2 hosts and creating a PVC & Pod using vSphere CSI driver
- Updated CSI driver node daemonset to a build with this proposed fix
- Recreated the failed pod, it attached and mounted the volume successfully
With the fix:
2023-10-06T20:52:18.859Z DEBUG osutils/linux_os_utils.go:418 publishMountVol: device {FullPath:/dev/disk/by-id/wwn-0x6000c295f967d947c5528549554637c8 Name:wwn-0x6000c295f967d947c5528549554637c8 RealDev:/dev/dm-0}, device mounts [{"/dev/mapper/36000c295f967d947c5528549554637c8" "/var/lib/kubelet/plugins/kubernetes.io/csi/csi.vsphere.vmware.com/219f1da44c39eb24d8c03677ee0907cedd00e9a8924a4dd288720a89704e19f7/globalmount" "/dev/mapper/36000c295f967d947c5528549554637c8" "ext4" ["rw" "relatime"]}] {"TraceId": "8252d8a9-d77c-4b1f-87dc-a11122ff78d5"}
Special notes for your reviewer:
/cc @divyenpatel
Release note:
Fix GetDevMounts when device in mount table is a symlink
/hold I'm considering other approaches after experimenting with other CSI drivers
cc: @shalini-b
@dobsonj what are the other approaches you are experimenting with?
Do we have common code available in https://github.com/kubernetes/mount-utils which can be consumed by CSI Drivers that takes care of multi-path configuration?
/ok-to-test
@dobsonj what is the use case you are targeting for enabling device mapper on the node?
Since logical grouping of volumes attached to the node does not happen per pod, but rather for all volumes attached to the node VM (volumes from all pods on the node), is there any benefit to enabling device mapper on the node?
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Mark this PR as fresh with /remove-lifecycle stale
- Close this PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Mark this PR as fresh with /remove-lifecycle rotten
- Close this PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Reopen this PR with /reopen
- Mark this PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closed this PR.
In response to this:
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Reopen this PR with /reopen
- Mark this PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@dobsonj @divyenpatel I'm wondering if there has been any progress on this front? I have a scenario where we may not be able to disable multipathd to get the vSphere CSI driver to work, and this fix seems helpful. I'd be happy to pick up where you left off here if this hasn't already been fixed in some other way.
/reopen
@jrife: You can't reopen an issue/PR unless you authored it or you are a collaborator.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@xing-yang @divyenpatel Could you help reopen this issue? @jrife could help follow up on this.
/remove-lifecycle rotten
@dobsonj @divyenpatel I'm wondering if there has been any progress on this front? I have a scenario where we may not be able to disable multipathd to get the vSphere CSI driver to work, and this fix seems helpful. I'd be happy to pick up where you left off here if this hasn't already been fixed in some other way.
Apologies, I've been meaning to get back to this for quite some time now.
When I first looked into this it appeared to be simply an issue with symlink evaluation in vsphere-csi-driver, but this is a broader issue for other CSI drivers as well. So while this PR does allow the mount to succeed on vsphere when multipath is enabled, I'm not sure it's the right approach.
The same problem exists for AWS EBS when multipath is enabled:
Warning FailedMount 0s kubelet MountVolume.MountDevice failed for volume "pvc-38ee8d9e-f4e9-4f52-a73b-3176e36d8f8c" : rpc error: code = Internal desc = could not format "/dev/nvme1n1" and mount it at "/var/lib/kubelet/plugins/kubernetes.io/csi/ebs.csi.aws.com/6e5575fe544292c1e04ad5e99b185211594db1a051d6e73134b26f6ae91d0723/globalmount": format of disk "/dev/nvme1n1" failed: type:("ext4") target:("/var/lib/kubelet/plugins/kubernetes.io/csi/ebs.csi.aws.com/6e5575fe544292c1e04ad5e99b185211594db1a051d6e73134b26f6ae91d0723/globalmount") options:("defaults") errcode:(exit status 1) output:(mke2fs 1.45.6 (20-Mar-2020)
And for GCP PD:
Warning FailedMount 4s (x8 over 72s) kubelet MountVolume.MountDevice failed for volume "pvc-79fd4955-c783-429b-ba55-c419e9859def" : rpc error: code = Internal desc = Failed to format and mount device from ("/dev/disk/by-id/google-pvc-79fd4955-c783-429b-ba55-c419e9859def") to ("/var/lib/kubelet/plugins/kubernetes.io/csi/pd.csi.storage.gke.io/57c27c6d933c3e4cf2de2d1e66d73045f9da9209fb79f880de7d7712b0ae03f7/globalmount") with fstype ("ext4") and options ([]): format of disk "/dev/disk/by-id/google-pvc-79fd4955-c783-429b-ba55-c419e9859def" failed: type:("ext4") target:("/var/lib/kubelet/plugins/kubernetes.io/csi/pd.csi.storage.gke.io/57c27c6d933c3e4cf2de2d1e66d73045f9da9209fb79f880de7d7712b0ae03f7/globalmount") options:("defaults") errcode:(exit status 1) output:(mke2fs 1.45.6 (20-Mar-2020)
So aws-ebs and gcp-pd both fail early, during NodeStageVolume, because the disk is already in use (by multipathd).
err = d.mounter.FormatAndMountSensitiveWithFormatOptions(source, target, fsType, mountOptions, nil, formatOptions)
However, vsphere uses a different library (akutz/gofsutil) that does not fail in NodeStageVolume:
err := gofsutil.FormatAndMount(ctx, dev.FullPath, params.StagingTarget, params.FsType, params.MntFlags...)
So the vsphere driver proceeds to format the device and then fails later in NodePublishVolume because the symlink check did not find the device in the mount table, and assumes the device was not staged after all.
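For illustration, the two code paths look roughly like this (a simplified sketch; the wrapper functions stageWithGofsutil and stageWithMountUtils and the mounter setup are mine, not the drivers' actual code):

package example

import (
	"context"

	"github.com/akutz/gofsutil"
	mount "k8s.io/mount-utils"
	utilexec "k8s.io/utils/exec"
)

// vsphere path: gofsutil.FormatAndMount does not fail at this stage, so the
// device gets formatted and mounted, and the error only surfaces later in
// NodePublishVolume when the mount-table check misses the device.
func stageWithGofsutil(ctx context.Context, devPath, target, fsType string, flags ...string) error {
	return gofsutil.FormatAndMount(ctx, devPath, target, fsType, flags...)
}

// aws-ebs / gcp-pd path: mount-utils' SafeFormatAndMount shells out to mkfs,
// which fails when it considers the device already in use (as in the aws-ebs
// and gcp-pd events above), so NodeStageVolume errors out early.
func stageWithMountUtils(devPath, target, fsType string, options []string) error {
	mounter := &mount.SafeFormatAndMount{
		Interface: mount.New(""),
		Exec:      utilexec.New(),
	}
	return mounter.FormatAndMount(devPath, target, fsType, options)
}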
We should have a consistent solution for all CSI drivers. I worry about the consequences of formatting a device that multipathd has claimed... which implies that the correct solution might instead be to fail in NodeStageVolume as we have in aws-ebs and gcp-pd today. But that leaves us with no solution when multipathd is enabled with an empty config (which is required by some third-party drivers).
the correct solution might instead be to fail in NodeStageVolume as we have in aws-ebs and gcp-pd today
We discussed this in the CSI meeting yesterday and agreed I should rework my PR so that the vsphere driver fails to format devices owned by multipathd in NodeStageVolume, same as aws-ebs and gcp-pd. It can be dangerous and should be prevented.
In essence, multipathd must be configured correctly on the node. If it's not, and it takes over block devices used by the CSI driver, then the CSI driver can't touch those devices.
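A rough sketch of what such a check could look like (the sysfs heuristic and the helper name hasDeviceMapperHolder are my own illustration, not what was agreed in the meeting): if a block device already has a device-mapper holder, treat it as claimed and fail NodeStageVolume instead of formatting it.

package osutils

import (
	"os"
	"path/filepath"
	"strings"
)

// hasDeviceMapperHolder reports whether a block device (e.g. "sdb") is held
// by a device-mapper target such as a multipath map, by looking for dm-*
// entries under /sys/block/<dev>/holders.
func hasDeviceMapperHolder(devName string) (bool, error) {
	entries, err := os.ReadDir(filepath.Join("/sys/block", devName, "holders"))
	if err != nil {
		return false, err
	}
	for _, e := range entries {
		if strings.HasPrefix(e.Name(), "dm-") {
			return true, nil
		}
	}
	return false, nil
}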
@dobsonj Thanks for looking into this.
the correct solution might instead be to fail in NodeStageVolume as we have in aws-ebs and gcp-pd today
We discussed this in the CSI meeting yesterday and agreed I should rework my PR so that the vsphere driver fails to format devices owned by multipathd in NodeStageVolume, same as aws-ebs and gcp-pd. It can be dangerous and should be prevented.
For my scenario I'm afraid this would be a step in the wrong direction.
In essence, multipathd must be configured correctly on the node.
This may be easier said than done. For some context, the scenario I am trying to solve for involves having the vSphere CSI driver installed alongside the NetApp CSI driver. NetApp Trident's iSCSI driver requires a certain multipathd configuration on each of your nodes:
defaults {
user_friendly_names yes
find_multipaths no
}
This will by default cause multipathd to "claim" any block device you attach to your node, including those provisioned by the vSphere CSI driver. In essence, NetApp's recommended configuration is in direct conflict with proper operation of the vSphere CSI driver. You can blacklist certain devices so that multipathd doesn't touch them,
blacklist {
device {
vendor "IBM"
product "3S42" #DS4200 Product 10
}
device {
vendor "HP"
product "*"
}
devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
devnode "^(td|ha)d[a-z]"
}
but this is a bit fiddly and something I'd like to avoid having to ask our users to do. It's not yet clear how easy it is to craft a one-size-fits-all config here that ignores any vSphere CSI volumes and selects any volumes NetApp is interested in.
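One rough idea I have not validated (so please treat the vendor/product strings below as an assumption) would be to blacklist vSphere virtual disks by their SCSI inquiry strings, so multipathd only claims the NetApp LUNs:
blacklist {
    device {
        vendor "VMware"
        product "Virtual disk"
    }
}
Even if that works, it is still an extra per-node configuration step I would rather not have to ask users for.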
If it's not, and it takes over block devices used by the CSI driver, then the CSI driver can't touch those devices.
I'm skeptical of the actual risk here. I agree in principle, but practically speaking in the case of the vSphere CSI driver we're formatting, attaching, mounting, etc. devices we know to be under the control of the vSphere CSI driver, not random devices on the system.
My hope would be that we could make this behavior configurable. By default, maybe it's safer to refuse to touch any devices claimed by multipathd, but in scenarios like mine where I'm confident that there is no risk, I should be able to opt into it (e.g. --resolve-device-symlinks=(true|false)).
Bump @xing-yang @dobsonj
Hey @jrife, I rebased and tested a new commit to use the FormatAndMount function from mountutils, same as the other drivers, and it did not fail in NodeStageVolume. On gcp-pd, for example, the mount command itself fails:
/dev/disk/by-id/google-pvc-79fd4955-c783-429b-ba55-c419e9859def is apparently in use by the system; will not make a filesystem here!
This difference between platforms is puzzling to me, but we don't necessarily need to solve that here. As far as I can tell, the symlink evaluation in GetDevMounts is safe and solves the core issue. If the second commit (to use mountutils) is controversial, I can drop it, as it has no functional impact.
This difference between platforms is puzzling to me, but we don't necessarily need to solve that here. As far as I can tell, the symlink evaluation in GetDevMounts is safe and solves the core issue. If the second commit (to use mountutils) is controversial, I can drop it, as it has no functional impact.
Hi @dobsonj. Thanks for following up. This is interesting. Do you mean that FormatAndMount works fine on vSphere disks even when the devices are claimed by multipathd?
Hi @dobsonj. Thanks for following up. This is interesting. Do you mean that FormatAndMount works fine on vSphere disks even when the devices are claimed by multipathd?
Yes, that's correct. From the logs with this change:
I0430 23:08:06.718501 1 mount_linux.go:517] Disk "/dev/disk/by-id/wwn-0x6000c29fd7f46ef9daf1ee03307dc9d6" appears to be unformatted, attempting to format as type: "ext4" with options: [-F -m0 /dev/disk/by-id/wwn-0x6000c29fd7f46ef9daf1ee03307dc9d6]
I0430 23:08:07.806132 1 mount_linux.go:528] Disk successfully formatted (mkfs): ext4 - /dev/disk/by-id/wwn-0x6000c29fd7f46ef9daf1ee03307dc9d6 /var/lib/kubelet/plugins/kubernetes.io/csi/csi.vsphere.vmware.com/76bd659ef256a6255f2ec1f4728dba12451ab11d022e24398d94c2c8948a270a/globalmount
One notable difference from that log output is that the vsphere driver formats and mounts /dev/disk/by-id/wwn-0x6000c29fd7f46ef9daf1ee03307dc9d6, which is symlinked to /dev/dm-0, and it succeeds. In contrast, aws-ebs uses /dev/nvme1n1 and gcp-pd uses /dev/disk/by-id/google-pvc-....
/unhold
@dobsonj OK, that works for me then. I have no strong opinion on the second commit here. I'll let @xing-yang and @divyenpatel chime in on the rest.
I went ahead and dropped the second commit ( https://github.com/kubernetes-sigs/vsphere-csi-driver/commit/57f8773014c5947dcce1c1040bec90965ee2791a ) since it doesn't affect this issue; it just changes how the driver calls mkfs.
IMO, we can move forward with the GetDevMounts fix. /cc @xing-yang @divyenpatel
cc @divyenpatel
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: divyenpatel, dobsonj
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [divyenpatel]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Running pre-checkin pipelines before merging this PR.
Started Vanilla file pre-checkin pipeline... Build Number: 829
Build ID: 829
File vanilla build status: FAILURE
Stage before exit: e2e-tests
Jenkins E2E Test Results:
FAIL! -- 26 Passed | 6 Failed | 0 Pending | 816 Skipped
--- FAIL: TestE2E (4977.77s)
FAIL
Ginkgo ran 1 suite in 1h24m17.564314192s
Test Suite Failed
make: Leaving directory `/home/worker/workspace/csi-file-vanilla-precheckin/829/vsphere-csi-driver`