
Fix for hotplug issues

Open ixhamza opened this issue 3 years ago • 18 comments

Motivation and Context

Fix for hotplug issues

Description

ZED relies on udev to match vdev guids when a device is removed. However, udev does not hold correct blkid information for the vdev, so the vdev fails to match when detached and we have to rely on the fault handler to make the device unavailable. This PR triggers a disk change event whenever a new vdev is added, to sync the blkid information with udev. It also changes the device state to REMOVED, instead of UNAVAIL, whenever the device is unplugged.

How Has This Been Tested?

By hotplugging vdevs through QEMU
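The PR does not include the test commands themselves, but a hedged sketch of one way to hot-unplug a disk from a running QEMU guest over QMP might look like this (the socket path and the device id `drv0` are assumptions for illustration, not taken from the PR):

```shell
# Sketch only: hot-unplug a virtio disk via QEMU's QMP monitor.
# Assumes QEMU was started with: -qmp unix:/tmp/qmp.sock,server,nowait
# and the disk was attached with:  -device virtio-blk-pci,drive=...,id=drv0
QMP=/tmp/qmp.sock
if [ -S "$QMP" ] && command -v socat >/dev/null 2>&1; then
  printf '%s\n' \
    '{"execute":"qmp_capabilities"}' \
    '{"execute":"device_del","arguments":{"id":"drv0"}}' \
    | socat - "UNIX-CONNECT:$QMP"
else
  echo "no QMP socket at $QMP (or socat missing); skipping"
fi
```

Re-adding the disk with `device_add` afterwards exercises the udev add/change path the PR fixes.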

Types of changes

  • [x] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Performance enhancement (non-breaking change which improves efficiency)
  • [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [x] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • [ ] Documentation (a change to man pages or other documentation)

Checklist:

  • [x] My code follows the OpenZFS code style requirements.
  • [ ] I have updated the documentation accordingly.
  • [x] I have read the contributing document.
  • [ ] I have added tests to cover my changes.
  • [ ] I have run the ZFS Test Suite with this change applied.
  • [x] All commit messages are properly formatted and contain Signed-off-by.

ixhamza avatar Aug 24 '22 18:08 ixhamza

cc: @amotin, @freqlabs

ixhamza avatar Aug 24 '22 18:08 ixhamza

For more context:

The first part of this change deals with the fact that udev caches blkid information, which includes fields parsed from vdev labels:

# blkid --probe /dev/sde2
/dev/sde2: VERSION="5000" LABEL="storage" UUID="5081467553143690352" UUID_SUB="8098199377452528262" BLOCK_SIZE="4096" TYPE="zfs_member" USAGE="filesystem" PART_ENTRY_SCHEME="gpt" PART_ENTRY_UUID="550c557e-c90e-11e9-bd70-ac1f6b0adb04" PART_ENTRY_TYPE="516e7cba-6ecf-11d6-8ff8-00022d09712b" PART_ENTRY_NUMBER="2" PART_ENTRY_OFFSET="4194432" PART_ENTRY_SIZE="1949330696" PART_ENTRY_DISK="8:64"
# udevadm info /dev/sde2 | grep ID_FS_
E: ID_FS_VERSION=5000
E: ID_FS_LABEL=storage
E: ID_FS_LABEL_ENC=storage
E: ID_FS_UUID=5081467553143690352
E: ID_FS_UUID_ENC=5081467553143690352
E: ID_FS_UUID_SUB=8098199377452528262
E: ID_FS_UUID_SUB_ENC=8098199377452528262
E: ID_FS_TYPE=zfs_member
E: ID_FS_USAGE=filesystem

The LABEL field is the pool name, UUID is the pool guid, and UUID_SUB is the vdev guid, among other fields. udev needs a change event on the block device to invalidate its cache and re-probe whenever we update these fields in a vdev label, for example when we create, reguid, or destroy a pool, or add, replace, or remove vdevs. Otherwise udev reports stale guids to zed, and zed ignores events for the device.
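As a hedged illustration (the device path is the example from the output above; substitute your own vdev), the re-probe that such a change event causes can be reproduced manually with udevadm:

```shell
# Sketch: manually emit the change event that this PR makes ZFS generate,
# forcing udev to re-run blkid and refresh its cached ID_FS_* properties.
DEV=/dev/sde2
if [ -b "$DEV" ]; then
  udevadm trigger --action=change "$DEV"   # synthesize a change uevent
  udevadm settle                           # wait for udev rules to finish
  udevadm info "$DEV" | grep ID_FS_        # should now match blkid --probe
else
  echo "$DEV is not a block device here; skipping"
fi
```

If the ID_FS_UUID/ID_FS_UUID_SUB values printed here disagree with `blkid --probe`, udev's cache was stale, which is exactly the situation the change event avoids.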

The second part of this change deals with device removal. There exists VDEV_STATE_REMOVED for a device that has been detached from the system. Both FreeBSD and Linux already use this state when an I/O failure indicates the device is gone. On FreeBSD there is also a callback in the kernel when a device is removed, which we use to update the state accordingly without having to wait for I/O to fail. There is no such mechanism on Linux, so we have to rely on udev/zed. There is currently no way to request VDEV_STATE_REMOVED from userland, so it has been added to libzfs using the existing ZFS_IOC_VDEV_SET_STATE ioctl, and zed now uses this to mark the vdev as removed when it receives the removal event from udev.

ghost avatar Aug 24 '22 20:08 ghost

Couple of thoughts before I take a look at the code:

We've debated in the past whether or not to remove a vdev from a pool on a udev remove event. On the surface it seems perfectly logical. udev tells you the disk has been removed, so remove it from the pool. Makes sense.

However, at very large scale (1000s of disks), you often see odd disk behavior that you have to account for. Like, multiple NVMe drives, on multiple nodes, all simultaneously disappearing and re-appearing within a few seconds of each other. I've seen SAS disks disappear/re-appear as well. And as long as the disks come back within the IO timeout amount of time, and are healthy after they come back, then the user wouldn't notice anything (maybe even more so if you're running Lustre on top of ZFS, with its own timeouts/retries on top of it). Just to quantify this strangeness a little: I'm seeing over 1000 zed "EC_dev_remove" events across all our disks in the last month (of which only a handful are actual disk replacements). Had this "remove vdev on udev remove" behavior been enabled, we would have seen tons of pools fault at the same time, causing mass downtime and admin intervention. Since we didn't have it enabled, there were no issues; all the IOs rode through the intermittent period of disks disappearing/re-appearing.

In short, I think you should make this configurable as a new autoremove pool property that is off by default. We already have an autoreplace property, so it makes sense to have an autoremove property too.

Also, I'm not in favor of the UNAVAIL -> REMOVED rename. "UNAVAIL" already has a lot of historical weight (just google "ZFS unavail"), and it might break scripts if we rename it. UNAVAIL also implies that the drive could come back to life later, which would certainly be accurate in the case of our disappearing/re-appearing drives.

tonyhutter avatar Aug 26 '22 00:08 tonyhutter

@tonyhutter From that perspective I think it would be reasonable for ZED to simply avoid quick extra movements, like kicking in spares, for a few minutes after a disk disappears, in case it returns. A rebuild will likely take hours anyway, so a few minutes do not matter much. Pardon me if that logic is already there. But if a disk has really disappeared at the block level, it won't handle any new I/O, so hiding that from ZFS is not very productive. On the contrary, we have seen that disk references held by ZFS cause actual problems for NVMe and surrounding subsystems, which can't free their resources.

We also have a huge fleet of ZFS systems, with maybe not multiple thousands but up to a thousand disks each, just based on FreeBSD. On FreeBSD a disk's disappearance is reported to ZFS immediately, straight from the kernel, without any additional daemon participation; the daemon handles only replacement. And we feel good enough about that.

amotin avatar Aug 26 '22 01:08 amotin

We found more issues with removal detection that need to be fixed.

ghost avatar Aug 29 '22 14:08 ghost

@amotin

From that perspective I think it would be reasonable for ZED to simply avoid quick extra movements, like kicking in spares, for a few minutes after a disk disappears, in case it returns.

Yea, something like an "autoremove after N seconds" timer would be good. We would probably want the autoremove timer to be longer than the IO timeout (plus the IO retry). The autoremove timer could be added in a future PR.

tonyhutter avatar Aug 29 '22 17:08 tonyhutter

Physically removed vdevs are already supposed to transition to VDEV_STATE_REMOVED on IO errors even on Linux, but it doesn't work for several reasons.

zfs_check_media_change() for kernels 5.10+ has a bug: it always returns 0.

Beyond that, the bdev_check_media_change() KPI it uses doesn't work to check for a hot-plugged device. For NVMe, the check_events member of struct block_device_operations isn't implemented, so it can't work at all. For the drivers that do implement it, it can only work if the device identifies as removable (/sys/block/*/removable). For example, in a few systems I have handy, a USB flash drive is removable, but a SATA disk in a hot-swap enclosure attached to a SAS controller is not.

This would affect older kernels with check_media_change() as well. So in most cases the check to see if a device was removed just isn't working as intended.

UPDATE: even SAS disks report 0 for removable. This ever working would be very rare indeed.
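For reference, the flag being discussed can be inspected directly from sysfs; this small illustrative loop prints each block device's removable flag:

```shell
# Print the kernel's removable flag for every block device
# (1 = removable, 0 = fixed). Hot-swap SAS/SATA bays typically
# report 0 despite being physically removable, which is why this
# check is unreliable for hotplug detection.
echo "removable flags:"
for f in /sys/block/*/removable; do
  [ -e "$f" ] || continue
  printf '%s: %s\n' "$(basename "$(dirname "$f")")" "$(cat "$f")"
done
```

A USB flash drive will usually show 1 here, while most SAS/SATA disks in hot-swap enclosures show 0, matching the observation above.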

ghost avatar Aug 29 '22 21:08 ghost

@freqlabs I just added an updated version and split my work into multiple commits. cc: @amotin

ixhamza avatar Sep 01 '22 13:09 ixhamza

I tested this a little using a draid1:4d:9c:1s pool of 9 NVMe drives. I tested powering off/on a NVMe drive using /sys/bus/pci/slots/<slot>/power and all seemed to work as expected. I plan on doing some more real world testing next week with spinning disks in a SAS enclosure.
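The power-cycle test above can be scripted roughly like this (the slot number and sleep duration are illustrative, and `zpool status` of course requires a pool to be imported):

```shell
# Sketch: power-cycle an NVMe drive through its PCI hotplug slot and
# observe the pool state. Slot 0 is just an example; pick the slot
# backing the vdev under test.
SLOT=/sys/bus/pci/slots/0
if [ -w "$SLOT/power" ]; then
  echo 0 > "$SLOT/power"   # power the slot off (device disappears)
  sleep 5
  zpool status             # the vdev should now show REMOVED
  echo 1 > "$SLOT/power"   # power the slot back on
else
  echo "slot $SLOT does not support hotplug power control; skipping"
fi
```
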

tonyhutter avatar Sep 09 '22 00:09 tonyhutter

I just did another test today using multipathed spinning disks in a SAS enclosure. I first created a three-disk mirror pool, then removed a disk while the pool was idle. When I removed the disk, the pool status did not change, which is the expected behavior considering these are multipathed disks, where the DM device doesn't necessarily go away or post a remove event when you pull out a disk. So that's good. I then wrote to the pool, and the pool correctly reported the disk as UNAVAIL and reported IO errors.

tonyhutter avatar Sep 14 '22 23:09 tonyhutter

Next to test:

  1. Remove log device
  2. Remove cache device
  3. Remove spare
  4. Remove file-backed vdev
  5. Remove faulted vdev

tonyhutter avatar Sep 15 '22 01:09 tonyhutter

Further testing on a VM shows:

  1. Regular vdevs (faulted or not) are correctly marked as REMOVED on removal.
  2. Log devices (faulted or not) are correctly marked as REMOVED on removal.
  3. Spare status does not change on removal.
  4. Cache device status does not change on removal.
  5. File-backed vdev status does not change on removal.

I don't know if 5 can ever be fixed, since you won't see a udev event for the file's deletion. 3 and 4 should be made to work, though.

tonyhutter avatar Sep 16 '22 21:09 tonyhutter

@tonyhutter thank you so much for testing these patches. Since spare vdevs do not have a label, I think it would require some extra work to support hotplugging for spares, maybe in a future PR. I tested hotplug for a cache vdev on QEMU with an emulated NVMe device, and it was working during my testing, but maybe I missed something. Can you please share the details of your test for cache vdevs?

ixhamza avatar Sep 16 '22 21:09 ixhamza

Can you please share details of your test for cache vdevs?

I created two virtual disks in virt-manager, made a pool, and removed the one that was the cache:

$ sudo ./zpool status
  pool: tank
 state: ONLINE
config:

	NAME        STATE     READ WRITE CKSUM
	tank        ONLINE       0     0     0
	  vde       ONLINE       0     0     0
	cache
	  vdf       ONLINE       0     0     0

errors: No known data errors

# Verify cache device exists before removal
$ lsblk /dev/vdf
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
vdf    252:80   0    1G  0 disk 
├─vdf1 252:81   0 1014M  0 part 
└─vdf9 252:89   0    8M  0 part 

# Remove cache device in virt-manager

# cache device has been removed
$ lsblk /dev/vdf
lsblk: /dev/vdf: not a block device

$ sudo ./zpool status
  pool: tank
 state: ONLINE
config:

	NAME        STATE     READ WRITE CKSUM
	tank        ONLINE       0     0     0
	  vde       ONLINE       0     0     0
	cache
	  vdf       ONLINE       0     0     0

errors: No known data errors

tonyhutter avatar Sep 16 '22 22:09 tonyhutter

@tonyhutter Thank you for your description. I debugged the issue and it seems to be related to virtio disks. A vdev should have a devid (/dev/disk/by-id) so it can be identified during removal. However, by default virtio disks do not get a devid unless the serial attribute is set for the disk (https://github.com/systemd/systemd/issues/17670). Even if we provide the serial attribute for virtio disks, zed won't be able to find the devid, since there is no handling for virtio disks in zfs_device_get_devid(). If you test this on real hardware or with emulated disks, it should not be a problem. Normal vdevs are not affected by this, since they are identified by pool_guid and vdev_guid. Maybe I can dig more into it in the future. @freqlabs
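For anyone reproducing this on QEMU/libvirt, a hedged sketch of giving a virtio disk a serial so udev can create by-id links (the drive id and serial string are made-up examples):

```shell
# Sketch: virtio disks get /dev/disk/by-id links only when given a serial.
# Example QEMU flags (illustrative; ids and serial are arbitrary):
#   -drive file=disk.img,if=none,id=drv0,format=raw
#   -device virtio-blk-pci,drive=drv0,serial=ZFSTEST01
# With libvirt, the equivalent is a <serial> element inside the <disk> XML.
# Inside the guest, verify that the link was created:
ls /dev/disk/by-id/ 2>/dev/null | grep -i virtio \
  || echo "no virtio by-id links found"
```

Note that per the comment above, even with a serial present, zfs_device_get_devid() would still need virtio handling for zed to use the devid.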

ixhamza avatar Sep 18 '22 18:09 ixhamza

@ixhamza - I went back to testing on real-world NVMe devices and have some new results. Good news is that the cache device now correctly shows REMOVED on removal:

# zpool status
  pool: tank
 state: ONLINE
config:

	NAME         STATE     READ WRITE CKSUM
	tank         ONLINE       0     0     0
	  mirror-0   ONLINE       0     0     0
	    nvme2n1  ONLINE       0     0     0
	    nvme3n1  ONLINE       0     0     0
	logs	
	  nvme5n1    ONLINE       0     0     0
	cache
	  nvme0n1    ONLINE       0     0     0
	spares
	  nvme1n1    AVAIL   

# echo 0 > /sys/bus/pci/slots/0/power 
# zpool status
  pool: tank
 state: ONLINE
status: One or more devices has been removed by the administrator.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Online the device using 'zpool online' or replace the device with
	'zpool replace'.
config:

	NAME         STATE     READ WRITE CKSUM
	tank         ONLINE       0     0     0
	  mirror-0   ONLINE       0     0     0
	    nvme2n1  ONLINE       0     0     0
	    nvme3n1  ONLINE       0     0     0
	logs	
	  nvme5n1    ONLINE       0     0     0
	cache
	  nvme0n1    REMOVED      0     0     0
	spares
	  nvme1n1    AVAIL   

The bad news is that the spares do not show REMOVED:

# echo 0 > /sys/bus/pci/slots/1/power 
# zpool status
  pool: tank
 state: ONLINE
status: One or more devices has been removed by the administrator.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Online the device using 'zpool online' or replace the device with
	'zpool replace'.
config:

	NAME         STATE     READ WRITE CKSUM
	tank         ONLINE       0     0     0
	  mirror-0   ONLINE       0     0     0
	    nvme2n1  ONLINE       0     0     0
	    nvme3n1  ONLINE       0     0     0
	logs	
	  nvme5n1    ONLINE       0     0     0
	cache
	  nvme0n1    REMOVED      0     0     0
	spares
	  nvme1n1    AVAIL   

errors: No known data errors

# file /dev/nvme1n1
/dev/nvme1n1: cannot open `/dev/nvme1n1' (No such file or directory)

I also verified that in both removal cases (the cache and the spare), udev remove events were generated. So ZED should be able to key off those removal events for spares as well, although it might require some additional code.
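One way to confirm those remove events by hand (illustrative only; the monitor runs for a few seconds) is to watch the block subsystem with udevadm while pulling the disk:

```shell
# Sketch: watch for the udev remove events zed consumes; pull the disk
# while this runs and a "remove" line for its device node should appear.
if command -v udevadm >/dev/null 2>&1; then
  timeout 5 udevadm monitor --udev --subsystem-match=block || true
else
  echo "udevadm not available on this system"
fi
```
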

tonyhutter avatar Sep 20 '22 18:09 tonyhutter

@tonyhutter you are right, hotplugging of spares is indeed not supported in this patch, since a spare does not contain the label we use to identify other vdevs. @freqlabs - given our earlier discussion, do you think this is something we could cover in future work? It does not seem very straightforward to handle in this PR.

ixhamza avatar Sep 20 '22 18:09 ixhamza

Yes, we can address handling spares next.

ghost avatar Sep 20 '22 18:09 ghost

@behlendorf, @tonyhutter it would be appreciated if you could provide feedback or tag the relevant reviewers here. These patches are critical for our TrueNAS BETA release. @amotin, @freqlabs

ixhamza avatar Sep 26 '22 18:09 ixhamza

@ixhamza apologies for the delay, I'll take a look right now

tonyhutter avatar Sep 26 '22 19:09 tonyhutter

@ixhamza it looks good - I really don't have any more comments. I would recommend you squash your commits and use this (slightly modified) commit message:

zed: mark disks as REMOVED when they are removed

ZED does not take any action for disk removal events if there is no
spare VDEV available. Added zpool_vdev_remove_wanted() in libzfs
and vdev_remove_wanted() in vdev.c to remove the VDEV through ZED
on removal event.  This means that if you are running zed and
remove a disk, it will be properly marked as REMOVED.

tonyhutter avatar Sep 26 '22 20:09 tonyhutter

@tonyhutter Thank you so much for your review and for verifying it on your hardware. I have incorporated your feedback into this PR.

ixhamza avatar Sep 26 '22 23:09 ixhamza

I've offered a few more suggestions in private for Ameer to incorporate, including fixing a nearby memory leak when "remove" events are ignored.

ghost avatar Sep 27 '22 20:09 ghost

Thank you @freqlabs. I have incorporated your suggested changes.

ixhamza avatar Sep 27 '22 21:09 ixhamza