zpool commands block when a disk goes missing / pool suspends

Open gordan-bobic opened this issue 10 years ago • 47 comments

It would appear that pulling a disk from a single-disk pool causes ZoL to get into a state where all zpool commands (e.g. zpool list) block. sync also blocks indefinitely. Both become uninterruptible (kill -9 doesn't work).

There are no errors in dmesg other than the disk getting disconnected (I removed it) and:

WARNING: Pool 'poolname' has encountered an uncorrectable I/O failure and has been suspended.

There need to be timeouts, and the failure handling for this scenario needs to be more graceful than requiring a reboot of the machine.

gordan-bobic avatar May 30 '15 07:05 gordan-bobic

This isn't a case of a timeout. ZFS knows the disk is gone. This is a deliberate choice by the ZFS developers. If your pool can't survive due to redundancy failures, it enters a suspended state to give the administrator a chance to fix it while dirty data, etc. are still in RAM. zpool set failmode=continue poolname allows some operations to fail with IO errors rather than jumping to suspended mode ASAP, but some actions will still suspend the pool.

What might make sense is modifying some zpool tools to be aware of suspended pools and avoid querying data that may require disk IO to find.

Here's a sample pool I faulted using blkdiscard on an SSD with failmode=continue:

  pool: testpool
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://zfsonlinux.org/msg/ZFS-8000-JQ
  scan: none requested
config:
    NAME        STATE     READ WRITE CKSUM
    testpool    UNAVAIL      0     0     4  insufficient replicas
      vdb1      UNAVAIL      0     0    16  corrupted data
# zpool list
NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
testpool  49.8G  10.1M  49.7G         -     0%     0%  1.00x  UNAVAIL  -
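
For reference, a minimal sketch of how a similar failure could be reproduced on a disposable scratch device that supports discard (the device name /dev/vdb1 and pool name testpool are just example placeholders; everything on that device is destroyed):

zpool create testpool /dev/vdb1       # disposable scratch device only
zpool set failmode=continue testpool
blkdiscard /dev/vdb1                  # wipe the vdev out from under the pool
zpool scrub testpool                  # force IO so the damage is noticed
zpool status testpool                 # should now report UNAVAIL / corrupted data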

DeHackEd avatar May 30 '15 12:05 DeHackEd

So why do sync and zpool list also hang forever? There are other pools in the machine, and it is a ZFS-only machine. Hanging until the machine is rebooted just seems downright silly. For this behaviour to make any sense there has to be a way to either tell ZFS to give up on a pool or to unsuspend the pool.

gordan-bobic avatar May 30 '15 12:05 gordan-bobic

This is intended behaviour to safeguard your dirty data and allow an administrator the chance to fix a broken machine while hot data is still available. The sync can't complete because there aren't enough disks to write out the transaction. zpool list hanging could be interpreted as a bug: unknown field values could be replaced with a dash, displaying only the pool name and health (FAULTED, UNAVAIL, etc.).

If you can fix the problem, ZFS will pick up where it left off and unsuspend.

DeHackEd avatar May 30 '15 12:05 DeHackEd

Except that once it all hangs, it is no longer possible at that point to run zpool set failmode=continue poolname.

There really needs to be a way for the administrator to say "fine, go fail" without requiring a complete reboot - especially when that reboot actually requires a hard reset, because the shutdown sequence will also hang when trying to cleanly export the pools.

gordan-bobic avatar May 30 '15 13:05 gordan-bobic

@gordan-bobic the situation should be improved somewhat in the 0.6.4.1 tag. At a minimum, commands which do not require disk IO should be allowed to complete.

For example, commands like zpool list should not have been impacted, but were because parts of the output required accessing the disk - the very disk which was no longer available. Specifically, this was a regression which crept in with feature flag support; see 417104bdd3c7ce07ec58674dd078f9891c3bc780 for details. However, zfs list may still block, because some dataset properties may need to be read from disk.

The fact that you can't always CTRL-C the command definitely does sound like a bug.

behlendorf avatar Jun 01 '15 19:06 behlendorf

@behlendorf imho the fact that zfs list can block is a bug: it is read-only, and fundamental to zfs.

Data that is needed for zfs list should always be in ARC and stay there until the pool is exported.

Ideally, the metadata needed for this would be loaded on pool import and never be released (at least an option to configure such behaviour should be available).

More on topic: blocking operations that require r/w access to a suspended pool (apart from listing it) seems reasonable; nevertheless, they should always be abortable by a signal and clean up correctly so that they don't leave dangling locks. Blocking operations on healthy pools just because another pool in the system failed, or reaching a system state that can only be cured by a reboot, is imho a bug, so there should be a way to cleanly (and completely, so that a reimport would be possible) remove a suspended pool from the system.

GregorKopka avatar Jun 02 '15 12:06 GregorKopka

@GregorKopka regarding zfs list, that's potentially a lot of data. There could be hundreds of thousands of datasets once you start including snapshots. Plus you'd need to store the properties for all of them, which has the potential to consume a significant chunk of memory. I could see this potentially being a configurable thing.

On the other points I generally agree. But the devil's always in the details for these things, so someone will need to investigate why it is the way it is.

behlendorf avatar Jun 05 '15 16:06 behlendorf

I confirm: when a USB device goes missing (due to a hard disconnection), services freeze, the zfs and zpool commands are no longer usable (with the Antergos zfs packages, but I think with others too), and CTRL+C or CTRL+Z do not respond. Ugly... a reboot is needed. At this time, and because of this, ZFS is not a stable/robust filesystem and cannot safely be used with an external hard drive.

jerome-diver avatar Apr 28 '16 10:04 jerome-diver

duplicate of #3256 ?

mailinglists35 avatar Jun 18 '16 01:06 mailinglists35

Not really; this issue is specifically about a disk being removed (without that being intended). The ZFS dev team needs to take into account the physical reality that hardware can fail or be removed in this real, factual world; these are real situations that happen to everybody. Closing this and pointing at the other issue is like trying not to see the origin of the problem. A disk that has been disconnected should not become a crucial problem that can cause all your data to be lost, because the world is not perfect and ZFS has to work in this real world (and sure, it is not so simple to look at it this way, because reality is not a theoretical thing). Also, many other filesystems take care of this reality. I think this issue should be closed when ZFS is able to be stable and robust in these real-world situations, specifically when a drive has been hard-disconnected.

jerome-diver avatar Jul 06 '16 06:07 jerome-diver

@jerome-diver you can set the failmode=continue zpool property to prevent the pool from suspending when the drive is hard removed from a non-redundant configuration. This will result in errors to the applications but it should not hang the system. This behavior is similar to what you'd get from other filesystems.
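
A minimal sketch of setting and checking the property, with tank as a placeholder pool name:

zpool set failmode=continue tank
zpool get failmode tank     # default is 'wait', which suspends the pool on unrecoverable IO failure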

There is some related work under way to better detect when a drive has been removed and, if it is re-added to the system, what the new device name is.

behlendorf avatar Jul 11 '16 20:07 behlendorf

@DeHackEd

This is intended behaviour to safeguard your dirty data and allow an administrator the chance to fix a broken machine while hot data is still available.

How do you fix the broken machine when all you want is to force export/clear the pool that has experienced the errors, without rebooting the machine and without affecting other pools? You can't even rmmod -f the zfs modules; once a pool goes into the suspended state, it never gets out of it unless you reboot the machine.

@behlendorf Jun 5, 2015

But the devils always in the details for these things so someone will need to investigate why it is the way it is.

It's been a year and a half; is there any progress on someone investigating? Is there any hope that ZFS will soon allow exporting/clearing/removing from memory a pool that is in the suspended state (issue https://github.com/zfsonlinux/zfs/issues/5242)?

@behlendorf

you can set the failmode=continue zpool property to prevent the pool from suspending when the drive is hard removed from a non-redundant configuration.

I have already set failmode=continue on a single-disk pool over iSCSI, and yet the pool is hung and the only way to get out of this is a reboot, even after the drive came back online; see https://github.com/zfsonlinux/zfs/issues/3256. Additionally, the zpool status message shows this URL, which returns a 404: http://zfsonlinux.org/msg/ZFS-8000-JQ.

This behavior is similar to what you'd get from other filesystems.

No, it's not similar. On ext4 I can unmount the affected device, fsck it, and then remount the filesystem once the device is reconnected to the system. On ZFS, the pool remains hung forever.

There is some related work under way to better be able to detect when a drive was removed and if it's readded to the system what the new device name is.

Are you referring to https://github.com/zfsonlinux/zfs/pull/5343? If yes, will this allow unsuspending pools if the disk comes back online?

mailinglists35 avatar Dec 15 '16 11:12 mailinglists35

There really needs to be a way to instruct ZFS to throw away any and all dirty data and forget that the pool was ever there, without rebooting the machine. Leaving a pool in a hung state with the disk removed is of no practical use. If there is a risk of trashing the pool, so be it, but that risk doesn't seem any different from what happens if you reboot the machine, which is currently the only option anyway.

gordan-bobic avatar Dec 15 '16 12:12 gordan-bobic

This is a serious problem here, where we operate on the same machine both its normal pools (on internal SATA drives) and backup/archival pools which reside on external HDDs connected to the machine via USB3. Every time we have a USB3 connection "flicker" (which is frequent, as the connectors are not as reliable as we would like), its pool is suspended, all other pools start exhibiting the "hanging pool commands" syndrome, and the only way to fix it is to reboot the machine, interrupting everything else that was being done on it. To live with ZFS's current way of handling this, we are forced to allocate an entire machine to the sole purpose of connecting USB-based pools to it, and then to access these pools over the network from other machines, which is much slower and generally inefficient.

DurvalMenezes avatar Apr 07 '18 14:04 DurvalMenezes

I ran into the same problem tonight on an Ubuntu 16.04 machine. I'm running a backup script to send/receive snapshots to single backup disks. Because of a typo in the backup script, the command zpool export backup01 did not run before echo 1 > /sys/block/sde/device/delete was issued. After pulling the disk out of the tray all zpool commands started to hang. Putting the drive back in did not cure the situation. The drive re-appeared as /dev/sdi btw.

Rayn0r avatar Sep 24 '18 19:09 Rayn0r

I have the same problem with processes stuck in D state and no solution yet... See the "closed" issue https://github.com/zfsonlinux/zfs/issues/3667.

remyd1 avatar Jan 28 '19 13:01 remyd1

There are unfortunately a number of other issues related to this one. As mentioned in the comments, I've been treating #5242 as the "main" issue. There are also a handful of related, but different problems in this area which have been reported.

This particular issue seems to be concentrated on the case in which a pool has become suspended via a device removal and can't be un-suspended after the device is made available again. I'll note that it definitely should be possible to continue using a pool which has been suspended due to a missing device, so long as the device can be made available again. Once the device is made available again, a zpool clear should bring the pool back. However, there's currently a bug in the process: if the device doesn't come back at the same physical path in "/dev", it doesn't work. For example, if I run a test with a USB stick at "/dev/sdc", make a pool on it, and do an unplug/plug cycle, it will normally come back as "/dev/sdd". The "clear" logic tries to (re)open the device using only the path it knows about, which is the problem. In that case it can be worked around by making symlinks pointing at the new device. However, both the problem and this particular bug can be avoided by importing the pool using stable paths (i.e. zpool import -d /dev/disk/by-id tank), because when the device re-appears, the stable path will be created again. When this happens, a zpool clear will bring the pool back to an operational state. I did just test all this on current master code and it works just fine.
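
A rough sketch of that approach, with tank as a placeholder pool name (the stable-path import has to happen before the failure):

zpool export tank
zpool import -d /dev/disk/by-id tank   # import using stable device paths instead of /dev/sdX
# ... the device is unplugged, the pool suspends, then the device is plugged back in;
# the by-id symlink reappears even if the kernel assigned a new /dev/sdX name ...
zpool clear tank                       # re-opens the device and un-suspends the pool
zpool status tank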

The other related issues, such as a "lost IO request" or suspension due to a device removal in which the device can't be made to re-appear (due to a broken device, dodgy driver, etc.), are what I've been concentrating on fixing (particularly the case in which an IO request is "lost" somehow). I'm pretty sure I outlined the work I presented at the 2018 OpenZFS summit hackathon in one of these related issues. Quick summary of this related work: it is somewhat in stasis, waiting for TRIM to be committed at the very least.

dweeezil avatar Jan 28 '19 15:01 dweeezil

I'd like to drag this one back out as something that'd be nice to have fixed.

In my scenario, I use a separate external disk as a single-disk pool that I use to back up my other pool. If this drive is yanked out before exporting the pool, I get a kernel panic and can no longer access any pools on my machine, as the "zpool" and "zfs" commands just freeze.

I'm unfortunately not technical at all with ZFS under the hood, but I use it a lot and this seems to be a serious bug, no?

marker5a avatar Feb 28 '20 03:02 marker5a

Hello,

Just wanted to share an experience I had last week which very much relates to this topic (though the circumstances differ). I had both of my pools suspended with the status "insufficient replicas, restore pool from backup...". In a mishap, I ran udevadm trigger with none of my disk aliases present in vdev_id.conf, meaning none of my disks were available from ZFS's point of view. I quickly restored my vdev_id.conf, did a zpool export / import and a zpool clear, and was back in business. No reboot needed.

Running Ubuntu 18.04.4 LTS / ZFS 0.8.3.

jilted82 avatar Feb 28 '20 08:02 jilted82

@jilted82, I think your case was different from what is being reported here: you had a "logical" problem (missing vdev_id.conf), we are talking about physical problems (ie, a physical device that ZFS was already using, physically disappearing from the system).

DurvalMenezes avatar Mar 04 '20 12:03 DurvalMenezes

Yeah, I'm having the same thoughts... Seems to be a different set of circumstances on my end. Not sure what other info I can provide other than:

ZFS version: 0.8.3-1
Distro: Arch Linux
Kernel: 5.4.15-1-ck

marker5a avatar Mar 04 '20 14:03 marker5a

@jilted82, I think your case was different from what is being reported here: you had a "logical" problem (missing vdev_id.conf), we are talking about physical problems (ie, a physical device that ZFS was already using, physically disappearing from the system).

I should clarify that I do realize that my scenario is different from what is beeing addressed in this thread. My intention was to share my experience from a somewhat similar scenario in case that would be of any help to the developers :)

jilted82 avatar Mar 09 '20 21:03 jilted82

The only way to recover without a reboot is to insert the Linux device mapper between ZFS and the physical storage.

When things go badly wrong, you replace the dm-linear table with a dm-error table, and then the next zpool/zfs command will fail instead of hanging (see this comment for how to do it: https://github.com/openzfs/zfs/issues/5242#issuecomment-529876483).

You still cannot completely remove a suspended pool from memory, but at least you can keep working.
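
A rough sketch of the idea (not the exact script from the linked comment); tank, pool0lun and /dev/sdb are placeholder names:

# Set up ahead of time: put a dm-linear mapping between ZFS and the disk.
SIZE=$(blockdev --getsz /dev/sdb)                 # device size in 512-byte sectors
echo "0 $SIZE linear /dev/sdb 0" | dmsetup create pool0lun
zpool create tank /dev/mapper/pool0lun

# When the disk disappears and the pool starts hanging, swap in an error
# target so outstanding and new IO fails with EIO instead of blocking.
dmsetup suspend --noflush pool0lun
echo "0 $SIZE error" | dmsetup load pool0lun
dmsetup resume pool0lun

# Once the disk (or a replacement) is available again, restore the linear
# table pointing at it and clear the pool.
dmsetup suspend pool0lun
echo "0 $SIZE linear /dev/sdb 0" | dmsetup load pool0lun
dmsetup resume pool0lun
zpool clear tank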

mailinglists35 avatar Mar 09 '20 22:03 mailinglists35

Hi,

I had this problem again today. The problem is that I cannot use hotplug disks to clear a failed zpool's status while letting the other zpools continue to work as usual... This is really annoying.

Kind regards

remyd1 avatar Mar 17 '20 12:03 remyd1

Yes you can.

Insert the Linux device mapper between ZFS and the physical device.

When it fails, replace the dm-linear table with a dm-error table, swap the disk, then re-create the dm-linear table.

It does not require destroying and recreating the pool.

It's an ugly workaround, but it keeps you running.

mailinglists35 avatar Mar 18 '20 16:03 mailinglists35

Hi @mailinglists35

I understood what you said, but I wasn't able to do it myself.

Then, I saw your recover script (https://gist.github.com/mailinglists35/65cf2f165f543243157c2aa573e75a49#gistcomment-3016376).

Do you think it could be a bit more automated (I saw references to physical ATA IDs), or better packaged and more thoroughly tested?

Thanks, Best regards, Rémy

remyd1 avatar Mar 20 '20 10:03 remyd1

I am afraid I have no resources available for that.

I made the finding available for anyone who is interested in solving the annoyance, but it is not intended to be something I support.

I am satisfied that it even works... and happy if it helps somebody else, but they must work to adapt it to their configuration.

Also be aware that I use dmesg at some point, which may or may not already be flooded by other sources - you can use journalctl -kb, which retains the full kernel log since boot.

mailinglists35 avatar Mar 23 '20 01:03 mailinglists35

New old story time around critical ZFS bugs

Four years later

Four years later, this major critical bug, which makes ZFS unusable in some situations and turns this promising (old/young) filesystem into an unsafe one because of the potential for data loss, leaves me unhappy that I cannot seriously consider using it as long as this critical bug is not resolved.

Thoughts on the next step to fix the bug

I think the ZFS dev team (maybe not) doesn't understand the priority of safety around this crucial problem. I think that because I believe they should be able to fix it; they can be very good.

Choice of priorities

Never mind; making ZFS very stable and safe does not seem to be a priority (I can think this after 4 years of an unresolved crucial bug), and there are other great filesystems... I also understand that for some businesses the priority is to invest in great promotional news feeds (against fake news, for sure... everybody knows that). That is the reality of this world, isn't it?

Repairing the "best filesystem" with tape... really?

I also appreciate the solutions offered by other users who experience the problem: tricky methods that make the system more complicated to administer by adding yet another layer. They are really good at finding tape to fix things, and that is fantastic, by the way. I love DIY in my garden. This pushes ZFS back into my third tier of filesystems, to be used for unimportant data that I can afford to lose and to play and test with. Thank you for that, guys...

KISS and safe for the long run

I'm choosing to keep it simple, stupid, and not to rely on workarounds or on unfinished filesystems to hold my precious data.

Best wishes and kind thoughts for the future

I hope the best for ZFS this year and the next, with consideration given to concentrating on "priorities first". And to be clear, the priority should maybe be "safety first" and a robust design.

Let's see at the next story time. To be continued...

jerome-diver avatar Mar 27 '20 08:03 jerome-diver

Apparently work is being done in https://github.com/openzfs/zfs/pull/11082 to fix this.

mailinglists35 avatar Nov 05 '20 00:11 mailinglists35

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Nov 05 '21 04:11 stale[bot]