`zpool import` hangs, `zpool import -o readonly=on` succeeds
System information
Distribution Name | Ubuntu
Distribution Version | Noble
Kernel Version | 6.8.0-58-generic
Architecture | x86_64
OpenZFS Version | zfs-2.2.2-0ubuntu9.2
Describe the problem you're observing
A few weeks ago, after a routine reboot, one of our zpools would hang when imported from the cache file. The zpool is about 1.5 PB in size and we estimate it had ~50k datasets and snapshots on it.
After hanging for ~10 days, we rebooted into emergency mode and looked for disk failures. When we saw no failed disks, we continued to boot into multi-user mode and the zpool came up within ~5 minutes. zpool status reported all disks as healthy and showed a scrub that was in progress but not advancing. When re-issuing the zpool scrub command, we discovered that the zpool had unexpectedly been imported read-only.
At this point, we rebooted again to confirm that the zpool would come up reliably (and hopefully in rw mode). However, this time the import hung again. After ~24h we upgraded the kernel and operating system to the versions shown above (removing the cachefile in the process). At this point we got the following output:
# zpool import -o readonly --rewind-to-checkpoint
pool: backup
id: [...]
state: ONLINE
status: Some supported features are not enabled on the pool.
(Note that they may be intentionally disabled if the
'compatibility' property is set.)
action: The pool can be imported using its name or numeric identifier, though
some features will not be available without an explicit 'zpool upgrade'.
config:
backup ONLINE
raidz3-0 ONLINE
pci-0000:b1:00.0-sas-exp0x5000ccab0549b8f9-phy0-lun-0 ONLINE
pci-0000:b1:00.0-sas-exp0x5000ccab0549b8f9-phy10-lun-0 ONLINE
pci-0000:b1:00.0-sas-exp0x5000ccab0549b8f9-phy11-lun-0 ONLINE
pci-0000:b1:00.0-sas-exp0x5000ccab0549b8f9-phy12-lun-0 ONLINE
[...]
raidz3-1 ONLINE
pci-0000:b1:00.0-sas-exp0x5000ccab0549b8f9-phy40-lun-0 ONLINE
[...]
pci-0000:b1:00.0-sas-exp0x5000ccab0549b8f9-phy8-lun-0 ONLINE
pci-0000:b1:00.0-sas-exp0x5000ccab0549b8f9-phy9-lun-0 ONLINE
pci-0000:b1:00.0-sas-exp0x5000ccab0549b8fb-phy0-lun-0 ONLINE
raidz3-2 ONLINE
[...]
smartctl reported a healthy status for every disk as well.
However, when attempting to import the pool, we either got errors:
# zpool import -o readonly --rewind-to-checkpoint -a -f
cannot import 'backup': checkpoint does not exist
Destroy and re-create the pool from
a backup source.
# zpool import -o readonly --rewind-to-checkpoint -f backup
cannot import 'backup': checkpoint does not exist
Destroy and re-create the pool from
a backup source.
...or the import command hung again:
# zpool import -o readonly -F -f backup
This time, after about 6 hours, we got the following message in the kernel logs:
kernel: WARNING: Pool 'backup' has encountered an uncorrectable I/O failure and has been suspended.
However, smartctl still listed all disks as healthy, and there were no other messages indicating which disk (if any) had been affected.
At this point we are not sure whether we have found a bug in ZFS or whether we are dealing with a faulty disk. Any recommendations on how to distinguish between the two (and potentially identify which disk is faulty) would be appreciated.
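For reference, these are the additional checks we are considering in order to look for disk-level errors that smartctl might miss (just a sketch; some of these may only work once the pool is imported read-only):
# journalctl -k | grep -iE 'I/O error|blk_update_request|sense key'
# zpool events -v
# zpool status -s backup
(zpool status -s adds a per-vdev column with the number of slow I/Os.)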
Some more information:
Importing with `-o readonly=on` works (taking about 5 minutes); only importing read-write hangs.
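To be explicit, these are roughly the two invocations (a sketch; pool name as above):
# zpool import -o readonly=on backup
(completes in about 5 minutes)
# zpool import backup
(hangs indefinitely)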
blktrace shows a steady stream of I/O requests to the disks, but none of them seem to originate from the zpool command; they all come from z_rd_iss and txg_sync, e.g.
8,192 15 25 2.601617543 22136 A R 33974611072 + 8 <- (8,193) 33974609024
8,192 15 26 2.601617651 22136 Q R 33974611072 + 8 [z_rd_iss]
8,192 15 27 2.601618089 22136 G R 33974611072 + 8 [z_rd_iss]
8,192 15 28 2.601619294 22136 P N [z_rd_iss]
8,192 15 29 2.601619495 22136 U N [z_rd_iss] 1
8,192 15 30 2.601619714 22136 I R 33974611072 + 8 [z_rd_iss]
8,192 15 31 2.601620779 22136 D R 33974611072 + 8 [z_rd_iss]
8,192 15 32 2.604587471 0 C R 33974611072 + 8 [0]
67,224 13 32 2.565621543 0 C R 27258321760 + 8 [0]
66,208 15 18 2.565644350 22136 A R 34960332984 + 8 <- (66,209) 34960330936
66,208 15 19 2.565644473 22136 Q R 34960332984 + 8 [z_rd_iss]
66,208 15 20 2.565645002 22136 G R 34960332984 + 8 [z_rd_iss]
66,208 15 21 2.565646399 22136 P N [z_rd_iss]
66,208 15 22 2.565646593 22136 U N [z_rd_iss] 1
66,208 15 23 2.565646838 22136 I R 34960332984 + 8 [z_rd_iss]
66,208 15 24 2.565648775 22136 D R 34960332984 + 8 [z_rd_iss]
66,208 15 25 2.570782977 0 C R 34960332984 + 8 [0]
68,64 13 17 2.581605664 25065 A R 4951839112 + 8 <- (68,65) 4951837064
68,64 13 18 2.581605813 25065 Q R 4951839112 + 8 [txg_sync]
68,64 13 19 2.581606674 25065 G R 4951839112 + 8 [txg_sync]
68,64 13 20 2.581608016 25065 P N [txg_sync]
68,64 13 21 2.581608261 25065 U N [txg_sync] 1
68,64 13 22 2.581608494 25065 I R 4951839112 + 8 [txg_sync]
68,64 13 23 2.581609890 25065 D R 4951839112 + 8 [txg_sync]
68,64 13 24 2.586092225 0 C R 4951839112 + 8 [0]
66,240 13 25 2.586114792 22139 A R 7812989376 + 8 <- (66,241) 7812987328
66,240 13 26 2.586114913 22139 Q R 7812989376 + 8 [z_rd_iss]
66,240 13 27 2.586115617 22139 G R 7812989376 + 8 [z_rd_iss]
66,240 13 28 2.586116749 22139 P N [z_rd_iss]
66,240 13 29 2.586116949 22139 U N [z_rd_iss] 1
while
# blktrace -I ~/harddisks.txt -a complete -o - | blkparse -i - | grep zpool
[nothing for 900+ seconds]
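To get a rough per-thread breakdown of which kernel threads issue the I/O, something like the following can be used (a sketch; filtering on the issue action only):
# blktrace -I ~/harddisks.txt -a issue -o - | blkparse -i - | grep -o '\[[^]]*\]$' | sort | uniq -c | sort -rn
On our system this shows almost exclusively z_rd_iss and txg_sync, never zpool.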
The stack trace of the stuck zpool command does not seem to change:
Jun 09 12:17:54 kernel: task:zpool state:D stack:0 pid:15721 tgid:15721 ppid:15469 flags:0x00004002
Jun 09 12:17:54 kernel: Call Trace:
Jun 09 12:17:54 kernel: <TASK>
Jun 09 12:17:54 kernel: __schedule+0x27c/0x6b0
Jun 09 12:17:54 kernel: ? try_to_wake_up+0x11b/0x640
Jun 09 12:17:54 kernel: schedule+0x33/0x110
Jun 09 12:17:54 kernel: io_schedule+0x46/0x80
Jun 09 12:17:54 kernel: cv_wait_common+0xab/0x140 [spl]
Jun 09 12:17:54 kernel: ? __pfx_autoremove_wake_function+0x10/0x10
Jun 09 12:17:54 kernel: __cv_wait_io+0x18/0x30 [spl]
Jun 09 12:17:54 kernel: txg_wait_synced_impl+0x105/0x150 [zfs]
Jun 09 12:17:54 kernel: txg_wait_synced+0x10/0x60 [zfs]
Jun 09 12:17:54 kernel: spa_config_update+0x152/0x1a0 [zfs]
Jun 09 12:17:54 kernel: spa_import+0x5c3/0x670 [zfs]
Jun 09 12:17:54 kernel: zfs_ioc_pool_import+0x163/0x180 [zfs]
Jun 09 12:17:54 kernel: zfsdev_ioctl_common+0x596/0x6a0 [zfs]
Jun 09 12:17:54 kernel: ? __check_object_size.part.0+0x72/0x150
Jun 09 12:17:54 kernel: zfsdev_ioctl+0x57/0xf0 [zfs]
Jun 09 12:17:54 kernel: __x64_sys_ioctl+0xa0/0xf0
Jun 09 12:17:54 kernel: x64_sys_call+0x12a3/0x25a0
Jun 09 12:17:54 kernel: do_syscall_64+0x7f/0x180
Jun 09 12:17:54 kernel: ? handle_pte_fault+0x1cb/0x1d0
Jun 09 12:17:54 kernel: ? __handle_mm_fault+0x654/0x800
Jun 09 12:17:54 kernel: ? __count_memcg_events+0x6b/0x120
Jun 09 12:17:54 kernel: ? count_memcg_events.constprop.0+0x2a/0x50
Jun 09 12:17:54 kernel: ? handle_mm_fault+0xad/0x380
Jun 09 12:17:54 kernel: ? do_user_addr_fault+0x333/0x670
Jun 09 12:17:54 kernel: ? irqentry_exit_to_user_mode+0x7b/0x260
Jun 09 12:17:54 kernel: ? irqentry_exit+0x43/0x50
Jun 09 12:17:54 kernel: ? clear_bhb_loop+0x15/0x70
Jun 09 12:17:54 kernel: ? clear_bhb_loop+0x15/0x70
Jun 09 12:17:54 kernel: ? clear_bhb_loop+0x15/0x70
Jun 09 12:17:54 kernel: entry_SYSCALL_64_after_hwframe+0x78/0x80
Jun 09 12:17:54 kernel: RIP: 0033:0x7078c0f24ded
Jun 09 12:17:54 kernel: RSP: 002b:00007ffd074bd670 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jun 09 12:17:54 kernel: RAX: ffffffffffffffda RBX: 00005e250de84c30 RCX: 00007078c0f24ded
Jun 09 12:17:54 kernel: RDX: 00007ffd074be030 RSI: 0000000000005a02 RDI: 0000000000000003
Jun 09 12:17:54 kernel: RBP: 00007ffd074bd6c0 R08: 0000000000000000 R09: 0000000000000000
Jun 09 12:17:54 kernel: R10: 0000000000000001 R11: 0000000000000246 R12: 00005e250dbef4e0
Jun 09 12:17:54 kernel: R13: 00007ffd074be030 R14: 00005e250dc262d0 R15: 00005e250dbfa150
Jun 09 12:17:54 kernel: </TASK>
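A simple loop over /proc/<pid>/stack confirms that the stack stays the same over time (a sketch; pid 15721 taken from the trace above, run as root):
# while true; do date; cat /proc/15721/stack; sleep 30; done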
Found a few more things that might be helpful:
# cat /proc/spl/kstat/zfs/import_progress
pool_guid load_state multihost_secs max_txg pool_name
# cat /proc/spl/kstat/zfs/backup/state
ONLINE
# cat /proc/spl/kstat/zfs/backup/txgs
txg birth state ndirty nread nwritten reads writes otime qtime wtime stime
9763857 2125045610158 C 0 1594417152 9112899584 387304 2210920 302290160 3568 233859 1757877021831
9763858 2125347900318 S 0 0 0 0 0 1757877274953 5306 271048 0
9763859 3883225175271 O 0 0 0 0 0 0 0 0 0
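To see whether the syncing txg ever advances, the kstat can simply be watched over time (a sketch):
# watch -n 60 'tail -n 3 /proc/spl/kstat/zfs/backup/txgs'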
# tail /proc/spl/kstat/zfs/dbgmsg
1749475864 metaslab.c:2445:metaslab_load_impl(): metaslab_load: txg 9763858, spa backup, vdev_id 2, ms_id 6983, smp_length 66648, unflushed_allocs 0, unflushed_frees 0, freed 0, defer 0 + 180224, unloaded time 3883217 ms, loading_time 31 ms, ms_max_size 49152, max size error 49152, old_weight 3c000000000000f, new_weight 3c000000000000f
1749475864 metaslab.c:2445:metaslab_load_impl(): metaslab_load: txg 9763858, spa backup, vdev_id 2, ms_id 8529, smp_length 61808, unflushed_allocs 0, unflushed_frees 0, freed 0, defer 0 + 0, unloaded time 3883230 ms, loading_time 26 ms, ms_max_size 49152, max size error 49152, old_weight 3c000000000000f, new_weight 3c0000000000012
1749475864 metaslab.c:2445:metaslab_load_impl(): metaslab_load: txg 9763858, spa backup, vdev_id 2, ms_id 8639, smp_length 18296, unflushed_allocs 0, unflushed_frees 0, freed 0, defer 0 + 98304, unloaded time 3883236 ms, loading_time 31 ms, ms_max_size 49152, max size error 49152, old_weight 3c000000000000f, new_weight 3c000000000000f
1749475865 metaslab.c:2445:metaslab_load_impl(): metaslab_load: txg 9763858, spa backup, vdev_id 2, ms_id 38323, smp_length 4832, unflushed_allocs 0, unflushed_frees 0, freed 0, defer 0 + 0, unloaded time 3883224 ms, loading_time 1076 ms, ms_max_size 17179820032, max size error 17179820032, old_weight 840000000000001, new_weight 840000000000001
# rg -v -N metaslab.c dbgmsg | tail
1749475845 vdev.c:5509:vdev_deadman(): slow vdev: /dev/disk/by-id/wwn-0x5000cca2beaf4ef0-part1 has 1 active IOs
1749475845 vdev.c:5509:vdev_deadman(): slow vdev: /dev/disk/by-id/wwn-0x5000cca2beae362c-part1 has 1 active IOs
1749475845 vdev.c:5509:vdev_deadman(): slow vdev: /dev/disk/by-id/wwn-0x5000cca2beaf56c4-part1 has 2 active IOs
1749475845 vdev.c:5509:vdev_deadman(): slow vdev: /dev/disk/by-id/wwn-0x5000cca2bebbd8d8-part1 has 1 active IOs
1749475845 vdev.c:5509:vdev_deadman(): slow vdev: /dev/disk/by-id/wwn-0x5000cca2be5c5f8c-part1 has 1 active IOs
1749475845 vdev.c:5509:vdev_deadman(): slow vdev: /dev/disk/by-id/wwn-0x5000cca2be717e50-part1 has 1 active IOs
1749475845 vdev.c:5509:vdev_deadman(): slow vdev: /dev/disk/by-id/wwn-0x5000cca2be517a50-part1 has 1 active IOs
1749475845 vdev.c:5509:vdev_deadman(): slow vdev: /dev/disk/by-id/wwn-0x5000cca2be4c0538-part1 has 1 active IOs
1749475845 vdev.c:5509:vdev_deadman(): slow vdev: /dev/disk/by-id/wwn-0x5000cca2be2c43ec-part1 has 1 active IOs
1749475845 vdev.c:5509:vdev_deadman(): slow vdev: /dev/disk/by-id/wwn-0x5000cca2beafc79c-part1 has 1 active IOs
# rg -v -N 'metaslab.c|vdev_dead' dbgmsg | tail
1749473353 spa_misc.c:418:spa_load_note(): spa_load(backup, config trusted): read 0 log space maps (0 total blocks - blksz = 131072 bytes) in 0 ms
1749474106 spa_misc.c:418:spa_load_note(): spa_load(backup, config trusted): spa_load_verify found 0 metadata errors and 1 data errors
1749474106 mmp.c:239:mmp_thread_start(): MMP thread started pool 'backup' gethrtime 2125045596856
1749474106 spa.c:8829:spa_async_request(): spa=backup async request task=1
1749474106 spa.c:8829:spa_async_request(): spa=backup async request task=2048
1749474106 spa_misc.c:418:spa_load_note(): spa_load(backup, config trusted): LOADED
1749474107 spa_history.c:306:spa_history_log_sync(): txg 9763857 open pool version 5000; software version zfs-2.2.2-0-g494aaaed8-dist; uts <hostname> 6.8.0-58-generic #60-Ubuntu SMP PREEMPT_DYNAMIC Fri Mar 14 18:29:48 UTC 2025 x86_64
1749474739 spa_misc.c:640:spa_deadman(): slow spa_sync: started 632 seconds ago, calls 1
1749474800 spa_misc.c:640:spa_deadman(): slow spa_sync: started 693 seconds ago, calls 2
1749474862 spa_misc.c:640:spa_deadman(): slow spa_sync: started 755 seconds ago, calls 3
1749474923 spa_misc.c:640:spa_deadman(): slow spa_sync: started 816 seconds ago, calls 4
1749474984 spa_misc.c:640:spa_deadman(): slow spa_sync: started 878 seconds ago, calls 5
1749475046 spa_misc.c:640:spa_deadman(): slow spa_sync: started 939 seconds ago, calls 6
1749475107 spa_misc.c:640:spa_deadman(): slow spa_sync: started 1000 seconds ago, calls 7
1749475169 spa_misc.c:640:spa_deadman(): slow spa_sync: started 1062 seconds ago, calls 8
1749475230 spa_misc.c:640:spa_deadman(): slow spa_sync: started 1123 seconds ago, calls 9
1749475292 spa_misc.c:640:spa_deadman(): slow spa_sync: started 1185 seconds ago, calls 10
1749475353 spa_misc.c:640:spa_deadman(): slow spa_sync: started 1246 seconds ago, calls 11
1749475415 spa_misc.c:640:spa_deadman(): slow spa_sync: started 1308 seconds ago, calls 12
1749475476 spa_misc.c:640:spa_deadman(): slow spa_sync: started 1369 seconds ago, calls 13
1749475537 spa_misc.c:640:spa_deadman(): slow spa_sync: started 1431 seconds ago, calls 14
1749475599 spa_misc.c:640:spa_deadman(): slow spa_sync: started 1492 seconds ago, calls 15
1749475660 spa_misc.c:640:spa_deadman(): slow spa_sync: started 1553 seconds ago, calls 16
1749475722 spa_misc.c:640:spa_deadman(): slow spa_sync: started 1615 seconds ago, calls 17
1749475783 spa_misc.c:640:spa_deadman(): slow spa_sync: started 1676 seconds ago, calls 18
1749475845 spa_misc.c:640:spa_deadman(): slow spa_sync: started 1738 seconds ago, calls 19
# ps ax | grep txg
28508 ? S 0:00 [txg_quiesce]
28509 ? D 0:05 [txg_sync]
# cat /proc/28509/stack
[<0>] __cv_timedwait_common+0x139/0x180 [spl]
[<0>] __cv_timedwait_io+0x19/0x30 [spl]
[<0>] zio_wait+0x119/0x230 [zfs]
[<0>] dsl_pool_sync_mos+0x3c/0xc0 [zfs]
[<0>] dsl_pool_sync+0x3b3/0x410 [zfs]
[<0>] spa_sync_iterate_to_convergence+0xde/0x220 [zfs]
[<0>] spa_sync+0x321/0x620 [zfs]
[<0>] txg_sync_thread+0x1e7/0x250 [zfs]
[<0>] thread_generic_wrapper+0x5c/0x70 [spl]
[<0>] kthread+0xef/0x120
[<0>] ret_from_fork+0x44/0x70
[<0>] ret_from_fork_asm+0x1b/0x30
# cat /proc/28509/stack
[<0>] __cv_timedwait_common+0x139/0x180 [spl]
[<0>] __cv_timedwait_io+0x19/0x30 [spl]
[<0>] zio_wait+0x119/0x230 [zfs]
[<0>] dsl_pool_sync_mos+0x3c/0xc0 [zfs]
[<0>] dsl_pool_sync+0x3b3/0x410 [zfs]
[<0>] spa_sync_iterate_to_convergence+0xde/0x220 [zfs]
[<0>] spa_sync+0x321/0x620 [zfs]
[<0>] txg_sync_thread+0x1e7/0x250 [zfs]
[<0>] thread_generic_wrapper+0x5c/0x70 [spl]
[<0>] kthread+0xef/0x120
[<0>] ret_from_fork+0x44/0x70
[<0>] ret_from_fork_asm+0x1b/0x30
I don't see any indications of slow I/O though. For example, right now:
# grep -Ev '^[ 0]+$' /sys/block/sd*/inflight
/sys/block/sdai/inflight
1 0
i.e., there's only one disk (sdai) that currently has any I/O pending, and that disk isn't part of the pool.
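As a cross-check, extended device statistics (sysstat) can be watched as well; in our case %util and await for the pool members are near zero. zpool iostat may or may not respond while the import is still in progress:
# iostat -x 5
# zpool iostat -v backup 5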
Is anyone interested in investigating this, or is there any debugging information I should save for later?
Otherwise I'll wipe and recreate the pool.
I've just encountered the same issue on a different zpool (similar size, ~100 disks, ~2 PB total capacity). Attempting a regular (read-write) import just hangs; a read-only import succeeds after ~5 minutes.
Does really no one have any suggestions, or any interest in looking into this?
@Nikratio It should not "just hang" if the hardware is behaving (i.e., the disks are not hanging indefinitely on I/Os). But depending on versions, workload, and other factors it may sometimes take time. It might be replaying the spacemap log (which was optimized some time ago), it might be replaying the ZIL, or it might be something else that is not done during a read-only import.
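For example, whether log space map replay is a factor should show up in dbgmsg during import (the 'read ... log space maps' note from spa_load); a sketch:
# grep 'log space maps' /proc/spl/kstat/zfs/dbgmsg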
As far as I can ascertain, there are frequent periods with no outstanding I/O to the disks (so it's not waiting for that), overall disk bandwidth is close to zero, and CPU utilisation is negligible.
So it might be doing something, but at a rate much slower than what is theoretically possible, and so slowly that for all practical purposes it is hanging (the first zpool was in this state for more than 10 days).
Is there a way to use the kernel stack traces to ascertain what it is doing, and what it is waiting for?
@Nikratio The import process logs what it is doing into the ZFS debug messages (dbgmsg), which you can get via procfs on Linux or sysctl on FreeBSD, as you can see @quadniko quoted above from his system. I don't think it covers filesystem mounts and thus ZIL replay, but it would be a good start. You can also try to get kernel stacks for the relevant processes via procfs.
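A rough sketch of what I mean on Linux (parameter names as in zfs(4); defaults may differ between builds):
# echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
# watch -n 10 'tail -n 40 /proc/spl/kstat/zfs/dbgmsg'
# cat /proc/$(pgrep -x zpool)/stack
# cat /proc/$(pgrep -x txg_sync)/stack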
Ah, apologies, different account but same person. Quadniko is me. So I think you already have everything you requested? Or would it help if I provided the same for the second pool?
It seems to be doing something with the meta objset in sync context, but I don't see what exactly. Considering your use of HDDs and RAIDZ3, your vdevs may be pretty wide and so not the fastest in terms of IOPS. That may cause slower operation when some data is not in ARC after a reboot. Considering you have a lot of datasets, and probably many more objects when multiplied by the number of snapshots, I wonder if there might be some massive asynchronous delete operation, or something like that. But I am just speculating.
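One quick way to check that speculation: if there were a massive pending asynchronous destroy, the pool-level 'freeing' property should be non-zero, and as far as I remember it can be read from a read-only import as well:
# zpool get freeing,leaked backup
A non-zero 'freeing' value would indicate space still waiting to be reclaimed from destroyed datasets or snapshots.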
Anything I can do to help narrow it down? Or perhaps even to speed it up or abort whatever is happening?
I'm 95% confident that no data has ever been deleted from this zpool; it's a backup system that only ever accumulates more snapshots.
Thinking about this, are you saying that if someone had deleted a dataset and all its snapshots with "zfs destroy", the command would have returned quickly, but the deferred deletion could then block importing the zpool on the next boot?
There is such a thing as a deferred delete, but I don't remember the details off the top of my head.