'zpool import' hangs, `zpool import -o readonly=true` succeeds

quadniko opened this issue 7 months ago · 4 comments

System information

Distribution Name | Ubuntu
Distribution Version | Noble
Kernel Version | 6.8.0-58-generic
Architecture | x86_64
OpenZFS Version | zfs-2.2.2-0ubuntu9.2

Describe the problem you're observing

A few weeks ago, after a routine reboot, one of our zpools would hang when imported from the cache file. The zpool is about 1.5 PB in size and we estimate it had ~50k datasets and snapshots on it.

After hanging for ~10 days, we rebooted in emergency mode and looked for disk failures. When we saw no failed disk, we continued to boot into multi-user mode and the zpool came up within ~5 minutes. zpool status reported all disks as healthy and indicated that a scrub was in progress but not making progress. When re-issuing the zpool scrub command, we discovered that the zpool had unexpectedly been mounted read-only.

At this point, we rebooted again to confirm that the zpool would come up reliably (and hopefully in rw mode). However, this time the import hung again. After ~24h we upgraded the kernel and operating system to the versions shown above (removing the cachefile in the process). At this point we got the following output:

# zpool import -o readonly --rewind-to-checkpoint 
   pool: backup
     id: [...]
  state: ONLINE
status: Some supported features are not enabled on the pool.
	(Note that they may be intentionally disabled if the
	'compatibility' property is set.)
 action: The pool can be imported using its name or numeric identifier, though
	some features will not be available without an explicit 'zpool upgrade'.
 config:

	backup                                                      ONLINE
	  raidz3-0                                                  ONLINE
	    pci-0000:b1:00.0-sas-exp0x5000ccab0549b8f9-phy0-lun-0   ONLINE
	    pci-0000:b1:00.0-sas-exp0x5000ccab0549b8f9-phy10-lun-0  ONLINE
	    pci-0000:b1:00.0-sas-exp0x5000ccab0549b8f9-phy11-lun-0  ONLINE
	    pci-0000:b1:00.0-sas-exp0x5000ccab0549b8f9-phy12-lun-0  ONLINE
[...]
	  raidz3-1                                                  ONLINE
	    pci-0000:b1:00.0-sas-exp0x5000ccab0549b8f9-phy40-lun-0  ONLINE
	[...]
	    pci-0000:b1:00.0-sas-exp0x5000ccab0549b8f9-phy8-lun-0   ONLINE
	    pci-0000:b1:00.0-sas-exp0x5000ccab0549b8f9-phy9-lun-0   ONLINE
	    pci-0000:b1:00.0-sas-exp0x5000ccab0549b8fb-phy0-lun-0   ONLINE
	  raidz3-2                                                  ONLINE
[...]

smartctl reported a healthy status for every disk as well.
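For reference, the kind of bulk check meant here might look like the following sketch (smartctl is from smartmontools; the wwn-* device selection is illustrative only):

for d in /dev/disk/by-id/wwn-*; do
    case "$d" in *-part*) continue ;; esac      # skip partition links
    echo "== $d =="
    smartctl -H "$d" | grep -iE 'smart (overall-)?health'
done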

However, when attempting to import the pool, we either got errors:

# zpool import -o readonly --rewind-to-checkpoint -a -f
cannot import 'backup': checkpoint does not exist
	Destroy and re-create the pool from
	a backup source.

# zpool import -o readonly --rewind-to-checkpoint -f backup
cannot import 'backup': checkpoint does not exist
	Destroy and re-create the pool from
	a backup source.

...or the import command hung again:

# zpool import -o readonly -F -f backup

This time, after about 6 hours, we got the following message in the kernel logs:

kernel: WARNING: Pool 'backup' has encountered an uncorrectable I/O failure and has been suspended.

However, smartctl still listed all disks as healthy, and there were no other messages indicating which disk (if any) had been affected.

At this point we are not sure if we have found a bug in ZFS, or if we are dealing with a faulty disk. Any recommendations on how to distinguish between the two (and potentially identify which disk is faulty) would be appreciated.
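One generic way to tell a hung disk from a software wait is to watch per-device latency and utilisation while the import is stuck, and to check the kernel log for transport errors; a rough sketch, not specific to this setup (iostat is from sysstat):

iostat -x 5 3                                    # await / %util per disk
dmesg -T | grep -iE 'i/o error|timeout|reset' | tail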

quadniko, Jun 05 '25 11:06

Some more information:

Importing with -o readonly=yes works (taking about 5 minutes). Only importing read-write hangs.

blktrace shows a steady stream of I/O requests to the disks, but none of them seem to originate from the zpool command; they are all from z_rd_iss and txg_sync, e.g.:

  8,192 15       25     2.601617543 22136  A   R 33974611072 + 8 <- (8,193) 33974609024
  8,192 15       26     2.601617651 22136  Q   R 33974611072 + 8 [z_rd_iss]
  8,192 15       27     2.601618089 22136  G   R 33974611072 + 8 [z_rd_iss]
  8,192 15       28     2.601619294 22136  P   N [z_rd_iss]
  8,192 15       29     2.601619495 22136  U   N [z_rd_iss] 1
  8,192 15       30     2.601619714 22136  I   R 33974611072 + 8 [z_rd_iss]
  8,192 15       31     2.601620779 22136  D   R 33974611072 + 8 [z_rd_iss]
  8,192 15       32     2.604587471     0  C   R 33974611072 + 8 [0]
 67,224 13       32     2.565621543     0  C   R 27258321760 + 8 [0]
 66,208 15       18     2.565644350 22136  A   R 34960332984 + 8 <- (66,209) 34960330936
 66,208 15       19     2.565644473 22136  Q   R 34960332984 + 8 [z_rd_iss]
 66,208 15       20     2.565645002 22136  G   R 34960332984 + 8 [z_rd_iss]
 66,208 15       21     2.565646399 22136  P   N [z_rd_iss]
 66,208 15       22     2.565646593 22136  U   N [z_rd_iss] 1
 66,208 15       23     2.565646838 22136  I   R 34960332984 + 8 [z_rd_iss]
 66,208 15       24     2.565648775 22136  D   R 34960332984 + 8 [z_rd_iss]
 66,208 15       25     2.570782977     0  C   R 34960332984 + 8 [0]
 68,64  13       17     2.581605664 25065  A   R 4951839112 + 8 <- (68,65) 4951837064
 68,64  13       18     2.581605813 25065  Q   R 4951839112 + 8 [txg_sync]
 68,64  13       19     2.581606674 25065  G   R 4951839112 + 8 [txg_sync]
 68,64  13       20     2.581608016 25065  P   N [txg_sync]
 68,64  13       21     2.581608261 25065  U   N [txg_sync] 1
 68,64  13       22     2.581608494 25065  I   R 4951839112 + 8 [txg_sync]
 68,64  13       23     2.581609890 25065  D   R 4951839112 + 8 [txg_sync]
 68,64  13       24     2.586092225     0  C   R 4951839112 + 8 [0]
 66,240 13       25     2.586114792 22139  A   R 7812989376 + 8 <- (66,241) 7812987328
 66,240 13       26     2.586114913 22139  Q   R 7812989376 + 8 [z_rd_iss]
 66,240 13       27     2.586115617 22139  G   R 7812989376 + 8 [z_rd_iss]
 66,240 13       28     2.586116749 22139  P   N [z_rd_iss]
 66,240 13       29     2.586116949 22139  U   N [z_rd_iss] 1

while

# blktrace -I ~/harddisks.txt -a complete -o - | blkparse -i - | grep zpool
[nothing for 900+ seconds]

The stack trace of the stuck zpool command does not seem to change:

Jun 09 12:17:54 kernel: task:zpool           state:D stack:0     pid:15721 tgid:15721 ppid:15469  flags:0x00004002
Jun 09 12:17:54 kernel: Call Trace:
Jun 09 12:17:54 kernel:  <TASK>
Jun 09 12:17:54 kernel:  __schedule+0x27c/0x6b0
Jun 09 12:17:54 kernel:  ? try_to_wake_up+0x11b/0x640
Jun 09 12:17:54 kernel:  schedule+0x33/0x110
Jun 09 12:17:54 kernel:  io_schedule+0x46/0x80
Jun 09 12:17:54 kernel:  cv_wait_common+0xab/0x140 [spl]
Jun 09 12:17:54 kernel:  ? __pfx_autoremove_wake_function+0x10/0x10
Jun 09 12:17:54 kernel:  __cv_wait_io+0x18/0x30 [spl]
Jun 09 12:17:54 kernel:  txg_wait_synced_impl+0x105/0x150 [zfs]
Jun 09 12:17:54 kernel:  txg_wait_synced+0x10/0x60 [zfs]
Jun 09 12:17:54 kernel:  spa_config_update+0x152/0x1a0 [zfs]
Jun 09 12:17:54 kernel:  spa_import+0x5c3/0x670 [zfs]
Jun 09 12:17:54 kernel:  zfs_ioc_pool_import+0x163/0x180 [zfs]
Jun 09 12:17:54 kernel:  zfsdev_ioctl_common+0x596/0x6a0 [zfs]
Jun 09 12:17:54 kernel:  ? __check_object_size.part.0+0x72/0x150
Jun 09 12:17:54 kernel:  zfsdev_ioctl+0x57/0xf0 [zfs]
Jun 09 12:17:54 kernel:  __x64_sys_ioctl+0xa0/0xf0
Jun 09 12:17:54 kernel:  x64_sys_call+0x12a3/0x25a0
Jun 09 12:17:54 kernel:  do_syscall_64+0x7f/0x180
Jun 09 12:17:54 kernel:  ? handle_pte_fault+0x1cb/0x1d0
Jun 09 12:17:54 kernel:  ? __handle_mm_fault+0x654/0x800
Jun 09 12:17:54 kernel:  ? __count_memcg_events+0x6b/0x120
Jun 09 12:17:54 kernel:  ? count_memcg_events.constprop.0+0x2a/0x50
Jun 09 12:17:54 kernel:  ? handle_mm_fault+0xad/0x380
Jun 09 12:17:54 kernel:  ? do_user_addr_fault+0x333/0x670
Jun 09 12:17:54 kernel:  ? irqentry_exit_to_user_mode+0x7b/0x260
Jun 09 12:17:54 kernel:  ? irqentry_exit+0x43/0x50
Jun 09 12:17:54 kernel:  ? clear_bhb_loop+0x15/0x70
Jun 09 12:17:54 kernel:  ? clear_bhb_loop+0x15/0x70
Jun 09 12:17:54 kernel:  ? clear_bhb_loop+0x15/0x70
Jun 09 12:17:54 kernel:  entry_SYSCALL_64_after_hwframe+0x78/0x80
Jun 09 12:17:54 kernel: RIP: 0033:0x7078c0f24ded
Jun 09 12:17:54 kernel: RSP: 002b:00007ffd074bd670 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jun 09 12:17:54 kernel: RAX: ffffffffffffffda RBX: 00005e250de84c30 RCX: 00007078c0f24ded
Jun 09 12:17:54 kernel: RDX: 00007ffd074be030 RSI: 0000000000005a02 RDI: 0000000000000003
Jun 09 12:17:54 kernel: RBP: 00007ffd074bd6c0 R08: 0000000000000000 R09: 0000000000000000
Jun 09 12:17:54 kernel: R10: 0000000000000001 R11: 0000000000000246 R12: 00005e250dbef4e0
Jun 09 12:17:54 kernel: R13: 00007ffd074be030 R14: 00005e250dc262d0 R15: 00005e250dbfa150
Jun 09 12:17:54 kernel:  </TASK>
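For reference, such blocked-task dumps can also be triggered on demand; a sketch, assuming the sysrq interface is enabled (kernel.sysrq):

echo w > /proc/sysrq-trigger      # dump stacks of all tasks in uninterruptible (D) state
dmesg -T | tail -n 200            # the dump lands in the kernel log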

ghost, Jun 09 '25 12:06

Found a few more things that might be helpful:

# cat /proc/spl/kstat/zfs/import_progress
pool_guid            load_state     multihost_secs max_txg      pool_name

# cat /proc/spl/kstat/zfs/backup/state
ONLINE

# cat /proc/spl/kstat/zfs/backup/txgs
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime       
9763857  2125045610158    C     0            1594417152   9112899584   387304   2210920  302290160    3568         233859       1757877021831
9763858  2125347900318    S     0            0            0            0        0        1757877274953 5306         271048       0           
9763859  3883225175271    O     0            0            0            0        0        0            0            0            0  

# tail  /proc/spl/kstat/zfs/dbgmsg 
1749475864   metaslab.c:2445:metaslab_load_impl(): metaslab_load: txg 9763858, spa backup, vdev_id 2, ms_id 6983, smp_length 66648, unflushed_allocs 0, unflushed_frees 0, freed 0, defer 0 + 180224, unloaded time 3883217 ms, loading_time 31 ms, ms_max_size 49152, max size error 49152, old_weight 3c000000000000f, new_weight 3c000000000000f
1749475864   metaslab.c:2445:metaslab_load_impl(): metaslab_load: txg 9763858, spa backup, vdev_id 2, ms_id 8529, smp_length 61808, unflushed_allocs 0, unflushed_frees 0, freed 0, defer 0 + 0, unloaded time 3883230 ms, loading_time 26 ms, ms_max_size 49152, max size error 49152, old_weight 3c000000000000f, new_weight 3c0000000000012
1749475864   metaslab.c:2445:metaslab_load_impl(): metaslab_load: txg 9763858, spa backup, vdev_id 2, ms_id 8639, smp_length 18296, unflushed_allocs 0, unflushed_frees 0, freed 0, defer 0 + 98304, unloaded time 3883236 ms, loading_time 31 ms, ms_max_size 49152, max size error 49152, old_weight 3c000000000000f, new_weight 3c000000000000f
1749475865   metaslab.c:2445:metaslab_load_impl(): metaslab_load: txg 9763858, spa backup, vdev_id 2, ms_id 38323, smp_length 4832, unflushed_allocs 0, unflushed_frees 0, freed 0, defer 0 + 0, unloaded time 3883224 ms, loading_time 1076 ms, ms_max_size 17179820032, max size error 17179820032, old_weight 840000000000001, new_weight 840000000000001

# rg -v -N metaslab.c dbgmsg | tail
1749475845   vdev.c:5509:vdev_deadman(): slow vdev: /dev/disk/by-id/wwn-0x5000cca2beaf4ef0-part1 has 1 active IOs
1749475845   vdev.c:5509:vdev_deadman(): slow vdev: /dev/disk/by-id/wwn-0x5000cca2beae362c-part1 has 1 active IOs
1749475845   vdev.c:5509:vdev_deadman(): slow vdev: /dev/disk/by-id/wwn-0x5000cca2beaf56c4-part1 has 2 active IOs
1749475845   vdev.c:5509:vdev_deadman(): slow vdev: /dev/disk/by-id/wwn-0x5000cca2bebbd8d8-part1 has 1 active IOs
1749475845   vdev.c:5509:vdev_deadman(): slow vdev: /dev/disk/by-id/wwn-0x5000cca2be5c5f8c-part1 has 1 active IOs
1749475845   vdev.c:5509:vdev_deadman(): slow vdev: /dev/disk/by-id/wwn-0x5000cca2be717e50-part1 has 1 active IOs
1749475845   vdev.c:5509:vdev_deadman(): slow vdev: /dev/disk/by-id/wwn-0x5000cca2be517a50-part1 has 1 active IOs
1749475845   vdev.c:5509:vdev_deadman(): slow vdev: /dev/disk/by-id/wwn-0x5000cca2be4c0538-part1 has 1 active IOs
1749475845   vdev.c:5509:vdev_deadman(): slow vdev: /dev/disk/by-id/wwn-0x5000cca2be2c43ec-part1 has 1 active IOs
1749475845   vdev.c:5509:vdev_deadman(): slow vdev: /dev/disk/by-id/wwn-0x5000cca2beafc79c-part1 has 1 active IOs

# rg -v -N 'metaslab.c|vdev_dead' dbgmsg | tail 
1749473353   spa_misc.c:418:spa_load_note(): spa_load(backup, config trusted): read 0 log space maps (0 total blocks - blksz = 131072 bytes) in 0 ms
1749474106   spa_misc.c:418:spa_load_note(): spa_load(backup, config trusted): spa_load_verify found 0 metadata errors and 1 data errors
1749474106   mmp.c:239:mmp_thread_start(): MMP thread started pool 'backup' gethrtime 2125045596856
1749474106   spa.c:8829:spa_async_request(): spa=backup async request task=1
1749474106   spa.c:8829:spa_async_request(): spa=backup async request task=2048
1749474106   spa_misc.c:418:spa_load_note(): spa_load(backup, config trusted): LOADED
1749474107   spa_history.c:306:spa_history_log_sync(): txg 9763857 open pool version 5000; software version zfs-2.2.2-0-g494aaaed8-dist; uts <hostname> 6.8.0-58-generic #60-Ubuntu SMP PREEMPT_DYNAMIC Fri Mar 14 18:29:48 UTC 2025 x86_64
1749474739   spa_misc.c:640:spa_deadman(): slow spa_sync: started 632 seconds ago, calls 1
1749474800   spa_misc.c:640:spa_deadman(): slow spa_sync: started 693 seconds ago, calls 2
1749474862   spa_misc.c:640:spa_deadman(): slow spa_sync: started 755 seconds ago, calls 3
1749474923   spa_misc.c:640:spa_deadman(): slow spa_sync: started 816 seconds ago, calls 4
1749474984   spa_misc.c:640:spa_deadman(): slow spa_sync: started 878 seconds ago, calls 5
1749475046   spa_misc.c:640:spa_deadman(): slow spa_sync: started 939 seconds ago, calls 6
1749475107   spa_misc.c:640:spa_deadman(): slow spa_sync: started 1000 seconds ago, calls 7
1749475169   spa_misc.c:640:spa_deadman(): slow spa_sync: started 1062 seconds ago, calls 8
1749475230   spa_misc.c:640:spa_deadman(): slow spa_sync: started 1123 seconds ago, calls 9
1749475292   spa_misc.c:640:spa_deadman(): slow spa_sync: started 1185 seconds ago, calls 10
1749475353   spa_misc.c:640:spa_deadman(): slow spa_sync: started 1246 seconds ago, calls 11
1749475415   spa_misc.c:640:spa_deadman(): slow spa_sync: started 1308 seconds ago, calls 12
1749475476   spa_misc.c:640:spa_deadman(): slow spa_sync: started 1369 seconds ago, calls 13
1749475537   spa_misc.c:640:spa_deadman(): slow spa_sync: started 1431 seconds ago, calls 14
1749475599   spa_misc.c:640:spa_deadman(): slow spa_sync: started 1492 seconds ago, calls 15
1749475660   spa_misc.c:640:spa_deadman(): slow spa_sync: started 1553 seconds ago, calls 16
1749475722   spa_misc.c:640:spa_deadman(): slow spa_sync: started 1615 seconds ago, calls 17
1749475783   spa_misc.c:640:spa_deadman(): slow spa_sync: started 1676 seconds ago, calls 18
1749475845   spa_misc.c:640:spa_deadman(): slow spa_sync: started 1738 seconds ago, calls 19
# ps ax | grep txg
  28508 ?        S      0:00 [txg_quiesce]
  28509 ?        D      0:05 [txg_sync]

# cat /proc/28509/stack
[<0>] __cv_timedwait_common+0x139/0x180 [spl]
[<0>] __cv_timedwait_io+0x19/0x30 [spl]
[<0>] zio_wait+0x119/0x230 [zfs]
[<0>] dsl_pool_sync_mos+0x3c/0xc0 [zfs]
[<0>] dsl_pool_sync+0x3b3/0x410 [zfs]
[<0>] spa_sync_iterate_to_convergence+0xde/0x220 [zfs]
[<0>] spa_sync+0x321/0x620 [zfs]
[<0>] txg_sync_thread+0x1e7/0x250 [zfs]
[<0>] thread_generic_wrapper+0x5c/0x70 [spl]
[<0>] kthread+0xef/0x120
[<0>] ret_from_fork+0x44/0x70
[<0>] ret_from_fork_asm+0x1b/0x30

# cat /proc/28509/stack
[<0>] __cv_timedwait_common+0x139/0x180 [spl]
[<0>] __cv_timedwait_io+0x19/0x30 [spl]
[<0>] zio_wait+0x119/0x230 [zfs]
[<0>] dsl_pool_sync_mos+0x3c/0xc0 [zfs]
[<0>] dsl_pool_sync+0x3b3/0x410 [zfs]
[<0>] spa_sync_iterate_to_convergence+0xde/0x220 [zfs]
[<0>] spa_sync+0x321/0x620 [zfs]
[<0>] txg_sync_thread+0x1e7/0x250 [zfs]
[<0>] thread_generic_wrapper+0x5c/0x70 [spl]
[<0>] kthread+0xef/0x120
[<0>] ret_from_fork+0x44/0x70
[<0>] ret_from_fork_asm+0x1b/0x30

ghost, Jun 09 '25 13:06

I don't see any indications of slow I/O though. For example, right now:

# grep -Ev '^[ 0]+$'  /sys/block/sd*/inflight
/sys/block/sdai/inflight
       1        0

i.e., there's only one disk (sdai) that currently has any I/O pending, and that disk isn't part of the pool.
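A small sketch of how this can be sampled over time, to show that the idle periods are frequent rather than momentary:

while sleep 5; do
    date +%T
    grep -Ev '^[ 0]+$' /sys/block/sd*/inflight || echo "  (no inflight I/O on any sd* device)"
done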

ghost, Jun 09 '25 13:06

Is anyone interested in investigating this, or is there any debugging information I should save for later?

Otherwise I'll wipe and recreate the pool.

ghost, Jun 16 '25 08:06

I've just encountered the same issue on a different zpool (similar size, ~100 disks, ~2 PB total capacity). Attempting to import the zpool normally (read-write) just hangs. A read-only import succeeds after ~5 minutes.

Does really no one have any suggestions, or any interest in looking into this?

Nikratio, Sep 06 '25 16:09

@Nikratio It should not "just hang" if the hardware is behaving (i.e., disks are not hanging indefinitely on I/Os). But depending on versions, workload and other factors it may sometimes take time. It might be replaying the spacemap log (which was optimized some time ago), it may be replaying the ZIL, or it might be something else that is not done during a read-only import.
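One way to narrow down which of these phases a read-write import is spending its time in is to enable dprintf-style debug messages before retrying and then watch dbgmsg; a sketch, assuming the module parameters live at their usual Linux paths:

echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
echo 1 > /sys/module/zfs/parameters/zfs_flags    # bit 0x1 = ZFS_DEBUG_DPRINTF
zpool import -N backup                           # -N: import without mounting datasets
# ...and from a second terminal:
watch -n 10 'tail -n 30 /proc/spl/kstat/zfs/dbgmsg'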

amotin, Sep 09 '25 19:09

As far as I can ascertain, there are frequent periods with no outstanding I/O to the disks (so it's not waiting for that), overall disk bandwidth is close to zero, and CPU utilisation is negligible.

So it might be doing something, but it is doing so at a rate that is much slower than what is theoretically possible, and it is so slow that for all practical purposes it is hanging (the first zpool was in this state for more than 10 days).

Is there a way to use the kernel stack traces to ascertain what it is doing, and what it is waiting for?

Nikratio, Sep 09 '25 20:09

@Nikratio The import process logs what it is doing into ZFS debug messages (dbgmsg), which you can get via procfs on Linux or sysctl on FreeBSD, as you can see @quadniko quoted above from his system. I don't think it covers filesystem mounts and hence ZIL replay, but it would be a good start. You can also try to get kernel stacks for the relevant processes via procfs.
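A minimal sketch of the kind of procfs collection meant here (the thread names such as txg_sync and z_rd_iss are taken from the traces above; adjust the pgrep pattern as needed):

cat /proc/spl/kstat/zfs/dbgmsg > /tmp/dbgmsg.$(date +%s)   # snapshot the debug log
for pid in $(pgrep 'zpool|txg_sync|txg_quiesce|z_rd_iss|z_wr_iss'); do
    printf '=== %s (%s) ===\n' "$pid" "$(cat /proc/$pid/comm)"
    cat /proc/$pid/stack                                   # kernel stack of the thread
done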

amotin, Sep 09 '25 20:09

Ah, apologies, different account but same person: quadniko is me. So I think you have everything you requested? Or would it help if I provided the same information for the second pool?

Nikratio, Sep 10 '25 05:09

It seems to be doing something with the meta objset in sync context, but I don't see what exactly. Considering your use of HDDs and RAIDZ3, your vdevs may be pretty wide and so not the fastest in terms of IOPS. That may cause slower operation when some data is not in ARC after a reboot. Considering you have a lot of datasets, and probably many more objects when multiplied by the number of snapshots, I wonder if there might be some massive asynchronous delete operation, or something like that. But I am just speculating.

amotin, Sep 11 '25 18:09

Anything I can do to help narrow it down? Or perhaps even to speed it up or abort whatever is happening?

I'm 95% confident that no data has ever been deleted from this zpool; it's a backup system that only ever accumulates more snapshots.

Nikratio, Sep 11 '25 19:09

Thinking about this, are you saying that if someone had deleted a dataset and all its snapshots with "zfs destroy", the command would return quickly but the deferred deletion would then block importing the zpool on the next boot?

Nikratio, Sep 11 '25 19:09

There is such a thing as a deferred delete, but I don't remember many details off the top of my head.
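From the read-only import that does come up, one way to check that theory would be the 'freeing' pool property (space still owed to deferred/asynchronous destroys) together with the pool history; a sketch:

zpool get freeing,leaked backup
zpool history backup | grep -i destroy | tail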

amotin, Sep 11 '25 19:09