Can't import NVME root pool after trim and scrub

Open Freewalkr opened this issue 2 years ago • 4 comments

System information

Type | Version/Name
--- | ---
Distribution Name | Arch Linux
Distribution Version | LTS
Kernel Version | 6.1.19-1
Architecture | x86_64
OpenZFS Version | zfs-2.1.99-1, ahrens-raidz-expand branch

Describe the problem you're observing

I'm using the ZFS version from the zfs-dkms-raidz-expansion-git AUR package and have root on ZFS on an NVMe drive. The NVMe drive is not in any RAID-Z; it's the only drive in the pool. I hadn't run a scrub on this pool for about half a year, and I upgraded to the raidz-expand branch about 3 months ago.

Yesterday I deleted some big files in the root pool, created and deleted a couple of snapshots, ran a trim manually (it completed without errors), and ran a scrub on the pool. While the scrub was in progress, zpool status responded normally, but after the scrub finished, zpool status hung in the "uninterruptible sleep" state. I didn't look into the logs, waited for 12 hours, and rebooted, and the system didn't come back.

I booted from Manjaro 6.1.19-1, installed on the other drive with the same ZFS version, and tried to import the root pool from the NVMe drive. The import went into uninterruptible sleep too, and there's a VERIFY panic. The dmesg output is attached below.

Describe how to reproduce the problem

  1. Use the ahrens-raidz-expand branch (maybe it's specific to the zfs-dkms-raidz-expansion-git package?)
  2. Upgrade the pool on the NVMe drive with the new features.
  3. Wait for about 3 months.
  4. Delete some files, run a trim manually.
  5. Run a scrub.

Include any warning/errors/backtraces from the system logs

dmesg output when trying to import the pool:

[   82.223202] VERIFY3(0 == dmu_object_free(spa->spa_meta_objset, spa_err_obj, tx)) failed (0 == 2)
[   82.223205] PANIC at spa_errlog.c:1068:delete_errlog()
[   82.223207] Showing stack for process 8492
[   82.223208] CPU: 5 PID: 8492 Comm: txg_sync Tainted: P           OE      6.1.19-1-MANJARO #1 389fdd3d7a99644f7437b855eb50ad5703eff2d2
[   82.223210] Hardware name: Gigabyte Technology Co., Ltd. X570S AERO G/X570S AERO G, BIOS F4c 05/12/2022
[   82.223211] Call Trace:
[   82.223212]  <TASK>
[   82.223214]  dump_stack_lvl+0x48/0x60
[   82.223218]  spl_panic+0xf4/0x10c [spl d5e4e55912190c05565b79c74a4165e6160071f9]
[   82.223227]  spa_errlog_sync+0x2b7/0x2d0 [zfs 97fa2db2161ede37c6882f701ff1ae4988ad0056]
[   82.223299]  ? spa_change_guid_check+0xe0/0xe0 [zfs 97fa2db2161ede37c6882f701ff1ae4988ad0056]
[   82.223355]  ? spa_change_guid_check+0xe0/0xe0 [zfs 97fa2db2161ede37c6882f701ff1ae4988ad0056]
[   82.223405]  ? spa_sync+0x554/0xf70 [zfs 97fa2db2161ede37c6882f701ff1ae4988ad0056]
[   82.223455]  ? spa_txg_history_init_io+0x117/0x120 [zfs 97fa2db2161ede37c6882f701ff1ae4988ad0056]
[   82.223513]  ? txg_sync_thread+0x201/0x3a0 [zfs 97fa2db2161ede37c6882f701ff1ae4988ad0056]
[   82.223567]  ? txg_fini+0x260/0x260 [zfs 97fa2db2161ede37c6882f701ff1ae4988ad0056]
[   82.223618]  ? spl_taskq_fini+0x80/0x80 [spl d5e4e55912190c05565b79c74a4165e6160071f9]
[   82.223624]  ? thread_generic_wrapper+0x5e/0x70 [spl d5e4e55912190c05565b79c74a4165e6160071f9]
[   82.223629]  ? kthread+0xde/0x110
[   82.223631]  ? kthread_complete_and_exit+0x20/0x20
[   82.223633]  ? ret_from_fork+0x22/0x30
[   82.223636]  </TASK>
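
For readers triaging a trace like the one above, the numbers in the `failed (0 == 2)` suffix are the two values that were compared when the assertion fired. The following is a minimal sketch, not part of the original report, that pulls them out of the log line and looks up the non-zero value as an errno name. The helper name `decode_verify` is hypothetical, the pattern only covers `==` comparisons, and treating the value as a standard errno is an assumption.

```python
import errno
import re

# Matches an SPL assertion line such as:
#   VERIFY3(0 == dmu_object_free(...)) failed (0 == 2)
# Only the "==" form is handled; other comparison operators are ignored.
VERIFY_RE = re.compile(
    r"VERIFY3?\((?P<expr>.*)\) failed \((?P<lhs>-?\d+) == (?P<rhs>-?\d+)\)"
)


def decode_verify(line):
    """Return the asserted expression, both compared values, and an errno name.

    Hypothetical helper: assumes the non-zero operand is a standard errno.
    Returns None when the line contains no assertion of the expected shape.
    """
    match = VERIFY_RE.search(line)
    if match is None:
        return None
    lhs = int(match.group("lhs"))
    rhs = int(match.group("rhs"))
    code = rhs if lhs == 0 else lhs
    return {
        "expr": match.group("expr"),
        "lhs": lhs,
        "rhs": rhs,
        "errno": errno.errorcode.get(abs(code), "unknown"),
    }


line = (
    "[   82.223202] VERIFY3(0 == dmu_object_free(spa->spa_meta_objset, "
    "spa_err_obj, tx)) failed (0 == 2)"
)
print(decode_verify(line)["errno"])  # ENOENT
```

Under that assumption, the first line of the trace reads as `dmu_object_free()` returning ENOENT for the error-log object it was asked to free.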

Freewalkr, Mar 17 '23 15:03

I made the mistake of running ZFS trim. The trim never finished, and the system hung.

I can now only import the pool in read-only mode. Anything else hangs and never completes.

I can't cancel the trim, and I can't mount the pool read/write anymore. What are my options?

systemmonkey42, Apr 01 '23 05:04

I have a very similar situation: an arm64 6.6.44 kernel with zfs 2.2.4 that has been running for several years, through thick and thin. I activated a monthly trim job, and once it launched, it hung and made no progress, and now my import hangs. An import with readonly=on works fine. Is there no solution for this?
It seems that, at the very least, some serious warnings about using trim should be added!
Is there no way to cancel the trim operations when doing an import?

[ 3021.820408] INFO: task zpool:4297 blocked for more than 845 seconds.
[ 3021.826930]       Tainted: P         C O       6.6.44-1-rpi #1
[ 3021.832838] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3021.840713] task:zpool           state:D stack:0     pid:4297  ppid:1259   flags:0x0000000c
[ 3021.849077] Call trace:
[ 3021.851517]  __switch_to+0xe0/0x178
[ 3021.855014]  __schedule+0x37c/0xaf0
[ 3021.858504]  schedule+0x64/0x108
[ 3021.861733]  spl_panic+0x110/0x120 [spl]
[ 3021.865696]  vdev_trim_calculate_progress+0x34c/0x370 [zfs]
[ 3021.871772]  vdev_trim_load+0x38/0x150 [zfs]
[ 3021.876482]  vdev_trim_restart+0x128/0x250 [zfs]
[ 3021.881498]  vdev_trim_restart+0x54/0x250 [zfs]
[ 3021.886437]  vdev_trim_restart+0x54/0x250 [zfs]
[ 3021.891369]  spa_load+0x1648/0x1770 [zfs]
[ 3021.895762]  spa_load_best+0x5c/0x2b0 [zfs]
[ 3021.900332]  spa_import+0x1ec/0x608 [zfs]
[ 3021.904724]  zfs_ioc_pool_import+0x14c/0x178 [zfs]
[ 3021.909907]  zfsdev_ioctl_common+0x808/0x890 [zfs]
[ 3021.915094]  zfsdev_ioctl+0x70/0x108 [zfs]
[ 3021.919599]  __arm64_sys_ioctl+0xb4/0x100
[ 3021.923632]  invoke_syscall+0x50/0x120
[ 3021.927390]  el0_svc_common.constprop.0+0x48/0xf0
[ 3021.932098]  do_el0_svc+0x24/0x38
[ 3021.935416]  el0_svc+0x40/0xe8
[ 3021.938475]  el0t_64_sync_handler+0x120/0x130
[ 3021.942836]  el0t_64_sync+0x190/0x198
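
In a hung-task report like the one above, the frame listed directly after `spl_panic` is the function that called it, which is usually the most useful thing to quote when searching for related reports. The following is a rough sketch of extracting that frame from dmesg lines. The name `panic_caller` is hypothetical, and the pattern assumes the `func+0xoff/0xsize [module]` frame layout seen in the traces in this thread.

```python
import re

# One stack frame, e.g. "]  vdev_trim_load+0x38/0x150 [zfs]".
# An optional "? " prefix (unreliable frame) and a module name are allowed.
FRAME_RE = re.compile(
    r"\]\s+(?:\? )?(?P<func>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+))?"
)


def panic_caller(trace_lines):
    """Return (function, module) for the frame that follows spl_panic.

    Hypothetical helper: returns None if spl_panic is absent or is the
    last frame in the trace.
    """
    frames = []
    for line in trace_lines:
        match = FRAME_RE.search(line)
        if match:
            frames.append((match.group("func"), match.group("module")))
    for index, (func, _module) in enumerate(frames):
        if func == "spl_panic" and index + 1 < len(frames):
            return frames[index + 1]
    return None


trace = [
    "[ 3021.861733]  schedule+0x64/0x108",
    "[ 3021.861733]  spl_panic+0x110/0x120 [spl]",
    "[ 3021.865696]  vdev_trim_calculate_progress+0x34c/0x370 [zfs]",
    "[ 3021.871772]  vdev_trim_load+0x38/0x150 [zfs]",
]
print(panic_caller(trace))  # ('vdev_trim_calculate_progress', 'zfs')
```

Applied to the trace above, this points at `vdev_trim_calculate_progress`, reached from `spa_import` through `vdev_trim_restart` and `vdev_trim_load`. That suggests the blocked `zpool` task is sitting in a failed assertion while reloading the saved trim state.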

baslking, Aug 09 '24 14:08