Can't import NVME root pool after trim and scrub
System information
| Type | Version/Name |
|---|---|
| Distribution Name | Arch Linux |
| Distribution Version | LTS |
| Kernel Version | 6.1.19-1 |
| Architecture | x86_64 |
| OpenZFS Version | zfs-2.1.99-1, ahrens-raidz-expand branch |
Describe the problem you're observing
I'm using the ZFS version from the zfs-dkms-raidz-expansion-git AUR package and have root on ZFS on an NVMe drive. The NVMe is not part of any RAID-Z; it's the only drive in the pool. I hadn't run a scrub on this pool for about half a year, and I upgraded to the raidz-expand branch about 3 months ago.
Yesterday I deleted some big files in the root pool, created and deleted a couple of snapshots, ran a trim manually (it completed without errors), and then ran a scrub on the pool. While the scrub was in progress, zpool status responded fine, but after the scrub finished, zpool status hung in the "uninterruptible sleep" state. I didn't look into the logs, waited 12 hours, and rebooted, but the system didn't come back.
I booted Manjaro 6.1.19-1 installed on another drive, with the same ZFS version, and tried to import the root pool from the NVMe drive. It went into uninterruptible sleep too, and there's a verification panic. The dmesg output is attached below.
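For context, this is roughly how the import was attempted from the Manjaro system; it's only a sketch, and the pool name `rpool` is a placeholder since the real name isn't shown here:

```sh
# Hypothetical sketch of the import attempt; "rpool" is a placeholder pool name.
zpool import            # scan: the pool shows up as importable
zpool import -N rpool   # hangs in uninterruptible sleep; panic shown in dmesg below
```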
Describe how to reproduce the problem
- Use the `ahrens-raidz-expand` branch (maybe it's specific to the `zfs-dkms-raidz-expansion-git` package?).
- Upgrade the pool on the NVMe drive with the new features.
- Wait for about 3 months.
- Delete some files, run a trim manually.
- Run a scrub (a command-level sketch of these steps follows below).
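A rough command-level version of the steps above; the pool name `rpool`, the paths, and the snapshot name are placeholders, not taken from my setup:

```sh
# All names below are placeholders; only the order of operations matters.
zpool upgrade rpool              # enable the new feature flags after switching branches
# ... about 3 months of normal use ...
rm /path/to/some/big/files       # delete some large files
zfs snapshot rpool/ROOT@tmp      # create and delete a couple of snapshots
zfs destroy rpool/ROOT@tmp
zpool trim rpool                 # completes without errors
zpool scrub rpool                # zpool status hangs once the scrub finishes
```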
Include any warning/errors/backtraces from the system logs
dmesg output when trying to import the pool:
[ 82.223202] VERIFY3(0 == dmu_object_free(spa->spa_meta_objset, spa_err_obj, tx)) failed (0 == 2)
[ 82.223205] PANIC at spa_errlog.c:1068:delete_errlog()
[ 82.223207] Showing stack for process 8492
[ 82.223208] CPU: 5 PID: 8492 Comm: txg_sync Tainted: P OE 6.1.19-1-MANJARO #1 389fdd3d7a99644f7437b855eb50ad5703eff2d2
[ 82.223210] Hardware name: Gigabyte Technology Co., Ltd. X570S AERO G/X570S AERO G, BIOS F4c 05/12/2022
[ 82.223211] Call Trace:
[ 82.223212] <TASK>
[ 82.223214] dump_stack_lvl+0x48/0x60
[ 82.223218] spl_panic+0xf4/0x10c [spl d5e4e55912190c05565b79c74a4165e6160071f9]
[ 82.223227] spa_errlog_sync+0x2b7/0x2d0 [zfs 97fa2db2161ede37c6882f701ff1ae4988ad0056]
[ 82.223299] ? spa_change_guid_check+0xe0/0xe0 [zfs 97fa2db2161ede37c6882f701ff1ae4988ad0056]
[ 82.223355] ? spa_change_guid_check+0xe0/0xe0 [zfs 97fa2db2161ede37c6882f701ff1ae4988ad0056]
[ 82.223405] ? spa_sync+0x554/0xf70 [zfs 97fa2db2161ede37c6882f701ff1ae4988ad0056]
[ 82.223455] ? spa_txg_history_init_io+0x117/0x120 [zfs 97fa2db2161ede37c6882f701ff1ae4988ad0056]
[ 82.223513] ? txg_sync_thread+0x201/0x3a0 [zfs 97fa2db2161ede37c6882f701ff1ae4988ad0056]
[ 82.223567] ? txg_fini+0x260/0x260 [zfs 97fa2db2161ede37c6882f701ff1ae4988ad0056]
[ 82.223618] ? spl_taskq_fini+0x80/0x80 [spl d5e4e55912190c05565b79c74a4165e6160071f9]
[ 82.223624] ? thread_generic_wrapper+0x5e/0x70 [spl d5e4e55912190c05565b79c74a4165e6160071f9]
[ 82.223629] ? kthread+0xde/0x110
[ 82.223631] ? kthread_complete_and_exit+0x20/0x20
[ 82.223633] ? ret_from_fork+0x22/0x30
[ 82.223636] </TASK>
I made the mistake of running a ZFS trim. The trim never finished, and the system hung.
I can now only import the pool in read-only mode. Anything else hangs and never completes.
I can't cancel the trim, and I can't mount the pool read/write anymore. What are my options?
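For reference, this is what still works versus what hangs on my side; the pool name `tank` is a placeholder:

```sh
zpool import -o readonly=on tank   # works; data is readable
zpool import tank                  # read/write import hangs and never completes
zpool trim -c tank                 # would cancel the trim, but needs a writable import first
```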
I have a very similar situation.
An arm64 6.6.44 kernel with ZFS 2.2.4 that has been running for several years, through thick and thin. I activated a monthly trim job, and once it launched, it hung and made no progress; now my import hangs too. A readonly=on import works fine.
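The trim job itself is nothing special; it amounts to something like the crontab entry below (the pool name `tank` is a placeholder, and the real job may well be a systemd timer instead):

```
# Run "zpool trim" on the first day of every month at 03:00.
0 3 1 * * /usr/sbin/zpool trim tank
```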
Is there no solution for this?
At the very least, it seems some serious warnings about using trim are in order!
Is there no way to cancel the trim operations when doing an import?
[ 3021.820408] INFO: task zpool:4297 blocked for more than 845 seconds.
[ 3021.826930] Tainted: P C O 6.6.44-1-rpi #1
[ 3021.832838] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3021.840713] task:zpool state:D stack:0 pid:4297 ppid:1259 flags:0x0000000c
[ 3021.849077] Call trace:
[ 3021.851517] __switch_to+0xe0/0x178
[ 3021.855014] __schedule+0x37c/0xaf0
[ 3021.858504] schedule+0x64/0x108
[ 3021.861733] spl_panic+0x110/0x120 [spl]
[ 3021.865696] vdev_trim_calculate_progress+0x34c/0x370 [zfs]
[ 3021.871772] vdev_trim_load+0x38/0x150 [zfs]
[ 3021.876482] vdev_trim_restart+0x128/0x250 [zfs]
[ 3021.881498] vdev_trim_restart+0x54/0x250 [zfs]
[ 3021.886437] vdev_trim_restart+0x54/0x250 [zfs]
[ 3021.891369] spa_load+0x1648/0x1770 [zfs]
[ 3021.895762] spa_load_best+0x5c/0x2b0 [zfs]
[ 3021.900332] spa_import+0x1ec/0x608 [zfs]
[ 3021.904724] zfs_ioc_pool_import+0x14c/0x178 [zfs]
[ 3021.909907] zfsdev_ioctl_common+0x808/0x890 [zfs]
[ 3021.915094] zfsdev_ioctl+0x70/0x108 [zfs]
[ 3021.919599] __arm64_sys_ioctl+0xb4/0x100
[ 3021.923632] invoke_syscall+0x50/0x120
[ 3021.927390] el0_svc_common.constprop.0+0x48/0xf0
[ 3021.932098] do_el0_svc+0x24/0x38
[ 3021.935416] el0_svc+0x40/0xe8
[ 3021.938475] el0t_64_sync_handler+0x120/0x130
[ 3021.942836] el0t_64_sync+0x190/0x198