ZTS reproducibly panics on `zfs_rename_014_neg` on ppc64el
### System information
| Type | Version/Name |
|---|---|
| Distribution Name | Debian |
| Distribution Version | 11 |
| Kernel Version | 5.15.41 |
| Architecture | ppc64el |
| OpenZFS Version | 2cd0f98f4 |
### Describe the problem you're observing
On trying to put the BLAKE3 PR through its paces, I hit a kernel panic on `zfs_rename_014_neg`. Confused, I tried running it against vanilla git, and lo, I had two panics.
### Describe how to reproduce the problem
`scripts/zfs-tests.sh -T zfs_rename` on a ppc64el system, AFAICT.
### Include any warning/errors/backtraces from the system logs
[ 913.019952] synth uevent: /devices/vio: failed to send uevent
[ 913.020021] vio vio: uevent: failed to send synthetic uevent
[ 913.341239] synth uevent: /devices/vio: failed to send uevent
[ 913.341316] vio vio: uevent: failed to send synthetic uevent
[ 913.668152] synth uevent: /devices/vio: failed to send uevent
[ 913.668200] vio vio: uevent: failed to send synthetic uevent
[ 914.001152] synth uevent: /devices/vio: failed to send uevent
[ 914.001231] vio vio: uevent: failed to send synthetic uevent
[ 926.728484] synth uevent: /devices/vio: failed to send uevent
[ 926.728594] vio vio: uevent: failed to send synthetic uevent
[ 927.736509] synth uevent: /devices/vio: failed to send uevent
[ 927.736576] vio vio: uevent: failed to send synthetic uevent
[ 958.949429] Kernel panic - not syncing: corrupted stack end detected inside scheduler
[ 958.949489] CPU: 1 PID: 134834 Comm: txg_sync Kdump: loaded Tainted: P OE 5.15.41-pristine #1
[ 958.949535] Call Trace:
[ 958.949548] [c000000071a8ac30] [c00000000078dbd0] dump_stack_lvl+0x74/0xa8 (unreliable)
[ 958.949592] [c000000071a8ac70] [c00000000013ada8] panic+0x154/0x3dc
[ 958.949623] [c000000071a8ad00] [c000000000c81a3c] __schedule+0xb9c/0xba0
[ 958.949656] [c000000071a8add0] [c000000000c81c84] __cond_resched+0x64/0x90
[ 958.949690] [c000000071a8ae00] [c000000000c853e8] down_read+0x28/0x110
[ 958.949721] [c000000071a8ae30] [c008000009b63ff8] dnode_hold_impl+0x120/0x14e0 [zfs]
[ 958.949808] [c000000071a8af00] [c008000009b3b680] dmu_bonus_hold+0x58/0xe0 [zfs]
[ 958.949889] [c000000071a8af50] [c008000009b7e3f8] dsl_dataset_hold_obj+0x60/0xae0 [zfs]
[ 958.949971] [c000000071a8b0d0] [c008000009b7ecb8] dsl_dataset_hold_obj+0x920/0xae0 [zfs]
[ 958.950054] [c000000071a8b250] [c008000009b47b90] dmu_objset_find_dp_impl+0x158/0x510 [zfs]
[ 958.950136] [c000000071a8b310] [c008000009b47de8] dmu_objset_find_dp_impl+0x3b0/0x510 [zfs]
[ 958.950218] [c000000071a8b3d0] [c008000009b47de8] dmu_objset_find_dp_impl+0x3b0/0x510 [zfs]
[ 958.950299] [c000000071a8b490] [c008000009b47de8] dmu_objset_find_dp_impl+0x3b0/0x510 [zfs]
[ 958.950382] [c000000071a8b550] [c008000009b47de8] dmu_objset_find_dp_impl+0x3b0/0x510 [zfs]
[ 958.950464] [c000000071a8b610] [c008000009b47de8] dmu_objset_find_dp_impl+0x3b0/0x510 [zfs]
[ 958.950553] [c000000071a8b6d0] [c008000009b47de8] dmu_objset_find_dp_impl+0x3b0/0x510 [zfs]
[ 958.950641] [c000000071a8b790] [c008000009b47de8] dmu_objset_find_dp_impl+0x3b0/0x510 [zfs]
[ 958.950728] [c000000071a8b850] [c008000009b4b1d8] dmu_objset_find_dp+0x110/0x2d0 [zfs]
[ 958.950815] [c000000071a8b940] [c008000009b94bc0] dsl_dir_rename_check+0x1e8/0x710 [zfs]
[ 958.950902] [c000000071a8b9f0] [c008000009ba92b0] dsl_sync_task_sync+0xb8/0x1c0 [zfs]
[ 958.950991] [c000000071a8ba30] [c008000009b97578] dsl_pool_sync+0x520/0x6a0 [zfs]
[ 958.951079] [c000000071a8bb20] [c008000009bd55f4] spa_sync+0x62c/0x11f0 [zfs]
[ 958.951166] [c000000071a8bc90] [c008000009c001bc] txg_sync_thread+0x2b4/0x450 [zfs]
[ 958.951249] [c000000071a8bd60] [c0080000014ac2d0] thread_generic_wrapper+0x98/0xd0 [spl]
[ 958.951293] [c000000071a8bda0] [c0000000001749e0] kthread+0x180/0x190
[ 958.951328] [c000000071a8be10] [c00000000000cf64] ret_from_kernel_thread+0x5c/0x64
[ 958.951382] Sending IPI to other CPUs
[ 958.952928] IPI complete
[ 958.955298] kexec: Starting switchover sequence.
It sure looks like we overran the kernel stack size on ppc64. Can you check what the default stack size is? On x86_64 it's 16K, and we've made a lot of changes over the years to make sure OpenZFS fits in that. If you can reproduce the issue and have CONFIG_STACK_TRACER enabled, the kernel provides an interface to dump the worst-case stack ever observed and the size of each frame. That should give you a good idea of where all the space has been used.
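For reference, a minimal sketch of how the ftrace stack tracer interface mentioned above can be queried, assuming a kernel built with CONFIG_STACK_TRACER=y and debugfs mounted at /sys/kernel/debug (paths and config options may differ on other setups):

```sh
# Enable the ftrace stack tracer; it records the deepest kernel
# stack usage observed while it is running.
echo 1 > /proc/sys/kernel/stack_tracer_enabled

# ...reproduce the issue, e.g.:
# scripts/zfs-tests.sh -T zfs_rename

# Maximum stack depth observed so far, in bytes.
cat /sys/kernel/debug/tracing/stack_max_size

# Per-frame breakdown of that worst-case stack.
cat /sys/kernel/debug/tracing/stack_trace

# The configured kernel stack size can usually be checked from the
# running config (requires CONFIG_IKCONFIG_PROC); on powerpc the
# stack is 2^THREAD_SHIFT bytes.
zcat /proc/config.gz | grep THREAD_SHIFT
```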
@rincebrain @behlendorf I think this issue can be closed, because in the early BLAKE3 patches I didn't limit the stack usage of the BLAKE3 update function... this was surely the reason for the panic.
...you do see that the backtrace of the overflowing stack has no BLAKE3 code anywhere in it? And it reproduces without the BLAKE3 PR?
Did you try it without BLAKE3? If so, I hadn't checked that, sorry. I have no real root access to ppc64 hardware, so I can't check this myself :/
It was meant as a hint that my first BLAKE3 patches ate a lot of stack... which was later fixed...
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.