ZTS reproducibly panics on `zfs_rename_014_neg` on ppc64el
### System information
| Type | Version/Name |
|---|---|
| Distribution Name | Debian |
| Distribution Version | 11 |
| Kernel Version | 5.15.41 |
| Architecture | ppc64el |
| OpenZFS Version | 2cd0f98f4 |
### Describe the problem you're observing
On trying to put the BLAKE3 PR through its paces, I hit a kernel panic on `zfs_rename_014_neg`. Confused, I tried running it against vanilla git, and lo, I had two panics.
### Describe how to reproduce the problem
`scripts/zfs-tests.sh -T zfs_rename` on a ppc64el system, AFAICT.
### Include any warning/errors/backtraces from the system logs
[ 913.019952] synth uevent: /devices/vio: failed to send uevent
[ 913.020021] vio vio: uevent: failed to send synthetic uevent
[ 913.341239] synth uevent: /devices/vio: failed to send uevent
[ 913.341316] vio vio: uevent: failed to send synthetic uevent
[ 913.668152] synth uevent: /devices/vio: failed to send uevent
[ 913.668200] vio vio: uevent: failed to send synthetic uevent
[ 914.001152] synth uevent: /devices/vio: failed to send uevent
[ 914.001231] vio vio: uevent: failed to send synthetic uevent
[ 926.728484] synth uevent: /devices/vio: failed to send uevent
[ 926.728594] vio vio: uevent: failed to send synthetic uevent
[ 927.736509] synth uevent: /devices/vio: failed to send uevent
[ 927.736576] vio vio: uevent: failed to send synthetic uevent
[ 958.949429] Kernel panic - not syncing: corrupted stack end detected inside scheduler
[ 958.949489] CPU: 1 PID: 134834 Comm: txg_sync Kdump: loaded Tainted: P OE 5.15.41-pristine #1
[ 958.949535] Call Trace:
[ 958.949548] [c000000071a8ac30] [c00000000078dbd0] dump_stack_lvl+0x74/0xa8 (unreliable)
[ 958.949592] [c000000071a8ac70] [c00000000013ada8] panic+0x154/0x3dc
[ 958.949623] [c000000071a8ad00] [c000000000c81a3c] __schedule+0xb9c/0xba0
[ 958.949656] [c000000071a8add0] [c000000000c81c84] __cond_resched+0x64/0x90
[ 958.949690] [c000000071a8ae00] [c000000000c853e8] down_read+0x28/0x110
[ 958.949721] [c000000071a8ae30] [c008000009b63ff8] dnode_hold_impl+0x120/0x14e0 [zfs]
[ 958.949808] [c000000071a8af00] [c008000009b3b680] dmu_bonus_hold+0x58/0xe0 [zfs]
[ 958.949889] [c000000071a8af50] [c008000009b7e3f8] dsl_dataset_hold_obj+0x60/0xae0 [zfs]
[ 958.949971] [c000000071a8b0d0] [c008000009b7ecb8] dsl_dataset_hold_obj+0x920/0xae0 [zfs]
[ 958.950054] [c000000071a8b250] [c008000009b47b90] dmu_objset_find_dp_impl+0x158/0x510 [zfs]
[ 958.950136] [c000000071a8b310] [c008000009b47de8] dmu_objset_find_dp_impl+0x3b0/0x510 [zfs]
[ 958.950218] [c000000071a8b3d0] [c008000009b47de8] dmu_objset_find_dp_impl+0x3b0/0x510 [zfs]
[ 958.950299] [c000000071a8b490] [c008000009b47de8] dmu_objset_find_dp_impl+0x3b0/0x510 [zfs]
[ 958.950382] [c000000071a8b550] [c008000009b47de8] dmu_objset_find_dp_impl+0x3b0/0x510 [zfs]
[ 958.950464] [c000000071a8b610] [c008000009b47de8] dmu_objset_find_dp_impl+0x3b0/0x510 [zfs]
[ 958.950553] [c000000071a8b6d0] [c008000009b47de8] dmu_objset_find_dp_impl+0x3b0/0x510 [zfs]
[ 958.950641] [c000000071a8b790] [c008000009b47de8] dmu_objset_find_dp_impl+0x3b0/0x510 [zfs]
[ 958.950728] [c000000071a8b850] [c008000009b4b1d8] dmu_objset_find_dp+0x110/0x2d0 [zfs]
[ 958.950815] [c000000071a8b940] [c008000009b94bc0] dsl_dir_rename_check+0x1e8/0x710 [zfs]
[ 958.950902] [c000000071a8b9f0] [c008000009ba92b0] dsl_sync_task_sync+0xb8/0x1c0 [zfs]
[ 958.950991] [c000000071a8ba30] [c008000009b97578] dsl_pool_sync+0x520/0x6a0 [zfs]
[ 958.951079] [c000000071a8bb20] [c008000009bd55f4] spa_sync+0x62c/0x11f0 [zfs]
[ 958.951166] [c000000071a8bc90] [c008000009c001bc] txg_sync_thread+0x2b4/0x450 [zfs]
[ 958.951249] [c000000071a8bd60] [c0080000014ac2d0] thread_generic_wrapper+0x98/0xd0 [spl]
[ 958.951293] [c000000071a8bda0] [c0000000001749e0] kthread+0x180/0x190
[ 958.951328] [c000000071a8be10] [c00000000000cf64] ret_from_kernel_thread+0x5c/0x64
[ 958.951382] Sending IPI to other CPUs
[ 958.952928] IPI complete
[ 958.955298] kexec: Starting switchover sequence.
It sure looks like we overran the kernel stack size on ppc64. Can you check what the default stack size is? On x86_64 it's 16K, and we've made a lot of changes over the years to make sure OpenZFS fits in that. If you can reproduce the issue and have CONFIG_STACK_TRACER enabled, the kernel provides an interface to dump the worst-case stack ever observed and the size of each frame. That should give you a good idea of where all the space has been used.
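For reference, a minimal sketch of how the ftrace stack tracer interface mentioned above can be queried, assuming a kernel built with CONFIG_STACK_TRACER=y and debugfs mounted at /sys/kernel/debug (paths and config options may differ on other setups):

```sh
# Enable the ftrace stack tracer; it records the deepest kernel
# stack usage observed while it is running.
echo 1 > /proc/sys/kernel/stack_tracer_enabled

# ...reproduce the issue, e.g.:
# scripts/zfs-tests.sh -T zfs_rename

# Maximum stack depth observed so far, in bytes.
cat /sys/kernel/debug/tracing/stack_max_size

# Per-frame breakdown of that worst-case stack.
cat /sys/kernel/debug/tracing/stack_trace

# The configured kernel stack size can usually be checked from the
# running config (requires CONFIG_IKCONFIG_PROC); on powerpc the
# stack is 2^THREAD_SHIFT bytes.
zcat /proc/config.gz | grep THREAD_SHIFT
```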
@rincebrain @behlendorf I think this issue can be closed, because in the early BLAKE3 patches I didn't limit the stack usage of the BLAKE3 update function... this was surely the reason for the panic.
...you do see that the backtrace of the overflowing stack has no BLAKE3 code anywhere in it? And it reproduces without the BLAKE3 PR?
Did you try it without BLAKE3? If so, I hadn't checked that, sorry. I have no real root access to ppc64 hardware, so I can't check this myself :/
It was meant as a hint that my first BLAKE3 patches ate a lot of stack... which was later fixed...
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.