Linux: zpl_tmpfile: ensure ACL init succeeds before d_tmpfile
Motivation and Context
Other than #16612, this is the only place I see where trouble similar to #16608 could arise. One hint pointing to tmpfile is that the crashes are now way more frequent with 6.10 kernel, which introduced O_TMPFILE support for overlayfs, which agrees with that with our workload, if there's a general bug with O_TMPFILE, we'd be hitting it lot more often.
I'm not sure all the supported kernel versions will be okay with these modifications, I'm posting this as a draft to let the bots chew on it to find out.
Description
Linux: zpl_tmpfile: ensure ACL init succeeds before d_tmpfile
Inode should be linked to dentry and hashed only after both
zpl_xattr_security_init and zpl_init_acl were successful, which
apparently isn't always true for example when memory is tight.
How Has This Been Tested?
Still waiting for vpsAdminOS container templates build to complete, if it does, then this fixes #16608. The major PITA here is that it takes 8.5hrs for it to fail currently :D
Types of changes
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Performance enhancement (non-breaking change which improves efficiency)
- [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
- [ ] Documentation (a change to man pages or other documentation)
Checklist:
- [x] My code follows the OpenZFS code style requirements.
- [ ] I have updated the documentation accordingly.
- [x] I have read the contributing document.
- [ ] I have added tests to cover my changes.
- [ ] I have run the ZFS Test Suite with this change applied.
- [x] All commit messages are properly formatted and contain
Signed-off-by.
As it turns out, this along with #16612 is insufficient, now I see the VERIFY tripped, the igrab in zget still can work out to NULL. So it's not fixing my problem :-D
cannot open 'testpool': pool I/O is currently suspended <- this is interesting, but I think unrelated to what I'm doing in the PR? will force-push to run the tests again
Oh I like this one, from alma9: (fwiw I think it's unrelated but hits at dnode dirty detection that still is AFAIK unsolved)
[ 3768.354941] ZTS run /usr/share/zfs/zfs-tests/tests/functional/cp_files/cp_stress
[ 3887.080148] (node->next == LIST_POISON1) is equivalent to (node->prev == LIST_POISON2)
[ 3887.080152] PANIC at list.h:187:list_link_active()
[ 3887.084495] Showing stack for process 566921
[ 3887.086033] CPU: 2 PID: 566921 Comm: seekflood Tainted: P OE ------- --- 5.14.0-427.37.1.el9_4.x86_64 #1
[ 3887.089643] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 3887.094929] Call Trace:
[ 3887.096030] <TASK>
[ 3887.097275] dump_stack_lvl+0x34/0x48
[ 3887.098626] spl_panic+0xd1/0xe9 [spl]
[ 3887.100051] ? srso_alias_return_thunk+0x5/0xfbef5
[ 3887.101621] ? srso_alias_return_thunk+0x5/0xfbef5
[ 3887.103386] ? srso_alias_return_thunk+0x5/0xfbef5
[ 3887.104841] ? dbuf_rele_and_unlock+0x270/0x520 [zfs]
[ 3887.112559] ? srso_alias_return_thunk+0x5/0xfbef5
[ 3887.114002] ? srso_alias_return_thunk+0x5/0xfbef5
[ 3887.116556] ? srso_alias_return_thunk+0x5/0xfbef5
[ 3887.119314] ? srso_alias_return_thunk+0x5/0xfbef5
[ 3887.123656] ? dnode_hold_impl+0x8df/0x10c0 [zfs]
[ 3887.127820] spl_assert+0x17/0x20 [zfs]
[ 3887.131812] list_link_active+0x53/0x70 [zfs]
[ 3887.136131] dnode_is_dirty+0x5d/0x190 [zfs]
[ 3887.140516] dmu_offset_next+0x67/0x210 [zfs]
[ 3887.145355] zfs_holey_common+0xa0/0x1a0 [zfs]
[ 3887.149392] zfs_holey+0x4b/0x80 [zfs]
[ 3887.153150] zpl_llseek+0x89/0xc0 [zfs]
[ 3887.157083] ksys_lseek+0x68/0xb0
[ 3887.159401] do_syscall_64+0x5c/0x90
[ 3887.162380] ? srso_alias_return_thunk+0x5/0xfbef5
[ 3887.166402] ? syscall_exit_work+0x103/0x130
[ 3887.170262] ? srso_alias_return_thunk+0x5/0xfbef5
[ 3887.173055] ? syscall_exit_to_user_mode+0x22/0x40
[ 3887.177277] ? srso_alias_return_thunk+0x5/0xfbef5
[ 3887.182171] ? do_syscall_64+0x69/0x90
[ 3887.184482] ? do_syscall_64+0x69/0x90
[ 3887.187779] ? do_syscall_64+0x69/0x90
[ 3887.190170] ? syscall_exit_to_user_mode+0x22/0x40
[ 3887.193358] ? srso_alias_return_thunk+0x5/0xfbef5
[ 3887.195979] ? do_syscall_64+0x69/0x90
[ 3887.197177] ? do_syscall_64+0x69/0x90
[ 3887.198384] ? sysvec_apic_timer_interrupt+0x3c/0x90
[ 3887.199916] entry_SYSCALL_64_after_hwframe+0x72/0xdc
Did this end up fixing your crashes?
@satmandu nope, current progress here https://github.com/openzfs/zfs/issues/16324#issuecomment-2445472560