Stop UB complaints and autovectorization of scalar fletcher4 implementation
(I'll rebase with a better description before merge, assuming nobody has a reason this is a terrible idea, but since we're already doing the opposite cast as input in a number of places, it seems no worse than the status quo ante...)
### Motivation and Context
Getting rid of "shhh, nobody look, it's fine, we promise" excludes in the code is always good, plus with gcc now defaulting to auto-vectorization at -O2, the results can be... explosive (#13605, #13620).
### Description
When we cast the input to `fletcher_4_scalar_native` or friends to `fletcher_4_ctx_t`, we're promising, as far as the compiler is concerned, that the object is up to 64-byte (!) aligned. Because we never actually guarantee that, auto-vectorizing the code results in aligned writes to the object, which will sometimes crash (in userland; since the kernel has big flashing NO DON'T signs around auto-vectorization, there it's probably just going to upset sanitizers).
But it turns out that everywhere we explicitly call the scalar implementation, we're just casting a `zio_cksum_t *` to a `fletcher_4_ctx_t *`, so if that's ever incorrect behavior, we're going to crash horrifically a dozen ways to Sunday anyway.
I just dropped the forced alignment annotations, because it turns out all the implementations are written entirely with unaligned access instructions anyway; if we really think that feeding aligned data to unaligned instructions is more performant in some spots, we can assert alignment just there, rather than claiming it's aligned everywhere and then lying a few times.
### How Has This Been Tested?
Before this change, building with `-ftree-vectorize -march=znver2` produced a `zfs` binary that crashed about half the time on trying to send, on both my 8700k and my 5900X, and removing the UBSan exceptions resulted in erroring out hard immediately on send, with or without `-ftree-vectorize`.

After this change, with or without `-ftree-vectorize`, UBSan has, as far as I can see, no complaints about these functions, and I haven't made it crash again.
### Types of changes
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Performance enhancement (non-breaking change which improves efficiency)
- [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
- [ ] Documentation (a change to man pages or other documentation)
### Checklist:
- [x] My code follows the OpenZFS code style requirements.
- [ ] I have updated the documentation accordingly.
- [x] I have read the contributing document.
- [ ] I have added tests to cover my changes.
- [x] I have run the ZFS Test Suite with this change applied.
- [x] All commit messages are properly formatted and contain `Signed-off-by`.
I've pushed a version that simply has no alignment requirement, after both checking and testing that the implementations use unaligned instructions everywhere.
Apparently you can tell NEON that you promise an access is aligned, and it'll then optimize based on that and fault if you lied, but we don't appear to use that.
So, as far as I can tell, the only ways to tell Clang not to do this are either `#pragma clang loop vectorize(disable)` (which I can't turn into a preprocessor macro) or a long, platform-specific sea of `__attribute__((target("no-sse,no-sse2,...")))`, and the latter isn't directly what I mean anyway.
So I guess Clang is why we can't have nice things today. Time to just mark the entire file with `-fno-vectorize` for Clang and `-fno-tree-vectorize` for gcc, and if people override that, they get to keep all the pieces...
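A rough sketch of what that per-compiler flag selection looks like (the real build wires this through autoconf; the variable name here is made up, though both flags are real):

```shell
#!/bin/sh
# Pick the "don't auto-vectorize this file" flag for the compiler in use:
# Clang spells it -fno-vectorize, gcc spells it -fno-tree-vectorize.
CC=${CC:-cc}
if "$CC" --version 2>/dev/null | grep -qi clang; then
	NOVEC_CFLAGS="-fno-vectorize"
else
	NOVEC_CFLAGS="-fno-tree-vectorize"
fi
echo "$NOVEC_CFLAGS"
```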
> Apparently you can tell NEON that you promise an access is aligned and it'll then optimize based on it and fault if you lied, but we don't appear to.
I've never seen gcc emit that decoration in the assembly, even when it had absolute certainty of alignment. I have seen the performance benefit of aligned access, but it was always automatic. I suspect much later implementations of aarch64 don't need it.
Unfortunately, it looks like this breaks the build on CentOS 7 and the old ppc64 builder due to the old gcc version there. Seems like a configure check will be needed for this:
```
gcc: error: unrecognized command line option '-mgeneral-regs-only'
```
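Such a configure check boils down to compiling an empty program with the flag; a minimal shell sketch of roughly what an autoconf compile probe (e.g. `AC_COMPILE_IFELSE`) would do:

```shell
#!/bin/sh
# Probe whether $CC accepts -mgeneral-regs-only; older gcc (such as the
# one on CentOS 7) rejects the flag, newer gcc and Clang accept it.
CC=${CC:-cc}
tmpdir=$(mktemp -d) || exit 1
trap 'rm -rf "$tmpdir"' EXIT
echo 'int main(void) { return 0; }' > "$tmpdir/conftest.c"
if "$CC" -mgeneral-regs-only -c "$tmpdir/conftest.c" \
    -o "$tmpdir/conftest.o" 2>/dev/null; then
	have_general_regs_only=yes
else
	have_general_regs_only=no
fi
echo "have -mgeneral-regs-only: $have_general_regs_only"
```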
Cute. I'll give it a go later.
Presently, there's no PPC-accelerated fletcher4 at all, and the alignment requirements were imposed per-arch based on what got included in the union definition, so this only affects x86 and ARM at the moment.
- Rich
On Sun, Jul 17, 2022, Adam Stylinski commented on this pull request:
In lib/libzfs/libzfs_sendrecv.c (https://github.com/openzfs/zfs/pull/13631#discussion_r922932481):

```diff
@@ -2060,12 +2060,12 @@ send_prelim_records(zfs_handle_t *zhp, const char *from, int fd,
 	int err = 0;
 	char *packbuf = NULL;
 	size_t buflen = 0;
-	zio_cksum_t zc = { {0} };
+	zio_cksum_t *zc = calloc(sizeof (zio_cksum_t), 1);
```
So, for an infrequent enough access, I don't think it's much of a win anymore on newer CPUs. For AARCH64, particularly on "little" cores, aligned access is still quite a bit faster than unaligned. I get the impression this checksum is only written back to memory after summing a block's worth of bytes, so it's probably not going to be significant either way.
Now, on PowerPC, particularly the older, pre-POWER7 variants, alignment is a requirement, or it takes two loads and a permute using a permutation vector returned from vec_lvsl. I'm not sure how far back OpenZFS's PowerPC support goes, and whether it needs the VSX extensions or also supports the lesser VMX/AltiVec ones.