kernel_compiler_patch
kernel_compiler_patch copied to clipboard
Safety concerns regarding compiler optimizations that introduce FPU/SIMD code that may bypass kernel_fpu_begin/end() guards designed to prevent the kernel from clobbering userspace FP/SIMD register context
Howdy!
Nifty patch set! I’d like to use it since I yak-shaved essentially the same thing for Rocket Lake a couple days ago before I found this repo.
I am concerned though that without adding extra feature disable flags like -mno-avx2 -mno-avx512f flags, the compilers might sneak in use of these instructions and bypass the kernel_fpu_begin/end() guards that, to my understanding, are used to ensure the kernel doesn’t clobber the userspace FPU/SIMD context. That is, in-kernel use of FPU/SIMD must be explicitly handled via these calls (or kernel_fpu_begin_mask()) to ensure the userspace context is saved and restored before return to userland. Other related identifiers are TIF_NEED_FPU_LOAD, fpregs_mark_activate(), switch_fpu_return(), __cpu_invalidate_fpregs_state(), etc.
And example of kernel code using AVX2 illustrating the careful preservation of userspace context can be found in nft_set_pipapo_avx2.c’s nft_pipapo_avx2_lookup().
While it would be cool to let the compiler run wild and utilize all the extensions available to it for a given µarch, I think that would require compiler modifications/plugins to call the appropriate kernel API’s when it wants to use instructions that modify the FPU/SIMD register context. That doesn’t exist today and it’s unclear to me if it can be correctly inferred where it would be safe to use given that it seems that use of those APIs disables preemption in at least some cases. Additionally, even if it was possible, it could end up being that many more context switches into the kernel would start saving/restoring the FPU/SIMD context and the overhead of that would negate any gains achieved by the use of the extensions in kernel mode.
Nevertheless, even if the use of -mo-xxx flags prevents use of automatic SIMD optimizations in the kernel, the -march flags should still provide some benefit from better scheduling models and use of other features like enabling fast rep movsb codegen that don’t clobber FPU/SIMD context.
Please reference my patch (modifications to cflags and rustflags in x86/Makefile and a modification to cflags in x86/Makefile_32.cpu that involves a make call tune that I haven’t dived into and groked yet) for what I hope is a safe way to implement these -march enhancements, at least for the rocketlake case (though I could very well be missing other necessary -mno-xxx flags!).
I'm afraid most of what you're citing above is over my head. I think I understand at a conceptual level, but not enough to know the implications of adding flags. @sirlucjan @AdelKS @montjoie ?
Hey!
If you check this line in the kernel's make file https://github.com/gregkh/linux/blob/master/arch/x86/Makefile#L70
#
# Prevent GCC from generating any FP code by mistake.
#
# This must happen before we try the -mpreferred-stack-boundary, see:
#
# https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53383
#
KBUILD_CFLAGS += -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx
KBUILD_RUSTFLAGS += -Ctarget-feature=-sse,-sse2,-sse3,-ssse3,-sse4.1,-sse4.2,-avx,-avx2
It is already adding the safeguards you are referring to, so yeah the patch in this repo does not actually end up enabling those the CPU instructions. This repo would've been flooded with issues long time ago haha.
Thanks for pointing that out, @AdelKS
I just went and checked the patch @jevinskie referenced and he is aware of those lines in the Makefile and adds some extra flags to disable newer AVX instructions (-mno-avx2 and friends...)
It doesn't seem to be documented in GCC's docs but -mno-avx disables every ulterior version of avx (that builds on top) and I think that's why it's the only flag disabled in the kernel's Makefile
Here's an example in Godbolt
I lost a comment mentioning I did a -march=native -Ofast -ftree-vectorize -save-temps build with a defconfig trying to goad gcc into being super aggressive and didn’t see any ymm or zmm registers generated. Still need to test with an allyes config. I had a suspicion that a -mno-avx might inhibit further SIMD usage as @AdelKS points out but I’ll have to dig deeper into gcc and clang to see if that’s truly/always the case.
Hey!
It is already adding the safeguards you are referring to, so yeah the patch in this repo does not actually end up enabling those the CPU instructions. This repo would've been flooded with issues long time ago haha.
I also figured some sort of issue would have popped up by now but I’m just a bit paranoid.
I think the Linux kernel is relying on the fact that -mno-avx disables all the other ones. Otherwise they would've updated those lines ? Or maybe that's what everyone is thinking, kernel devs included hehe
I think the Linux kernel is relying on the fact that
-mno-avxdisables all the other ones. Otherwise they would've updated those lines ? Or maybe that's what everyone is thinking, kernel devs included hehe
I was wondering how -mno-avx ended up in there when -march=core2 and -march=atom (alias for Bonnell) are the latest specific µarchs the kernel provides options for and they don’t have AVX. Turns out it was paranoia. Still digging into if -mno-EARLIERFEATURE implies all of the -mno-LATERFEATUREs in both GCC and Clang. I hope to have a definitive answer today since chasing down and adding all of them would be a pain (there are a lot) and it is a moving target. Not to mention old compilers will probably scoff at -mno-avx10.1 so you’d have to introduce build time make calls to see if they’re supported by the compiler (as used to be the case for -mno-avx before the minimum GCC/Clang versions were bumped and checking for support was unnecessary). Though I imagine users of this patch set tend to use newer compilers as well. ;)
We are also disabling avx2 and avx512 via the makefile for savety, since we provide x86-64-v3 and x86-64-v4 kernels: https://github.com/CachyOS/linux/commit/9916767f53b783eeb0cda00e29354afb7a531044
We havent noticed any performance degradation from this, yet.