Integration of Clear Linux patches
Clear Linux maintains a number of performance-related patches for open source projects. There are quite a few: https://github.com/clearlinux-pkgs
It would be interesting to integrate these into GentooLTO somehow, either synced into the overlay or through a user install mechanism.
I didn't look thoroughly, but if I'm seeing it right, these are just patches in their git repo? It would be very easy to put the patches into /etc/portage/patches, is there some downside to this?
It appears that way, however I'm guessing they also fine tune compiler options as well. Also, I'm unsure if they use icc or gcc for their builds. I'm guessing it depends on the package. At the very least, we could include the source code patches, I think.
Example: https://github.com/clearlinux-pkgs/fftw/blob/master/fftw.spec
Interesting flag: -fno-semantic-interposition. I'll be adding this one to the defaults I think.
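For anyone wondering what this flag changes, here is a minimal sketch. It assumes the code is built as a shared object with -fPIC; the file and function names are made up for illustration. By default GCC has to assume an exported function could be replaced at load time (for example via LD_PRELOAD), so it avoids inlining it into callers in the same library.

```c
/* interpose.c - hypothetical names, for illustration only.
 * Build as a shared object to compare the generated code, e.g.:
 *   gcc -O2 -fPIC -shared interpose.c
 *   gcc -O2 -fPIC -shared -fno-semantic-interposition interpose.c
 */

/* Exported symbol: with ELF interposition another definition
 * (say, from LD_PRELOAD) could replace this one at load time. */
int answer(void)
{
    return 42;
}

/* Without -fno-semantic-interposition GCC has to assume answer() may be
 * interposed, so in a -fPIC build it keeps a real call here.
 * With the flag it is allowed to inline the local definition. */
int twice_answer(void)
{
    return 2 * answer();
}
```

Comparing the disassembly of twice_answer from the two builds should show the difference. The result of the call is the same either way as long as nothing actually interposes answer().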
Other interesting things: they appear to enable -ffast-math on certain packages, or flags that -ffast-math enables at the very least. Also interesting: they force all functions to be aligned on 32 byte boundaries with -falign-functions=32. I wonder if they do that for AVX512 compatibility? It'd be interesting to see what the benefits are for overaligning functions, if any.
It appears that -falign-functions=32 has some kind of impact on autovectorization. By default on x86_64, it is set to 16.
I just tested an architecture that has AVX512 instructions:
> gcc -march=skylake-avx512 -flto -Ofast -Q --help=optimizer | grep falign-functions
-falign-functions [disabled]
-falign-functions= 16
It seems that by default on x86_64, -falign-functions is always set to 16 (as long as -O2 or higher is specified).
Phoronix claims that they build using GCC/Clang (differs from package to package).
Makes sense - icc probably doesn't support a lot of the GNU extensions that are used out there in the wild. Not to mention, GCC is highly competitive with ICC when the right options are used. I'm thinking I may update the recommendations for -falign-functions, not adding it by default but mentioning it in make.conf.lto. The reason being, I want to support more than just Intel processors, or even just x86_64, and this feels very much like an Intel-specific thing.
I heard that Solus is also using some of their optimizations.
@InBetweenNames one thing that clear linux does that isn't really necessary for us is function multiversioning. I think the linker links different functions based on eg. AVX support. since everyone here is probably building with -march=native we can get smaller binaries than Clear can, may have better LTO opportunities, etc.
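For reference, one way GCC exposes function multiversioning is the target_clones attribute. The sketch below is only an illustration: dot_product and the MULTIVERSIONED macro are made-up names, and whether Clear Linux uses exactly this mechanism for a given package is an assumption. The guard is there because the attribute needs ifunc support, which is assumed here to mean GCC on x86-64 with glibc.

```c
#include <stddef.h>
#include <stdlib.h>   /* pulls in __GLIBC__ on glibc systems */

/* target_clones is a GCC extension that needs ifunc support from the C
 * library, so it is only enabled here for GCC on x86-64 with glibc.
 * Elsewhere the macro expands to nothing and a single version is built. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__) && defined(__GLIBC__)
#define MULTIVERSIONED __attribute__((target_clones("avx2", "default")))
#else
#define MULTIVERSIONED
#endif

/* dot_product is a made-up example function. GCC emits one copy per listed
 * target plus a resolver that picks a copy for the running CPU. */
MULTIVERSIONED
double dot_product(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

With -march=native everywhere, only one copy is needed, which is where the smaller binaries come from.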
It would be interesting to see if we can get Michael to bench lto-overlay vs. Clear, esp. once we steal some of their fancy tricks.
regarding -falign-functions I'd expect this to be about aligned jumps/instruction cache line reads. maybe RIP relative addressing? but once you are executing in the function instruction alignment is going to be off no matter what you do thanks to variable length instructions. even if AVX cared about instruction alignment, I'm not sure this would help.
is it possible that there is a more compact way to load immediates if they are 32-aligned, so function call sites are smaller?
Agreed -- in fact we should do even better than function multi versioning since we're compiling our system exactly tailored for the system it's running on. This means more opportunities for LTO all around. Not to mention, this should be highly portable across many architectures.
If one could link their system using mainly static libraries, I bet the LTO benefits would be even more profound. You can link-optimize across static library boundaries, and you can't do that with shared objects. I don't believe this is possible as-is however, since Portage seems to really prefer shared objects, and configure scripts, etc, also prefer shared objects.
I was wondering the same about -falign-functions today, but there are other -falign-* flags that affect those other cases you mention. I hadn't considered immediate operands however. It might make a difference if there's some static storage for a function as well. I've been looking around all day for more uses of -falign-functions=32 and I've been having serious trouble. I found a slide deck:
http://hpac.rwth-aachen.de/teaching/sem-accg-16/slides/08.Schmitz-GGC_Autovec.pdf
I also found a StackOverflow question that indirectly touches on it:
https://stackoverflow.com/questions/19470873/why-does-gcc-generate-15-20-faster-code-if-i-optimize-for-size-instead-of-speed
If I pass g++ -O2 -falign-functions=16 -falign-loops=16 then everything is back to normal: I always get the fastest case and the time isn't sensitive to the -fno-omit-frame-pointer flag anymore. I can pass g++ -O2 -falign-functions=32 -falign-loops=32 or any multiples of 16, the code is not sensitive to that either.
Without delving in the GCC internals, I can't find many resources that recommend this flag. I'll see if the Intel guys will shed some light on it.
More goodies: -fno-common
Line 481: https://github.com/clearlinux/autospec/blame/master/autospec/specfiles.py
In that commit, -fno-common is mentioned.
Detailed here: https://gcc.gnu.org/onlinedocs/gcc/Code-Gen-Options.html -- seems beneficial when it would work.
I find it interesting they enable -fno-math-errno, -fno-trapping-math by default. It's not -ffast-math, but it's partway there.
> gcc -march=skylake-avx512 -flto -Ofast -Q --help=common | grep fcommon
-fcommon [enabled]
Even with the most aggressive optimization package, this is on by default.
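A small sketch of what -fno-common affects, in case it helps. The names shared_counter and bump_counter are invented for the example. The pattern shown (one declaration, exactly one definition) should build with either setting, while code that relies on merged tentative definitions would fail to link under -fno-common.

```c
/* With -fcommon, a file-scope declaration without an initializer, such as
 *
 *     int shared_counter;
 *
 * is a "tentative definition" placed in a common block, and the linker
 * merges same-named ones from different object files. With -fno-common
 * it becomes an ordinary definition, so a second copy in another file is
 * a multiple-definition link error.
 *
 * The pattern below works either way: one declaration (normally in a
 * header) and exactly one definition. */

extern int shared_counter;   /* declaration, would live in a header */

int shared_counter = 0;      /* the single definition, in one .c file */

int bump_counter(void)
{
    return ++shared_counter;
}
```

Packages that define the same uninitialized global in several source files are the ones likely to need a workaround if this flag goes into the defaults.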
Found when it was added: https://github.com/clearlinux/autospec/blame/a5260d7ce751774d46e0a957786d179456a14275/autospec/buildpattern.py
It was added by @fenrus75 for "high speed cases". Interesting.
I notice that on packages that are optimized for size, they enable -ffunction-sections and -fdata-sections for dead code removal, along with a -Wl,--gc-sections. However, these are two flags I want to research more before enabling by default -- I'm unsure how these interact with LTO. I assumed that LTO kind of did dead code elimination on its own, since the entire program would be visible at link time (minus definitions in shared objects).
After researching a bit more, it looks like -Wl,--gc-sections is a weak form of LTO, and it is often compared to full LTO like GentooLTO uses. I'm not sure if there's a benefit to using both at the same time.
https://lwn.net/Articles/741494/
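To make the section-based dead code removal concrete, here is a rough sketch. The file name and function names are made up. With -ffunction-sections each function lands in its own section, so the linker can discard the ones nothing refers to when given --gc-sections.

```c
/* Build sketch (flags from the discussion above; sections.c is a made-up
 * file name):
 *   gcc -Os -ffunction-sections -fdata-sections -c sections.c
 *   gcc -Wl,--gc-sections ... sections.o
 * With -ffunction-sections each function gets its own section
 * (.text.used_helper, .text.unused_helper, ...), so the linker can drop
 * the sections nothing refers to. */

int used_helper(int x)
{
    return x + 1;
}

/* Never referenced: a candidate for removal by --gc-sections when linked
 * into an executable. */
int unused_helper(int x)
{
    return x * 1000;
}

int entry_point(int x)
{
    return used_helper(x);
}
```

Note that in a shared object, exported functions count as referenced, so this mostly helps executables and symbols with hidden visibility.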
OK -- so locally, I have enabled -fno-common and -fno-semantic-interposition and have started building a few packages with them. I'll try them out for a few days before pushing them. I've also emailed the Clear Linux developers about -falign-functions=32. If it turns out to be beneficial for some systems, I will add it as a recommendation but I won't enable it by default in the overlay -- it will be opt-in behaviour.
OK -- I think I figured it out:
https://software.intel.com/en-us/forums/intel-c-compiler/topic/635646
For more info:
https://lkml.org/lkml/2015/5/19/1009
It looks like the historical reason for -falign-functions=16 is:
The instruction fetch unit can fetch a maximum of 16 bytes of code per clock cycle
From Agner Fog's docs:
https://www.agner.org/optimize/microarchitecture.pdf
See "Instruction Fetch" sections for details.
However, consider that cache lines are usually 64 bytes long -- depending on your processor. From Ingo's post:
So based on those measurements, I think we should do the exact opposite of my original patch that reduced alignment to 1 bytes, and increase kernel function address alignment from 16 bytes to the natural cache line size (64 bytes on modern CPUs).
As for why -falign-functions=32 was chosen? I have a feeling it's actually a compromise. See this reply by Linus: https://lkml.org/lkml/2015/5/19/1142
Is there some way to get gcc to take the size of the function into account? Because aligning a 16-byte or 32-byte function on a 64-byte alignment is just criminally nasty and wasteful.
So, for functions that are larger than the cache line size, aligning on a cache line boundary makes the most sense. For functions that are smaller than the cache line size, this isn't ideal as it wastes I$ space. Of course, when inlining is taken into account, which is much more likely since we are using system-wide LTO, this whole discussion becomes moot. However, this is still a problem for shared objects.
Ideally, GCC/ld would be smarter about how it aligns functions.
Work has been done to this end: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66240 It has been merged in trunk!
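The trade-off between large and small functions can be put into numbers with a little arithmetic. This is only a back-of-the-envelope sketch with made-up helper names, not anything taken from GCC: it counts how many cache lines a function occupies for a given start address, and how much padding alignment would cost.

```c
#include <stddef.h>

/* Number of cache lines a function occupies if it starts at byte address
 * `start`, is `size` bytes long (size > 0), and lines are `line` bytes. */
size_t cache_lines_touched(size_t start, size_t size, size_t line)
{
    size_t first = start / line;
    size_t last = (start + size - 1) / line;
    return last - first + 1;
}

/* Padding needed to move `start` up to the next multiple of `align`
 * (align must be a power of two). */
size_t padding_to_align(size_t start, size_t align)
{
    return (align - (start & (align - 1))) & (align - 1);
}
```

For example, a 100-byte function starting at address 60 touches three 64-byte lines but only two when it starts at 64, whereas a 16-byte function starting at address 8 already fits in one line and aligning it to 64 would only add 56 bytes of padding.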
So, looking at -falign-functions once again:
Align the start of functions to the next power-of-two greater than n, skipping up to n bytes. For instance, -falign-functions=32 aligns functions to the next 32-byte boundary, but -falign-functions=24 aligns to the next 32-byte boundary only if this can be done by skipping 23 bytes or less.
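In other words, -falign-functions=24 will move a function up to the next 32-byte boundary unless that would take more than 23 bytes of padding, i.e. unless the function would otherwise start 1 to 8 bytes past a 32-byte boundary. Note that this depends on where the function would otherwise start, not on how big it is.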
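Here is a small model of the "skip at most this many bytes" rule, assuming -falign-functions=24 turns into the assembler directive .p2align 5,,23 (a 32-byte boundary with a 23-byte skip limit). apply_p2align is a made-up helper, not a GCC or binutils function.

```c
#include <stddef.h>

/* Model of the assembler directive `.p2align log,,max_skip`:
 * pad up to the next 2^log boundary, unless that needs more than
 * max_skip bytes, in which case no padding is emitted at all.
 * Returns the address after the directive. */
size_t apply_p2align(size_t addr, unsigned log, size_t max_skip)
{
    size_t align = (size_t)1 << log;
    size_t pad = (align - (addr & (align - 1))) & (align - 1);
    return pad <= max_skip ? addr + pad : addr;
}
```

With log = 5 and max_skip = 23, an address of 41 needs 23 bytes of padding and is moved to 64, while an address of 40 would need 24 bytes and is left where it is.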
And another goodie -flimit-function-alignment:
If this option is enabled, the compiler tries to avoid unnecessarily overaligning functions. It attempts to instruct the assembler to align by the amount specified by -falign-functions, but not to skip more bytes than the size of the function.
This flag is off by default!
Delving in the GCC source code in file gcc/config/i386/x86-64.h:
#define ASM_OUTPUT_MAX_SKIP_ALIGN(FILE,LOG,MAX_SKIP) \
do { \
if ((LOG) != 0) { \
if ((MAX_SKIP) == 0) fprintf ((FILE), "\t.p2align %d\n", (LOG)); \
else { \
fprintf ((FILE), "\t.p2align %d,,%d\n", (LOG), (MAX_SKIP)); \
/* Make sure that we have at least 8 byte alignment if > 8 byte \
alignment is preferred. */ \
if ((LOG) > 3 \
&& (1 << (LOG)) > ((MAX_SKIP) + 1) \
&& (MAX_SKIP) >= 7) \
fputs ("\t.p2align 3\n", (FILE)); \
} \
} \
} while (0)
and the calling code in gcc/varasm.c:
...
#ifdef ASM_OUTPUT_MAX_SKIP_ALIGN
int align_log = align_functions_log;
#endif
int max_skip = align_functions - 1;
if (flag_limit_function_alignment && crtl->max_insn_address > 0
&& max_skip >= crtl->max_insn_address)
max_skip = crtl->max_insn_address - 1;
#ifdef ASM_OUTPUT_MAX_SKIP_ALIGN
ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file, align_log, max_skip);
#else
ASM_OUTPUT_ALIGN (asm_out_file, align_functions_log);
#endif
}
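To see what these two pieces emit together, here is a standalone model that can be compiled outside of GCC. It is a sketch: function_max_skip and emit_function_align are made-up names, fn_size stands in for crtl->max_insn_address, and the output goes to a buffer instead of the assembler file.

```c
#include <stdio.h>

/* Mirrors the caller: the skip limit is n - 1, shrunk to fn_size - 1 when
 * -flimit-function-alignment is on and the function is shorter than n. */
int function_max_skip(int n, int fn_size, int limit_alignment)
{
    int max_skip = n - 1;
    if (limit_alignment && fn_size > 0 && max_skip >= fn_size)
        max_skip = fn_size - 1;
    return max_skip;
}

/* Mirrors ASM_OUTPUT_MAX_SKIP_ALIGN, writing into buf instead of a FILE. */
void emit_function_align(char *buf, size_t len, int log, int max_skip)
{
    if (len == 0)
        return;
    buf[0] = '\0';
    if (log == 0)
        return;
    if (max_skip == 0) {
        snprintf(buf, len, "\t.p2align %d\n", log);
        return;
    }
    int used = snprintf(buf, len, "\t.p2align %d,,%d\n", log, max_skip);
    /* Secondary 8-byte alignment when the primary one may be skipped. */
    if (log > 3 && (1 << log) > max_skip + 1 && max_skip >= 7
        && used > 0 && (size_t)used < len)
        snprintf(buf + used, len - (size_t)used, "\t.p2align 3\n");
}
```

For the default n = 16 (log = 4), a 100-byte function gets a single .p2align 4,,15, while a 10-byte function with -flimit-function-alignment gets .p2align 4,,9 followed by .p2align 3.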
So in the worst case, we still get 8-byte function alignment for functions that are smaller than the -falign-functions value. With the default, you get at most 16-byte alignment and at least 8-byte alignment with -flimit-function-alignment. It would probably make more sense to make it the L1 cache line size by default and at least 16 bytes with -flimit-function-alignment. This is a pretty trivial change to make:
#define ASM_OUTPUT_MAX_SKIP_ALIGN(FILE,LOG,MAX_SKIP) \
do { \
if ((LOG) != 0) { \
if ((MAX_SKIP) == 0) fprintf ((FILE), "\t.p2align %d\n", (LOG)); \
else { \
fprintf ((FILE), "\t.p2align %d,,%d\n", (LOG), (MAX_SKIP)); \
if ((1 << (LOG)) > ((MAX_SKIP) + 1)) \
{ \
/* Make sure that we have at least 16 byte alignment \
if > 16 byte alignment is preferred. */ \
if ((LOG) > 4 && (MAX_SKIP) >= 15) \
fputs ("\t.p2align 4\n", (FILE)); \
/* Make sure that we have at least 8 byte alignment if > 8 byte \
alignment is preferred. */ \
else if ((LOG) > 3 && (MAX_SKIP) >= 7) \
fputs ("\t.p2align 3\n", (FILE)); \
} \
} \
} \
} while (0)
The above should guarantee the following, for a function that takes b bytes, with -falign-functions=n and -flimit-function-alignment:
- If b >= n: for sure will be aligned to n
- If n > 16 and 16 <= b < n: will be at least aligned to a 16 byte boundary
- Otherwise, if n > 8 and 8 <= b < n: will be at least aligned to a 8 byte boundary
The check is done in this order to prevent wasting space.
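The intended guarantee for a b-byte function under -falign-functions=n with the modified macro can be written down as a tiny function. This is just my reading of the rules restated as code, with a made-up name; it gives the minimum alignment promised, and a short function may still end up on an n-byte boundary when little padding is needed.

```c
/* Minimum start alignment (in bytes) promised for a function of b bytes
 * under -falign-functions=n with -flimit-function-alignment and the
 * modified macro. n is assumed to be a power of two.
 * Returns 1 when nothing is promised. */
unsigned guaranteed_alignment(unsigned n, unsigned b)
{
    if (b >= n)
        return n;
    if (n > 16 && b >= 16)
        return 16;
    if (n > 8 && b >= 8)
        return 8;
    return 1;
}
```

For n = 64 this gives 64 for a 100-byte function, 16 for a 20-byte one, 8 for a 10-byte one, and no promise for a 4-byte one.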
So, it seems to me we should be using -falign-functions=${L1ICACHELINESIZE} -flimit-function-alignment.
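${L1ICACHELINESIZE} above is a placeholder. One possible way to find the value on a glibc system is sysconf, sketched below; l1_icache_line_size is a made-up wrapper name, and the constant is a glibc extension that may be missing or report 0 elsewhere, hence the guard and the -1 fallback.

```c
#include <unistd.h>

/* Returns the L1 instruction cache line size in bytes, or -1 if the
 * platform does not report it. _SC_LEVEL1_ICACHE_LINESIZE is a glibc
 * extension, hence the guard; it can also yield 0 on some systems. */
long l1_icache_line_size(void)
{
#ifdef _SC_LEVEL1_ICACHE_LINESIZE
    long size = sysconf(_SC_LEVEL1_ICACHE_LINESIZE);
    return size > 0 ? size : -1;
#else
    return -1;
#endif
}
```

From a shell, getconf LEVEL1_ICACHE_LINESIZE should report the same number on glibc systems.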
I will test out my GCC patch and if it works OK, I will submit it upstream.
nice detective work!
Heh, of course the GCC devs beat me to the punch.
https://github.com/gcc-mirror/gcc/commit/bc9f52f574c9dd2f620c12a5c651310327374da6
It looks like they are reworking the -flimit-function-alignment stuff in the next GCC version, given the commit message.
Thanks! In GCC trunk, we have this nice thing:
{
/* N2[:M2] is not specified. This arch has a default for N2.
Before -falign-foo=N:M:N2:M2 was introduced, x86 had a tweak.
-falign-functions=N with N > 8 was adding secondary alignment.
-falign-functions=10 was emitting this before every function:
.p2align 4,,9
.p2align 3
Now this behavior (and more) can be explicitly requested:
-falign-functions=16:10:8
Retain old behavior if N2 is missing: */
So, we may be able to say something like -falign-functions=64:48:16:8 which should:
- Align to 64 bytes if it can be done by skipping 48 bytes or less
- Align to 16 bytes if it can be done by skipping 8 bytes or less
Obviously these values would need to be tweaked. But it would give the desired result at least.
This doesn't appear to be documented anywhere however.
Okay, it is documented and I simply didn't look hard enough. It's hard to retain the old behaviour with the new method, since the secondary alignment will only be triggered if -flimit-function-alignment is not passed in. Sigh.
Got a response from Arjan van de Ven!
without going into too many cpu microarchitecture details... Intel cpus like hot code to start at a 32 byte boundary.
Very interesting. So, Ingo's findings confirm this to a degree, and suggest even stronger alignment requirements are beneficial. He says it best here: https://lkml.org/lkml/2015/5/21/443
I think, with -falign-functions=n and -flimit-function-alignment we get 90% of the way there, actually:
- Functions greater than n bytes are aligned to an n byte boundary
- Functions less than n bytes are tightly packed, unless they will cross an n byte boundary
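To get a feel for the space cost, here is a toy layout simulation. It is a sketch with a made-up function name: it models only the primary .p2align directive, ignores the x86 secondary 8-byte alignment, and places functions back to back starting at address 0.

```c
#include <stddef.h>

/* Lay out functions back to back starting at address 0 and return the end
 * address. Each function start gets `.p2align log2(n),,max_skip`, where
 * max_skip is n - 1, or size - 1 when limit is nonzero and the function is
 * shorter than n. n must be a power of two; sizes must be nonzero. */
size_t layout_end(const size_t *sizes, size_t count, size_t n, int limit)
{
    size_t addr = 0;
    for (size_t i = 0; i < count; i++) {
        size_t max_skip = n - 1;
        if (limit && sizes[i] < n)
            max_skip = sizes[i] - 1;
        size_t pad = (n - (addr & (n - 1))) & (n - 1);
        if (pad <= max_skip)
            addr += pad;   /* aligned: the function would cross a boundary */
        addr += sizes[i];
    }
    return addr;
}
```

Three 10-byte functions with n = 32 take 74 bytes without -flimit-function-alignment and 30 bytes with it. Two 20-byte functions with the flag take 52 bytes, because the second one would otherwise straddle the 32-byte boundary and is pushed up to it.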
For fun, here's an attempt to restore the functionality in my previous patch on GCC trunk:
/* Handle a user-specified function alignment.
Note that we still need to align to DECL_ALIGN, as above,
because ASM_OUTPUT_MAX_SKIP_ALIGN might not do any alignment at all. */
if (! DECL_USER_ALIGN (decl)
&& align_functions.levels[0].log > align
&& optimize_function_for_speed_p (cfun))
{
#ifdef ASM_OUTPUT_MAX_SKIP_ALIGN
int max_skip1 = align_functions.levels[0].maxskip;
int max_skip2 = align_functions.levels[1].maxskip;
if (flag_limit_function_alignment)
{
if (crtl->max_insn_address > 0
&& max_skip1 >= crtl->max_insn_address)
max_skip1 = crtl->max_insn_address - 1;
if (crtl->max_insn_address > 0
&& max_skip2 >= crtl->max_insn_address)
max_skip2 = crtl->max_insn_address - 1;
}
ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file,
align_functions.levels[0].log,
max_skip1);
ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file,
align_functions.levels[1].log,
max_skip2);
#else
ASM_OUTPUT_ALIGN (asm_out_file, align_functions.levels[0].log);
#endif
}
So, with m < n, -falign-functions=n:n:m:m -flimit-function-alignment, for a b byte function would:
- If n <= b, will be at least aligned to an n byte boundary
- If m <= b < n, will be at least aligned to an m byte boundary
- If b < m, if the function would cross the boundary m, it will be aligned to m
- Otherwise, will use target default function alignment (unknown what this is defined as in GCC, but I suspect for x86_64 it is either 8 or 16 -- if anyone knows please let me know). If this is 0, then it's tightly packed.
Examples for the above would be n = 64 or n = 32 and m = 16
Obviously such a scheme would need benchmarks to show it's worth doing over the default. It could potentially waste space, too, since a function with m <= b < n may align to an n boundary, instead of a potentially closer m boundary. It's too bad Ingo's scheme is too hard to implement in a quick patch, as I'd love to test his out.
Regardless of whether the default schemes or the one I posted above is used, based on what we have seen, n should be either 64 or 32 for Intel processors, and we may or may not want to tightly pack small functions with -flimit-function-alignment. We'd need benchmarks to show for sure what's worth enabling, but I think it's safe to go with Arjan van de Ven's choice of -falign-functions=32 for Intel processors and not tightly packing functions in the meantime. I will update README.md accordingly.
As I find this issue to be very interesting, I'd like to leave it up for discussion, especially in the hopes we get some benchmarks using combinations of these flags. My diff against GCC trunk for my own alignment scheme is below, in case anyone wants to try it on GCC trunk:
diff --git a/gcc/varasm.c b/gcc/varasm.c
index 545e13fef6a..6ed87298ec9 100644
--- a/gcc/varasm.c
+++ b/gcc/varasm.c
@@ -1809,19 +1809,24 @@ assemble_start_function (tree decl, const char *fnname)
&& optimize_function_for_speed_p (cfun))
{
#ifdef ASM_OUTPUT_MAX_SKIP_ALIGN
- int align_log = align_functions.levels[0].log;
-#endif
- int max_skip = align_functions.levels[0].maxskip;
- if (flag_limit_function_alignment && crtl->max_insn_address > 0
- && max_skip >= crtl->max_insn_address)
- max_skip = crtl->max_insn_address - 1;
+ int max_skip1 = align_functions.levels[0].maxskip;
+ int max_skip2 = align_functions.levels[1].maxskip;
+ if (flag_limit_function_alignment)
+ {
+ if (crtl->max_insn_address > 0
+ && max_skip1 >= crtl->max_insn_address)
+ max_skip1 = crtl->max_insn_address - 1;
-#ifdef ASM_OUTPUT_MAX_SKIP_ALIGN
- ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file, align_log, max_skip);
- if (max_skip == align_functions.levels[0].maxskip)
- ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file,
- align_functions.levels[1].log,
- align_functions.levels[1].maxskip);
+ if (crtl->max_insn_address > 0
+ && max_skip2 >= crtl->max_insn_address)
+ max_skip2 = crtl->max_insn_address - 1;
+ }
+ ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file,
+ align_functions.levels[0].log,
+ max_skip1);
+ ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file,
+ align_functions.levels[1].log,
+ max_skip2);
#else
ASM_OUTPUT_ALIGN (asm_out_file, align_functions.levels[0].log);
#endif
In addition to the function alignment, there's also data-alignment which takes a cacheline option: -malign-data=cacheline
I also build my system with: -mtls-dialect=gnu2
The Clear Linux "fast-math" options can be very beneficial to auto-vectorization, it gives many more opportunities than the default IEEE754 strict compliance.
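A quick illustration of why strict IEEE 754 rules get in the way: floating-point addition is not associative, so the compiler may not regroup a sum unless told it can. The sketch below uses made-up function names. -ffast-math enables this kind of regrouping through -fassociative-math, which is what vectorizing a floating-point reduction needs.

```c
/* Two groupings of the same three-term sum. Under strict IEEE 754 rules
 * they can give different results, so the compiler must keep the order
 * the source spells out. -ffast-math (via -fassociative-math) lets it
 * regroup, which is what vectorizing a floating-point reduction needs. */
double sum_left_to_right(double a, double b, double c)
{
    return (a + b) + c;
}

double sum_right_to_left(double a, double b, double c)
{
    return a + (b + c);
}
```

Built without fast-math on x86-64, calling both with (1e16, -1e16, 1.0) gives 1.0 for the first and 0.0 for the second, because -1e16 + 1.0 rounds back to -1e16 in double precision. That difference is exactly what scientific users may care about, and what gets traded away for speed.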
@sjnewbury:
-mtls-dialect is a nice one!
-malign-data=cacheline too.
I got the impression that -malign-data, if changed, may not be compatible with code compiled with GCC 4.8 or older. Do you know if this one affects binary compatibility? If so, this would mostly affect closed source software, and possibly users of -bin packages. I see a number of recommendations for high performance code to use -malign-data=cacheline, so this may be a non-issue at this time.
I've been hemming and hawing about the strict IEEE compliance myself, and I've decided we can support it as an opt-in enhancement. I know some users of this overlay are using it for scientific computations, and I don't want to interfere with that automatically.
One more thing: is -malign-data documented in detail anywhere? I can't find much in the official GCC docs. I see references to Clear Linux using -malign-data=abi at one point. If necessary, I'll go look through the GCC code again.
I use the code from this bug report to benchmark -falign-functions:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58863
I just compile it with my CFLAGS and then run time ./align 32 32. When I see a difference, I compile PHP 7.2 and run phpbench to see if I'm getting any benefit. For a 7900X and GCC 8.2 (with Clear Linux patches applied), -falign-functions=8 seems to be the fastest.
Also you should look into --param inline-unit-growth=5 --param max-unroll-times=2:
http://hubicka.blogspot.com/2014/04/linktime-optimization-in-gcc-2-firefox.html
https://www.phoronix.com/forums/forum/software/programming-compilers/47966-intel-broadwell-gcc-4-9-vs-llvm-clang-3-5-compiler-benchmarks/page2
These two are still providing some benefit. -funroll-loops should be set together with --param max-unroll-times=2 to get the improvement.
Well, I don't want to use -funroll-loops and friends because those override the compiler's judgement. Even when you add in --param max-unroll-times=2 --param inline-unit-growth=5, you're still telling the compiler to do something unconditionally. Certain packages may benefit, but the idea is we should be letting the compiler decide what to do. Hence why -falign-functions=32 is documented as being an optional thing for Intel chips, based on my conversation with a Clear Linux developer.
Now of course, we may want to enable these flags on a per-package basis where improvement has been proven via benchmarks (as with your php example). Otherwise, it should be the defaults with possibly a few optional tweaks on a per package basis.
Is -falign-functions=32 helpful for all Intel CPUs, even old ones?