resolve-march-native
P and E cores may return different values
resolve-march-native should use lscpu+taskset to pin the gcc process to E cores (i.e. those with the lowest maximum frequency), to provide binaries optimized for the least powerful CPUs on the system.
AFAIK, the instruction set does not change across E and P cores at least on Intel CPUs (in fact, Intel disabled AVX512 on P cores even if they do support it, precisely to avoid issues like these).
However, other settings like the cache size may change (verified on an i9-14900K), which may lead to suboptimal code generation (i.e. using P core config for E cores).
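To make the lscpu part of the suggestion concrete, here is a minimal sketch (not code from resolve-march-native) that picks the CPUs with the lowest MAXMHZ out of `lscpu --all --extended` output. The function name `slowest_cpus` is hypothetical, and the sketch assumes that "lowest maximum frequency" is a good enough proxy for "E core", which is a heuristic rather than a guarantee.

```python
def slowest_cpus(lscpu_text: str) -> list[int]:
    """Return the CPU ids with the lowest MAXMHZ in `lscpu --all --extended` output.

    Hypothetical helper: assumes the output has a header row containing
    the CPU and MAXMHZ columns, as in the lscpu versions seen so far.
    """
    lines = [ln.split() for ln in lscpu_text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    cpu_col = header.index("CPU")
    max_col = header.index("MAXMHZ")

    max_mhz = {}
    for row in rows:
        try:
            max_mhz[int(row[cpu_col])] = float(row[max_col])
        except (IndexError, ValueError):
            # Offline CPUs may show "-" instead of a frequency; skip them.
            continue

    if not max_mhz:
        return []
    lowest = min(max_mhz.values())
    return sorted(cpu for cpu, mhz in max_mhz.items() if mhz == lowest)


SAMPLE = """\
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ MHZ
0 0 0 0 0:0:0:0 yes 5700.0000 800.0000 5627.8760
12 0 0 6 24:24:6:0 yes 6000.0000 800.0000 5420.0430
16 0 0 8 32:32:8:0 yes 4400.0000 800.0000 4402.0991
31 0 0 23 47:47:11:0 yes 4400.0000 800.0000 4399.6338
"""

print(slowest_cpus(SAMPLE))  # [16, 31]
```

The resulting ids could then be joined with commas and passed to `taskset -c`, e.g. `taskset -c 16,31 resolve-march-native`.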
@danog I have a vague idea of what this is about but I have a feeling a voice call would save us a ton of ping pong here. Would you be up for a voice call in English or German? I'll consider "sorry, no" an okay answer, so no worries in that case, we'll just need a wall of text then.
I'm okay for the wall of text :)
@danog to better understand:
- How would suggested GCC command args differ from a user point of view between E and P, can you give an example output?
- Can you demo use of lscpu and taskset to target E and P cores respectively?
- The feature seems to be non-trivial and I would have to find hardware to test this feature with:
- Is this for fun or is there funding available for this feature?
- Do you see any hosts at https://portal.cfarm.net/machines/list/ that have supporting hardware?
- How will we reliably detect if the host supports the feature and is worth using a more complex approach?
- When and why do you consider targeting the least powerful core to be the desired target, what are your underlying assumptions there?
- How do you know that GCC code will respect the current running process when figuring out what to optimize for?
- How would suggested GCC command args differ from a user point of view between E and P, can you give an example output?
- Can you demo use of lscpu and taskset to target E and P cores respectively?
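As mentioned in the original message, on an i9-14900K the reported L1 cache size changes between P and E cores (48 vs. 32 in the output below), while the cache line size stays at 64: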
[root@e536288f9523 /]# lscpu --all --extended
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ MHZ
0 0 0 0 0:0:0:0 yes 5700.0000 800.0000 5627.8760
1 0 0 0 0:0:0:0 yes 5700.0000 800.0000 800.0000
2 0 0 1 4:4:1:0 yes 5700.0000 800.0000 5430.9512
3 0 0 1 4:4:1:0 yes 5700.0000 800.0000 800.0000
4 0 0 2 8:8:2:0 yes 5700.0000 800.0000 5690.1880
5 0 0 2 8:8:2:0 yes 5700.0000 800.0000 5200.5322
6 0 0 3 12:12:3:0 yes 5700.0000 800.0000 5700.0000
7 0 0 3 12:12:3:0 yes 5700.0000 800.0000 5573.0542
8 0 0 4 16:16:4:0 yes 5700.0000 800.0000 5700.0000
9 0 0 4 16:16:4:0 yes 5700.0000 800.0000 5700.0000
10 0 0 5 20:20:5:0 yes 5700.0000 800.0000 5640.1240
11 0 0 5 20:20:5:0 yes 5700.0000 800.0000 5536.8691
12 0 0 6 24:24:6:0 yes 6000.0000 800.0000 5420.0430
13 0 0 6 24:24:6:0 yes 6000.0000 800.0000 5700.0000
14 0 0 7 28:28:7:0 yes 6000.0000 800.0000 5561.5508
15 0 0 7 28:28:7:0 yes 6000.0000 800.0000 5700.0000
16 0 0 8 32:32:8:0 yes 4400.0000 800.0000 4402.0991
17 0 0 9 33:33:8:0 yes 4400.0000 800.0000 4397.0781
18 0 0 10 34:34:8:0 yes 4400.0000 800.0000 4400.1318
19 0 0 11 35:35:8:0 yes 4400.0000 800.0000 4400.8540
20 0 0 12 36:36:9:0 yes 4400.0000 800.0000 4398.7090
21 0 0 13 37:37:9:0 yes 4400.0000 800.0000 4393.4512
22 0 0 14 38:38:9:0 yes 4400.0000 800.0000 4395.5298
23 0 0 15 39:39:9:0 yes 4400.0000 800.0000 4399.3579
24 0 0 16 40:40:10:0 yes 4400.0000 800.0000 4393.9531
25 0 0 17 41:41:10:0 yes 4400.0000 800.0000 800.0000
26 0 0 18 42:42:10:0 yes 4400.0000 800.0000 4391.8730
27 0 0 19 43:43:10:0 yes 4400.0000 800.0000 4397.1182
28 0 0 20 44:44:11:0 yes 4400.0000 800.0000 4400.0059
29 0 0 21 45:45:11:0 yes 4400.0000 800.0000 4416.6660
30 0 0 22 46:46:11:0 yes 4400.0000 800.0000 4404.5049
31 0 0 23 47:47:11:0 yes 4400.0000 800.0000 4399.6338
[root@e536288f9523 /]# taskset -c 0 resolve-march-native
-march=alderlake -mabm -mno-cldemote -mno-kl -mno-sgx -mno-widekl -mshstk --param=l1-cache-line-size=64 --param=l1-cache-size=48 --param=l2-cache-size=36864
[root@e536288f9523 /]# taskset -c 31 resolve-march-native
-march=alderlake -mabm -mno-cldemote -mno-kl -mno-sgx -mno-widekl -mshstk --param=l1-cache-line-size=64 --param=l1-cache-size=32 --param=l2-cache-size=36864
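For readers who want to see exactly what differs between the two runs, a small sketch (the helper name `flag_diff` is made up for illustration) that compares the two output lines above flag by flag:

```python
def flag_diff(a: str, b: str) -> tuple[list[str], list[str]]:
    """Return (flags only in a, flags only in b), keeping the original order."""
    flags_a, flags_b = a.split(), b.split()
    only_a = [f for f in flags_a if f not in flags_b]
    only_b = [f for f in flags_b if f not in flags_a]
    return only_a, only_b


# Output lines copied from the taskset runs above (CPU 0 and CPU 31).
CPU_0 = (
    "-march=alderlake -mabm -mno-cldemote -mno-kl -mno-sgx -mno-widekl -mshstk "
    "--param=l1-cache-line-size=64 --param=l1-cache-size=48 --param=l2-cache-size=36864"
)
CPU_31 = (
    "-march=alderlake -mabm -mno-cldemote -mno-kl -mno-sgx -mno-widekl -mshstk "
    "--param=l1-cache-line-size=64 --param=l1-cache-size=32 --param=l2-cache-size=36864"
)

print(flag_diff(CPU_0, CPU_31))
# (['--param=l1-cache-size=48'], ['--param=l1-cache-size=32'])
```

So on this machine the only difference is `--param=l1-cache-size`; the `-march` value and the `-m...` flags are identical.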
- Is this for fun or is there funding available for this feature?
  This was an issue I encountered at work and worked around myself; I just wanted to report it here. I may or may not submit a PR myself. I created this issue just to track the problem, which causes non-deterministic output.
- Do you see any hosts at https://portal.cfarm.net/machines/list/ that have supporting hardware?
  Any 12th+ gen Intel will have a mix of P and E cores. I mainly see Haswell CPUs in there, so no.
- How will we reliably detect if the host supports the feature and is worth using a more complex approach?
  Running lscpu and sorting by max frequency is the simple but naïve way; a more reliable way would be to manually parse the CPUID bits.
- When and why do you consider targeting the least powerful core to be the desired target, what are your underlying assumptions there?
  Simply put, there is no automatic way to pin processes to certain cores depending on compilation flags. Ideally, setaffinity would be invoked directly on process startup, pinning the process to the CPUs whose CPUID feature flags exactly match those of the processor used to compile the binary with -march=native. Since this isn't possible, the least powerful core should be targeted to ensure full compatibility. Today, targeting the P cores generally just leads to less optimized code on E cores (e.g. assuming larger caches than the E cores actually have). But if some way is found in the future to unlock AVX512 instructions on P cores, targeting the P cores could produce code that is actually incompatible with E cores.
- How do you know that GCC code will respect the current running process when figuring out what to optimize for?
  Actually, that might also cause issues, since resolve-march-native omits some flags...
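On the detection question, one alternative to both the lscpu heuristic and raw CPUID parsing could be sysfs. This is a rough sketch and not the CPUID approach mentioned above: it assumes a Linux kernel that exposes the hybrid PMUs as `/sys/devices/cpu_core/cpus` and `/sys/devices/cpu_atom/cpus`. Those paths do not come from this thread, and the function names are hypothetical.

```python
from pathlib import Path


def parse_cpu_list(text: str) -> list[int]:
    """Expand a kernel CPU list such as "0-3,8,10-11" into individual CPU ids."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def hybrid_core_groups(sysfs_root: str = "/sys/devices") -> dict[str, list[int]]:
    """Return CPU ids per core type; an empty dict means "no hybrid layout found".

    Assumption: hybrid Intel systems expose cpu_core/cpus and cpu_atom/cpus
    below sysfs_root. Non-hybrid hosts simply lack these files.
    """
    groups: dict[str, list[int]] = {}
    for name in ("cpu_core", "cpu_atom"):
        path = Path(sysfs_root) / name / "cpus"
        if path.is_file():
            groups[name] = parse_cpu_list(path.read_text())
    return groups


print(parse_cpu_list("0-3,8,10-11"))  # [0, 1, 2, 3, 8, 10, 11]
```

If `hybrid_core_groups()` comes back empty, the tool could keep its current simple behavior and only use the more complex per-core approach when both groups are present.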
@danog thanks for your reply! I have three parts for a reply at the moment:
First, I find it interesting to see that the maximum CPU frequency output of lscpu is not stable across calls and so e.g. lscpu --all --extended | tail -n+2 | awk '{print $1, $9}' | sort -k2 -r is not stable with regard to what the fastest CPU thread is at querying time.
Second, I am seriously wondering if that is a design flaw and potentially a bug on GCC side: they should be checking what the CPU can do, not what the current process can do, no? If they made it so to give the user a control channel, an environment variable or command line option would probably be a better channel. Do you happen to be interested in taking this to GCC upstream?
that might also cause issues, since resolve-march-native omits some flags...
Third, I did not understand that^^ last part. Could you elaborate?
@danog PS: Thanks for raising awareness about the topic with me, reporting this as an issue was a good move :+1:
I am seriously wondering if that is a design flaw and potentially a bug on GCC side
I thought the same thing, and even sent an email to get an account on the bugtracker, but they haven't replied to that one yet :D
But on the other hand, it kind of isn't GCC's fault, or at least not entirely (even if it is weird that the same GCC command running at different times with -march=native will produce different binaries, saying goodbye to reproducible builds :)). And in a way, there is still a way to pin GCC to a certain CPU, just indirectly...
Third, I did not understand that^^ last part. Could you elaborate?
Kind of a side-thought: I wasn't too sure if the result of _get_march_explicit_flag_set can also vary depending on the CPU it is run on, but thinking about it again, it should not, so ignore that :)
I thought the same thing, and even sent an email to get an account on the bugtracker, but they haven't replied to that one yet :D
https://gcc.gnu.org/PR111768
Ultimately, it's pretty harmless, just irritating, given the information is used in very few places and it can never result in wrong-code anyway.
@thesamesam thanks for the link! :+1:
For the moment we tell people to run for x in $(seq 0 $(( $(nproc) - 1 )) ); do taskset -c $x resolve-march-native; done | sort | uniq -c to get info per core and just use the common part of all of them.
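The "use the common part" step can be sketched like this. It is only an illustration of taking the intersection of the per-core outputs, with a hypothetical helper name, and the sample lines are abbreviated from the outputs earlier in this thread:

```python
def common_flags(outputs: list[str]) -> list[str]:
    """Return the flags present in every per-core output, in first-output order."""
    if not outputs:
        return []
    flag_lists = [line.split() for line in outputs]
    shared = set(flag_lists[0]).intersection(*flag_lists[1:])
    return [flag for flag in flag_lists[0] if flag in shared]


# Abbreviated per-core outputs (one line per `taskset -c N resolve-march-native` run).
PER_CORE = [
    "-march=alderlake -mabm --param=l1-cache-size=48 --param=l2-cache-size=36864",
    "-march=alderlake -mabm --param=l1-cache-size=48 --param=l2-cache-size=36864",
    "-march=alderlake -mabm --param=l1-cache-size=32 --param=l2-cache-size=36864",
]

print(" ".join(common_flags(PER_CORE)))
# -march=alderlake -mabm --param=l2-cache-size=36864
```

Note that the intersection drops `--param=l1-cache-size` entirely instead of picking the smaller value; whether dropping or taking the minimum is preferable is a design choice this sketch does not settle.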
@negril that makes good sense. Let me give back nproc --ignore=1.
@negril update: I notice now that that's maybe the one place where nproc --ignore=1 makes things worse (while usually it helps against ending up with zero by mistake on single core machines). Never mind then 😃