resolve-march-native

P and E cores may return different values

Open danog opened this issue 11 months ago • 9 comments

resolve-march-native should use lscpu+taskset to pin the gcc process to E cores (i.e. those with the lowest maximum frequency), to provide binaries optimized for the least powerful CPUs on the system.

AFAIK, the instruction set does not change across E and P cores at least on Intel CPUs (in fact, Intel disabled AVX512 on P cores even if they do support it, precisely to avoid issues like these).

However, other settings like the cache size may change (verified on an i9-14900K), which may lead to suboptimal code generation (i.e. using P core config for E cores).

danog avatar Dec 05 '24 16:12 danog

@danog I have a vague idea of what this is about, but I have a feeling a voice call would save us a ton of ping pong here. Would you be up for a voice call in English or German? I'll consider "sorry, no" an okay answer, so no worries in that case; we'll just need a wall of text then.

hartwork avatar Dec 05 '24 16:12 hartwork

I'm okay for the wall of text :)

danog avatar Dec 05 '24 16:12 danog

@danog to better understand:

  • How would suggested GCC command args differ from a user point of view between E and P, can you give an example output?
  • Can you demo use of lscpu and taskset to target E and P cores respectively?
  • The feature seems to be non-trivial and I would have to find hardware to test this feature with:
    • Is this for fun or is there funding available for this feature?
    • Do you see any hosts at https://portal.cfarm.net/machines/list/ that have supporting hardware?
  • How will we reliably detect if the host supports the feature and is worth using a more complex approach?
  • When and why do you consider targeting the least powerful core to be the desired target, what are your underlying assumptions there?
  • How do you know that GCC code will respect the current running process when figuring out what to optimize for?

hartwork avatar Dec 05 '24 16:12 hartwork

  • How would suggested GCC command args differ from a user point of view between E and P, can you give an example output?
  • Can you demo use of lscpu and taskset to target E and P cores respectively?

As mentioned in the original message, on an i9-14900K the reported L1 cache size changes (48 when pinned to CPU 0 versus 32 when pinned to CPU 31 in the output below):

[root@e536288f9523 /]# lscpu --all --extended
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ   MINMHZ       MHZ
  0    0      0    0 0:0:0:0          yes 5700.0000 800.0000 5627.8760
  1    0      0    0 0:0:0:0          yes 5700.0000 800.0000  800.0000
  2    0      0    1 4:4:1:0          yes 5700.0000 800.0000 5430.9512
  3    0      0    1 4:4:1:0          yes 5700.0000 800.0000  800.0000
  4    0      0    2 8:8:2:0          yes 5700.0000 800.0000 5690.1880
  5    0      0    2 8:8:2:0          yes 5700.0000 800.0000 5200.5322
  6    0      0    3 12:12:3:0        yes 5700.0000 800.0000 5700.0000
  7    0      0    3 12:12:3:0        yes 5700.0000 800.0000 5573.0542
  8    0      0    4 16:16:4:0        yes 5700.0000 800.0000 5700.0000
  9    0      0    4 16:16:4:0        yes 5700.0000 800.0000 5700.0000
 10    0      0    5 20:20:5:0        yes 5700.0000 800.0000 5640.1240
 11    0      0    5 20:20:5:0        yes 5700.0000 800.0000 5536.8691
 12    0      0    6 24:24:6:0        yes 6000.0000 800.0000 5420.0430
 13    0      0    6 24:24:6:0        yes 6000.0000 800.0000 5700.0000
 14    0      0    7 28:28:7:0        yes 6000.0000 800.0000 5561.5508
 15    0      0    7 28:28:7:0        yes 6000.0000 800.0000 5700.0000
 16    0      0    8 32:32:8:0        yes 4400.0000 800.0000 4402.0991
 17    0      0    9 33:33:8:0        yes 4400.0000 800.0000 4397.0781
 18    0      0   10 34:34:8:0        yes 4400.0000 800.0000 4400.1318
 19    0      0   11 35:35:8:0        yes 4400.0000 800.0000 4400.8540
 20    0      0   12 36:36:9:0        yes 4400.0000 800.0000 4398.7090
 21    0      0   13 37:37:9:0        yes 4400.0000 800.0000 4393.4512
 22    0      0   14 38:38:9:0        yes 4400.0000 800.0000 4395.5298
 23    0      0   15 39:39:9:0        yes 4400.0000 800.0000 4399.3579
 24    0      0   16 40:40:10:0       yes 4400.0000 800.0000 4393.9531
 25    0      0   17 41:41:10:0       yes 4400.0000 800.0000  800.0000
 26    0      0   18 42:42:10:0       yes 4400.0000 800.0000 4391.8730
 27    0      0   19 43:43:10:0       yes 4400.0000 800.0000 4397.1182
 28    0      0   20 44:44:11:0       yes 4400.0000 800.0000 4400.0059
 29    0      0   21 45:45:11:0       yes 4400.0000 800.0000 4416.6660
 30    0      0   22 46:46:11:0       yes 4400.0000 800.0000 4404.5049
 31    0      0   23 47:47:11:0       yes 4400.0000 800.0000 4399.6338
[root@e536288f9523 /]# taskset -c 0 resolve-march-native
-march=alderlake -mabm -mno-cldemote -mno-kl -mno-sgx -mno-widekl -mshstk --param=l1-cache-line-size=64 --param=l1-cache-size=48 --param=l2-cache-size=36864
[root@e536288f9523 /]# taskset -c 31 resolve-march-native
-march=alderlake -mabm -mno-cldemote -mno-kl -mno-sgx -mno-widekl -mshstk --param=l1-cache-line-size=64 --param=l1-cache-size=32 --param=l2-cache-size=36864

  • Is this for fun or is there funding available for this feature?

This was an issue I encountered at work, which I worked around myself; I just wanted to report it here. I may or may not submit a PR myself. I created this issue just to track the problem, which causes non-deterministic output.

  • Do you see any hosts at https://portal.cfarm.net/machines/list/ that have supporting hardware?

Any 12th+ gen Intel CPU will have a mix of P and E cores. I mainly see Haswell CPUs in there, so no.

  • How will we reliably detect if the host supports the feature and is worth using a more complex approach?

Sorting the lscpu output by maximum frequency is the simple but naïve way; a more reliable way would be to manually parse the CPUID bits.

  • When and why do you consider targeting the least powerful core to be the desired target, what are your underlying assumptions there?

Simply put, there is no automatic way to pin a process to particular cores based on the flags it was compiled with. Ideally, setaffinity would be invoked directly at process startup, pinning the process to a CPU whose CPUID feature flags exactly match those of the processor used to compile the binary with -march=native. Since this isn't possible, the least powerful core should be targeted to ensure full compatibility. Generally, targeting the P cores just leads to less optimized code on E cores (e.g. assuming larger caches when only smaller ones are available), but if in the future some way is found to unlock AVX512 instructions on P cores, targeting the P cores could produce code that is actually incompatible with E cores.

  • How do you know that GCC code will respect the current running process when figuring out what to optimize for?

Actually, that might also cause issues, since resolve-march-native omits some flags...
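
The "simple but naïve" detection described above could be sketched roughly as follows. This is only an illustration, not how resolve-march-native works: the function names `group_cpus_by_max_mhz` and `slowest_cpus` are hypothetical, the sample rows are copied from the lscpu output earlier in this comment, and treating the lowest MAXMHZ group as the E cores is exactly the heuristic being called naïve. Columns are looked up by header name so the sketch does not depend on their position.

```python
from collections import defaultdict

# A few rows copied from the `lscpu --all --extended` output shown above.
SAMPLE = """
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ   MINMHZ       MHZ
  0    0      0    0 0:0:0:0          yes 5700.0000 800.0000 5627.8760
 12    0      0    6 24:24:6:0        yes 6000.0000 800.0000 5420.0430
 16    0      0    8 32:32:8:0        yes 4400.0000 800.0000 4402.0991
 31    0      0   23 47:47:11:0       yes 4400.0000 800.0000 4399.6338
"""


def group_cpus_by_max_mhz(lscpu_text):
    """Map each MAXMHZ value to the sorted list of CPU ids reporting it."""
    lines = [line for line in lscpu_text.splitlines() if line.strip()]
    header = lines[0].split()
    cpu_col = header.index("CPU")
    max_col = header.index("MAXMHZ")
    groups = defaultdict(list)
    for line in lines[1:]:
        fields = line.split()
        if len(fields) <= max(cpu_col, max_col):
            continue  # skip short or malformed rows
        try:
            max_mhz = float(fields[max_col])
        except ValueError:
            continue  # skip rows without a numeric maximum frequency
        groups[max_mhz].append(int(fields[cpu_col]))
    return {mhz: sorted(cpus) for mhz, cpus in groups.items()}


def slowest_cpus(lscpu_text):
    """Return the CPU ids in the lowest MAXMHZ group (assumed to be E cores)."""
    groups = group_cpus_by_max_mhz(lscpu_text)
    return groups[min(groups)]
```

On a real host the text would come from running `lscpu --all --extended`, and one of the returned CPU ids could then be passed to `taskset -c`, as in the commands above.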

danog avatar Dec 09 '24 11:12 danog

@danog thanks for your reply! I have three parts for a reply at the moment:

First, I find it interesting to see that the maximum CPU frequency output of lscpu is not stable across calls and so e.g. lscpu --all --extended | tail -n+2 | awk '{print $1, $9}' | sort -k2 -r is not stable with regard to what the fastest CPU thread is at querying time.

Second, I am seriously wondering if that is a design flaw and potentially a bug on the GCC side: they should be checking what the CPU can do, not what the current process can do, no? If they made it so to give the user a control channel, an environment variable or command line option would probably be a better channel. Would you be interested in taking this to GCC upstream?

that might also cause issues, since resolve-march-native omits some flags...

Third, I did not understand that^^ last part. Could you elaborate?

hartwork avatar Dec 09 '24 15:12 hartwork

@danog PS: Thanks for raising awareness about the topic with me, reporting this as an issue was a good move :+1:

hartwork avatar Dec 09 '24 15:12 hartwork

I am seriously wondering if that is a design flaw and potentially a bug on the GCC side

I thought the same thing, and even sent an email to get an account on the bugtracker, but they haven't replied to that one yet :D

But on the other hand, it kind of isn't GCC's fault, or at least not entirely (even if it is weird that the same GCC command running at different times with -march=native will produce different binaries, saying goodbye to reproducible builds :); and in a way, there is still a way to pin GCC to a certain CPU, just indirectly...

Third, I did not understand that^^ last part. Could you elaborate?

Kind of a side thought: I wasn't sure whether the result of _get_march_explicit_flag_set can also vary depending on the CPU it is run on, but thinking about it again, it should not, so ignore that :)

danog avatar Dec 09 '24 16:12 danog

I thought the same thing, and even sent an email to get an account on the bugtracker, but they haven't replied to that one yet :D

https://gcc.gnu.org/PR111768

Ultimately, it's pretty harmless, just irritating, given the information is used in very few places and it can never result in wrong-code anyway.

thesamesam avatar Jan 05 '25 00:01 thesamesam

@thesamesam thanks for the link! :+1:

hartwork avatar Jan 05 '25 00:01 hartwork

For the moment we tell people to run

for x in $(seq 0 $(( $(nproc) - 1 )) ); do taskset -c $x resolve-march-native; done | sort | uniq -c

to get info per core and just use the common part of all of them.
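
Taking "the common part" of the per-core outputs could be sketched like this. It is a minimal illustration under stated assumptions: `common_flags` is a hypothetical helper, not part of resolve-march-native, and the two sample strings are the outputs for CPU 0 and CPU 31 quoted earlier in this thread. Flags whose value differs between cores, such as `--param=l1-cache-size`, are dropped rather than reduced to the smaller value.

```python
# Outputs of `taskset -c 0 ...` and `taskset -c 31 ...` as quoted in this thread.
P_CORE = (
    "-march=alderlake -mabm -mno-cldemote -mno-kl -mno-sgx -mno-widekl -mshstk "
    "--param=l1-cache-line-size=64 --param=l1-cache-size=48 --param=l2-cache-size=36864"
)
E_CORE = (
    "-march=alderlake -mabm -mno-cldemote -mno-kl -mno-sgx -mno-widekl -mshstk "
    "--param=l1-cache-line-size=64 --param=l1-cache-size=32 --param=l2-cache-size=36864"
)


def common_flags(outputs):
    """Return the flags present in every per-core output, in first-seen order."""
    flag_lists = [out.split() for out in outputs if out.strip()]
    if not flag_lists:
        return []
    shared = set(flag_lists[0]).intersection(*flag_lists[1:])
    return [flag for flag in flag_lists[0] if flag in shared]


if __name__ == "__main__":
    print(" ".join(common_flags([P_CORE, E_CORE])))
```

Dropping a differing `--param` leaves GCC to fall back to its own default for that parameter; whether that is preferable to choosing the smallest reported value is a judgment call this sketch does not settle.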

negril avatar Apr 10 '25 09:04 negril

@negril that makes good sense. Let me give back nproc --ignore=1.

hartwork avatar Apr 10 '25 13:04 hartwork

@negril update: I notice now that that's maybe the one place where nproc --ignore=1 makes things worse (while usually it helps against ending up with zero by mistake on single core machines). Never mind then 😃

hartwork avatar Apr 10 '25 14:04 hartwork