ch4/shm: fix performance degradation on Sapphire Rapids with Intel Compiler
Pull Request Description
@zhenggb72 reported a performance degradation in inter-NUMA SHM communication compared to v4.2.2. The issue was introduced in #7046. MPICH v4.2.2 achieved ~14us latency for a 64KB message, but latency rose to ~23us after #7046. Setting MPIR_CVAR_CH4_SHM_POSIX_TOPO_ENABLE=true solves the problem.
The issue is caused by a change in the memcpy operation. v4.2.2 uses non-temporal stores for both intra-NUMA and inter-NUMA SHM communication. This was changed to regular memcpy when topo-aware is disabled. The change was made because non-temporal stores have higher latency for intra-NUMA communication on some architectures (see the Milan results below). Non-temporal stores also have higher latency for small inter-NUMA messages on other architectures (skylake, cascade, icelake).
After more comprehensive testing on broadwell, skylake, cascade, icelake, sapphire rapids, and milan, I think it is probably OK to enable topo-aware by default, which would yield better performance on sapphire rapids and milan. Detailed numbers can be found in the following comments.
~~This PR also addresses another source of performance degradation observed when building with the Intel compiler. PR#7074 consolidated the SSE2 and AVX related optimization options into MPL's configure because only MPL explicitly uses them. This change showed no performance degradation with the GNU compiler. With Intel compilers, however, it does result in some performance degradation (see below). Therefore, we should add them back to the main configure. Currently, the main configure checks for the availability of SSE2, AVX and AVX512F, and adds them to CFLAGS. The MPL configure will further check for the specific instructions that are used in MPL.~~ This is superseded by #7152.
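The copy-selection behavior described above can be summarized in a small model. This is only an illustrative sketch in Python, not MPICH code: the function name `select_copy` and the string labels are hypothetical, and the topo-enabled branch (regular memcpy for intra-NUMA, non-temporal stores for inter-NUMA) is an assumption inferred from the description rather than something it states outright.

```python
def select_copy(version: str, topo_enabled: bool, same_numa: bool) -> str:
    """Return which copy routine a SHM transfer would use (illustrative model).

    version      -- "v4.2.2" or "main" (main = after #7046); hypothetical labels
    topo_enabled -- value of MPIR_CVAR_CH4_SHM_POSIX_TOPO_ENABLE
    same_numa    -- True if sender and receiver share a NUMA node
    """
    if version == "v4.2.2":
        # v4.2.2 used non-temporal stores for intra- and inter-NUMA alike.
        return "nontemporal"
    if not topo_enabled:
        # After #7046, topo-aware disabled falls back to regular memcpy.
        return "memcpy"
    # ASSUMPTION: with topo-aware enabled, only inter-NUMA copies use
    # non-temporal stores; intra-NUMA copies use regular memcpy.
    return "memcpy" if same_numa else "nontemporal"


for same_numa in (True, False):
    where = "intra-NUMA" if same_numa else "inter-NUMA"
    print(where, "topo off:", select_copy("main", False, same_numa),
          "| topo on:", select_copy("main", True, same_numa))
```

Under this model, the reported regression corresponds to the inter-NUMA case with topo-aware disabled, where main uses regular memcpy while v4.2.2 used non-temporal stores.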
All raw numbers: 2024-shm_bench-arch_comparison.xlsx
Author Checklist
- [ ] Provide Description
  Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
- [ ] Commits Follow Good Practice
  Commits are self-contained and do not do two things at once.
  Commit message is of the form: `module: short description`
  Commit message explains what's in the commit.
- [ ] Passes All Tests
  Whitespace checker. Warnings test. Additional tests via comments.
- [ ] Contribution Agreement
  For non-Argonne authors, check contribution agreement.
  If necessary, request an explicit comment from your company's PR approval manager.
Inter-NUMA, Sunspot, icx (latency in us)
| message size (bytes) | main | stable422 | this PR |
|---|---|---|---|
| 1 | 1.52 | 1.46 | 1.46 |
| 2 | 1.52 | 1.46 | 1.46 |
| 4 | 1.52 | 1.46 | 1.46 |
| 8 | 1.52 | 1.46 | 1.45 |
| 16 | 1.52 | 1.46 | 1.46 |
| 32 | 1.52 | 1.46 | 1.46 |
| 64 | 1.69 | 1.66 | 1.68 |
| 128 | 1.71 | 1.67 | 1.75 |
| 256 | 1.74 | 1.7 | 1.79 |
| 512 | 1.77 | 1.84 | 1.92 |
| 1024 | 1.84 | 1.91 | 1.99 |
| 2048 | 2.2 | 2.02 | 2.14 |
| 4096 | 3.03 | 2.39 | 2.4 |
| 8192 | 4.42 | 3.13 | 2.9 |
| 16384 | 9.67 | 6.9 | 6.24 |
| 32768 | 13.99 | 9.49 | 8.29 |
| 65536 | 23.08 | 14.18 | 12.04 |
| 131072 | 39.61 | 23.52 | 19.64 |
| 262144 | 70.3 | 41.65 | 34.58 |
| 524288 | 128.28 | 85.07 | 64.73 |
| 1048576 | 247.39 | 213.87 | 170.08 |
| 2097152 | 524 | 438.87 | 294.68 |
| 4194304 | 1086.84 | 822.32 | 564.34 |
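To put the 64KB row in perspective, the latencies can be expressed relative to stable422. The snippet below is a small sketch that only does this arithmetic on the three values copied from the 65536-byte row of the table above; the helper name `relative_to` is made up for illustration.

```python
# Latencies in microseconds for 65536-byte messages, copied from the
# inter-NUMA table above.
latency_64k = {"main": 23.08, "stable422": 14.18, "this PR": 12.04}


def relative_to(baseline: str, latencies: dict) -> dict:
    """Divide every latency by the baseline's latency."""
    base = latencies[baseline]
    return {name: value / base for name, value in latencies.items()}


ratios = relative_to("stable422", latency_64k)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the stable422 latency")
```

At this message size, main takes roughly 1.6 times as long as stable422, while this PR is somewhat faster than stable422.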
Intra-NUMA, Sunspot, icx (latency in us)
| message size (bytes) | main | stable422 | this PR |
|---|---|---|---|
| 1 | 0.84 | 0.76 | 0.79 |
| 2 | 0.84 | 0.76 | 0.79 |
| 4 | 0.84 | 0.76 | 0.79 |
| 8 | 0.84 | 0.76 | 0.79 |
| 16 | 0.84 | 0.76 | 0.79 |
| 32 | 0.84 | 0.76 | 0.8 |
| 64 | 0.9 | 0.84 | 0.87 |
| 128 | 0.92 | 0.85 | 0.88 |
| 256 | 0.95 | 0.88 | 0.91 |
| 512 | 0.99 | 1.04 | 0.95 |
| 1024 | 1.05 | 1.09 | 1.01 |
| 2048 | 1.27 | 1.26 | 1.22 |
| 4096 | 1.75 | 1.59 | 1.69 |
| 8192 | 2.52 | 2.24 | 2.56 |
| 16384 | 5.49 | 4.75 | 5.75 |
| 32768 | 7.37 | 6.92 | 7.19 |
| 65536 | 11.5 | 11.08 | 11.1 |
| 131072 | 19.48 | 18.92 | 18.3 |
| 262144 | 33.49 | 33.32 | 30.51 |
| 524288 | 65.09 | 58.02 | 65.58 |
| 1048576 | 135.79 | 117.81 | 139.51 |
| 2097152 | 261.86 | 258.09 | 216.93 |
| 4194304 | 539.92 | 482.54 | 477.29 |
AVX in MPICH configure vs AVX in MPL configure, intra-NUMA, Sunspot, icx (latency in us)
| message size (bytes) | MPICH configure | MPL configure |
|---|---|---|
| 1 | 0.84 | 0.62 |
| 2 | 0.84 | 0.6 |
| 4 | 0.84 | 0.6 |
| 8 | 0.84 | 0.6 |
| 16 | 0.84 | 0.59 |
| 32 | 0.84 | 0.6 |
| 64 | 0.9 | 0.61 |
| 128 | 0.92 | 0.68 |
| 256 | 0.95 | 0.72 |
| 512 | 0.99 | 0.8 |
| 1024 | 1.05 | 0.94 |
| 2048 | 1.27 | 1.11 |
| 4096 | 1.75 | 1.51 |
| 8192 | 2.52 | 2.22 |
| 16384 | 5.49 | 5.2 |
| 32768 | 7.37 | 7.06 |
| 65536 | 11.5 | 11.08 |
| 131072 | 19.48 | 18.96 |
| 262144 | 33.49 | 32.38 |
| 524288 | 65.09 | 55.74 |
| 1048576 | 135.79 | 108.36 |
| 2097152 | 261.86 | 236.55 |
| 4194304 | 539.92 | 506.49 |
test:mpich/ch4/ofi
Inter-NUMA, TOPO enabled vs disabled, Intel compiler. Note the higher latency for small messages (< 4KB) on skylake through icelake with TOPO disabled.
| broadwell | skylake | cascade | icelake | sapphire rapids | |
|---|---|---|---|---|---|
| topo enabled | topo disabled | topo enabled | topo disabled | topo enabled | |
| 1 | 0.97 | 1.02 | 1.22 | 1.03 | 1.16 |
| 2 | 0.92 | 0.97 | 1.21 | 1.02 | 1.16 |
| 4 | 0.89 | 0.93 | 1.22 | 1.02 | 1.15 |
| 8 | 0.87 | 0.91 | 1.21 | 1.01 | 1.15 |
| 16 | 0.86 | 0.89 | 1.21 | 1.02 | 1.14 |
| 32 | 0.85 | 0.88 | 1.23 | 1.02 | 1.17 |
| 64 | 0.92 | 0.97 | 1.31 | 1.13 | 1.24 |
| 128 | 1 | 1 | 1.51 | 1.25 | 1.42 |
| 256 | 1.03 | 1.05 | 1.47 | 1.31 | 1.29 |
| 512 | 1.14 | 1.1 | 1.75 | 1.34 | 1.59 |
| 1024 | 1.24 | 1.24 | 1.85 | 1.41 | 1.63 |
| 2048 | 1.4 | 1.64 | 2.03 | 1.76 | 1.75 |
| 4096 | 1.7 | 2.2 | 2.19 | 2.4 | 1.94 |
| 8192 | 2.43 | 3.4 | 2.85 | 3.8 | 2.54 |
| 16384 | 5.53 | 8.27 | 6.34 | 8.33 | 5.85 |
| 32768 | 8.36 | 12.47 | 8.38 | 11.82 | 7.85 |
| 65536 | 13 | 20.79 | 12.4 | 19.35 | 11.64 |
| 131072 | 23.88 | 38.23 | 20.75 | 33.09 | 19.61 |
| 262144 | 44.44 | 72.53 | 37.51 | 58.67 | 35.56 |
| 524288 | 81.54 | 139.62 | 69.87 | 110.64 | 66.8 |
| 1048576 | 160.64 | 273.81 | 136.96 | 221.46 | 131.96 |
| 2097152 | 320.4 | 542.7 | 256.93 | 452.78 | 247.88 |
| 4194304 | 623.81 | 1080.62 | 500.76 | 916.99 | 493.26 |
test: mpich/ch4/ofi
I am adding one unrelated commit for Jenkins testing. Jenkins refuses to run the tests because the original commit only has comment changes. I will remove it after the tests pass.
Tests are clean. Rebased onto main and removed the unrelated commit.
test: mpich/ch4/ofi