ch4/shm: Support topology-aware SHM communication
Pull Request Description
This PR adds support for detecting node topology and for runtime selection between regular and stream memcpy. The PR has four parts:
- Fixing the existing MPMC queue and adding an SSE2 version of stream copy for x86 architectures without AVX support.
- Adding node topology detection. Each rank queries its topology with hwloc, and the results are exchanged with an allgather during SHM_post_init. Each rank then calculates its topology distance to every other local rank.
- Making the SHMEM pool support multiple free queues with different queue types. The number-of-cells-per-process parameter is split evenly among all free queues. The cell allocation function takes a queue id to indicate which free queue is used for the allocation. No change is needed to cell deallocation.
- Adding topology-aware SHM communication. A new boolean environment variable, MPIR_CVAR_CH4_SHM_POSIX_TOPO_ENABLE, is added to control this feature. It is disabled by default.
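The even split of the per-process cell count across free queues, with a queue id passed at allocation time, can be sketched as follows. This is a minimal illustration and not the MPICH code: `pool_t`, `pool_init`, `pool_alloc`, and `pool_free` are hypothetical names, the free queues are plain single-threaded stacks of cell indices rather than real MPMC queues, and any remainder from the integer division is simply dropped. Recording a home queue per cell is one assumed way to keep deallocation free of a queue id argument.

```c
#define MAX_QUEUES 4
#define MAX_CELLS 64

/* Hypothetical pool; none of these names are MPICH identifiers. */
typedef struct {
    int num_queues;
    int cells_per_queue;
    int home[MAX_CELLS];                   /* home[c] = free queue owning cell c */
    int free_cells[MAX_QUEUES][MAX_CELLS]; /* per-queue stack of free cell indices */
    int num_free[MAX_QUEUES];              /* number of free cells in each queue */
} pool_t;

/* Split cells_per_proc evenly among num_queues free queues. */
static int pool_init(pool_t *p, int cells_per_proc, int num_queues)
{
    if (num_queues < 1 || num_queues > MAX_QUEUES ||
        cells_per_proc < num_queues || cells_per_proc > MAX_CELLS)
        return -1;

    p->num_queues = num_queues;
    p->cells_per_queue = cells_per_proc / num_queues;   /* remainder dropped (assumption) */

    int cell = 0;
    for (int q = 0; q < num_queues; q++) {
        p->num_free[q] = 0;
        for (int i = 0; i < p->cells_per_queue; i++, cell++) {
            p->home[cell] = q;
            p->free_cells[q][p->num_free[q]++] = cell;
        }
    }
    return 0;
}

/* Allocation names the free queue to draw from; returns -1 if that queue is empty. */
static int pool_alloc(pool_t *p, int queue_id)
{
    if (queue_id < 0 || queue_id >= p->num_queues || p->num_free[queue_id] == 0)
        return -1;
    return p->free_cells[queue_id][--p->num_free[queue_id]];
}

/* Deallocation takes no queue id: the cell goes back to its home queue. */
static void pool_free(pool_t *p, int cell)
{
    int q = p->home[cell];
    p->free_cells[q][p->num_free[q]++] = cell;
}
```

With 8 cells and 2 queues, each queue owns 4 cells; exhausting one queue does not affect allocation from the other.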
The benefit of using a receiver-side free queue and non-temporal (stream) stores is faster inter-NUMA (or inter-L3-cache) communication. However, non-temporal stores hurt performance for ranks that share the same L3 cache. Making the POSIX SHM module topology-aware lets us choose non-temporal stores only for inter-NUMA and inter-cache communication.
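The selection logic described here can be sketched as follows: classify a peer by topology distance, then use non-temporal stores only when the two ranks do not share an L3 cache. This is a minimal sketch under stated assumptions, not the PR's implementation: `topo_info_t`, `topo_distance`, `stream_copy`, and `topo_copy` are hypothetical names, the `numa_id` and `l3_id` fields stand in for whatever the topology query returns (with `l3_id` assumed unique across the node), and a disabled feature is assumed to mean plain `memcpy`. The stream path falls back to `memcpy` when SSE2 is unavailable or the destination is not 16-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical per-rank topology record gathered from all local ranks. */
typedef struct {
    int numa_id;
    int l3_id;   /* assumed unique across the whole node */
} topo_info_t;

typedef enum { DIST_SAME_L3, DIST_INTER_L3, DIST_INTER_NUMA } topo_dist_t;

static topo_dist_t topo_distance(topo_info_t a, topo_info_t b)
{
    if (a.numa_id != b.numa_id)
        return DIST_INTER_NUMA;
    if (a.l3_id != b.l3_id)
        return DIST_INTER_L3;
    return DIST_SAME_L3;
}

/* Copy with non-temporal stores where possible, otherwise plain memcpy. */
static void stream_copy(void *dst, const void *src, size_t n)
{
#if defined(__SSE2__)
    if (((uintptr_t) dst & 15) == 0) {   /* _mm_stream_si128 needs an aligned destination */
        char *d = dst;
        const char *s = src;
        size_t i = 0;
        for (; i + 16 <= n; i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *) (s + i));
            _mm_stream_si128((__m128i *) (d + i), v);
        }
        _mm_sfence();                    /* order the streamed stores before later stores */
        memcpy(d + i, s + i, n - i);     /* copy the tail that is shorter than 16 bytes */
        return;
    }
#endif
    memcpy(dst, src, n);
}

/* Returns 1 if the stream path was chosen, 0 if regular memcpy was used. */
static int topo_copy(void *dst, const void *src, size_t n,
                     topo_dist_t dist, int topo_enable)
{
    int use_stream = topo_enable && dist != DIST_SAME_L3;
    if (use_stream)
        stream_copy(dst, src, n);
    else
        memcpy(dst, src, n);
    return use_stream;
}
```

Keeping the decision in one small function means the distance only has to be computed once per peer, after which each copy is a single branch.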
Performance
Setup: ALCF Sunspot, Intel Sapphire Rapids x 2. Intel ICX compiler
Configuration: --with-device=ch4:ofi --disable-fortran --disable-romio CC=clang CXX=clang++ --enable-fast=all,O3,avx
osu_latency (latency in µs)
| Msg Size (bytes) | TOPO_ENABLE=0, intraNUMA | TOPO_ENABLE=0, interNUMA | TOPO_ENABLE=1, intraNUMA | TOPO_ENABLE=1, interNUMA |
|---|---|---|---|---|
| 1 | 0.58 | 1.09 | 0.59 | 1.05 |
| 2 | 0.58 | 1.09 | 0.58 | 1.04 |
| 4 | 0.58 | 1.08 | 0.59 | 1.04 |
| 8 | 0.58 | 1.09 | 0.58 | 1.06 |
| 16 | 0.57 | 1.09 | 0.58 | 1.06 |
| 32 | 0.58 | 1.1 | 0.59 | 1.09 |
| 64 | 0.63 | 1.18 | 0.64 | 1.18 |
| 128 | 0.68 | 1.3 | 0.68 | 1.34 |
| 256 | 0.71 | 1.4 | 0.72 | 1.44 |
| 512 | 0.75 | 1.47 | 0.74 | 1.72 |
| 1024 | 0.85 | 1.64 | 0.84 | 1.79 |
| 2048 | 1.11 | 2.05 | 1.08 | 1.9 |
| 4096 | 1.49 | 2.56 | 1.48 | 2.29 |
| 8192 | 2.24 | 3.5 | 2.23 | 2.86 |
| 16384 | 4.91 | 8.32 | 5.05 | 6.46 |
| 32768 | 6.73 | 12.62 | 6.7 | 8.64 |
| 65536 | 10.88 | 21.19 | 10.62 | 13.5 |
| 131072 | 18.26 | 37.17 | 18 | 23.24 |
| 262144 | 30.99 | 67.79 | 30.86 | 42.08 |
| 524288 | 56.46 | 123.46 | 50.95 | 80.75 |
| 1048576 | 135.11 | 239.86 | 104.53 | 166.44 |
| 2097152 | 261.07 | 505.44 | 224.46 | 348.91 |
| 4194304 | 500.43 | 1039.73 | 503 | 698.85 |
| 8388608 | 1048.04 | 2120.46 | 1071.69 | 1401.71 |
| 16777216 | 2133 | 4243.53 | 2186.18 | 2757.43 |
| 33554432 | 5696.85 | 8535.5 | 5632.25 | 5425.01 |
| 67108864 | 12425.72 | 17536.66 | 12387.28 | 14177.96 |
| 134217728 | 25695.55 | 36289.39 | 25594.35 | 31576.1 |
osu_bw (bandwidth in MB/s)
| Msg Size (bytes) | TOPO_ENABLE=0, intraNUMA | TOPO_ENABLE=0, interNUMA | TOPO_ENABLE=1, intraNUMA | TOPO_ENABLE=1, interNUMA |
|---|---|---|---|---|
| 1 | 4.39 | 1.72 | 4.45 | 1.99 |
| 2 | 9.07 | 3.61 | 8.73 | 3.95 |
| 4 | 17.93 | 6.95 | 17.84 | 7.75 |
| 8 | 36.08 | 14.49 | 35.68 | 16.81 |
| 16 | 71.85 | 28.75 | 71.08 | 31.76 |
| 32 | 144.03 | 56.92 | 142.21 | 70.94 |
| 64 | 244.67 | 121.6 | 240.56 | 132.9 |
| 128 | 462.41 | 227.82 | 471.39 | 307.58 |
| 256 | 906.91 | 533.24 | 897.68 | 604.37 |
| 512 | 1716.13 | 892.81 | 1692.81 | 476.3 |
| 1024 | 3167.69 | 1102.69 | 2966.03 | 861.42 |
| 2048 | 4342.1 | 1584.68 | 4403.79 | 1689.73 |
| 4096 | 5208.5 | 2443.06 | 5273.76 | 3184.43 |
| 8192 | 6663.38 | 3523.11 | 6759.22 | 5681.12 |
| 16384 | 6260.44 | 2816 | 6377.65 | 5384.48 |
| 32768 | 6464.07 | 3129.67 | 6573 | 6558.38 |
| 65536 | 6791.71 | 3280.46 | 6856.04 | 7306.86 |
| 131072 | 6971.54 | 3433.65 | 7052.65 | 7899.33 |
| 262144 | 7085.24 | 3511.14 | 7201.67 | 8034.9 |
| 524288 | 7493.31 | 3606.28 | 7393.26 | 8237.02 |
| 1048576 | 8012.75 | 4109.33 | 8279.48 | 9055.35 |
| 2097152 | 7353.98 | 4009.31 | 7022.39 | 10202.08 |
| 4194304 | 7094.37 | 4002.18 | 5989.3 | 8275.19 |
| 8388608 | 7358.1 | 4004.89 | 6111 | 8852 |
| 16777216 | 7391.18 | 4007.08 | 6509.18 | 9186.05 |
| 33554432 | 7389.94 | 4007.51 | 7239.55 | 9348.05 |
| 67108864 | 6081.54 | 4003.38 | 6206.13 | 9389.08 |
| 134217728 | 5568.93 | 3872.73 | 5611.83 | 5941.02 |
Author Checklist
- [ ] Provide Description
  Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
- [ ] Commits Follow Good Practice
  Commits are self-contained and do not do two things at once.
  Commit message is of the form: module: short description
  Commit message explains what's in the commit.
- [ ] Passes All Tests
  Whitespace checker. Warnings test. Additional tests via comments.
- [ ] Contribution Agreement
  For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your companies PR approval manager.
test:mpich/ch4/ofi
Looking at the manyrma2 failure.
I think it is just a timeout caused by the active message path being too slow. It is not related to your PR.
Could you update the performance measurements, since the last ones used the MPMC free queue in the TOPO_ENABLE=0 case?
Also, please add commit message bodies in addition to the single title line. Each commit message should explain the change: both why and what.
I have updated the commit messages and rebased on the latest main. The results in the PR are also updated.
test:mpich/ch4/ofi
Thanks! The differences between enabling and disabling TOPO_ENABLE in the intraNUMA case are attributable to noise, right?
Yes, those differences are noise.
test:mpich/ch4/ofi
@raffenet I have redone the topology detection code using the MPIR_hwtopo APIs.
test:mpich/ch4/ofi
Tests were clean. Rebased on latest main.