
ch4/shm: Support topology-aware SHM communication

Open yfguo opened this issue 1 year ago • 2 comments

Pull Request Description

This PR adds support for detecting node topology and for selecting between regular and stream memcpy at runtime. The PR has four parts:

  1. Fix the existing MPMC queue and add an SSE2 version of stream copy for x86 architectures without AVX support.
  2. Add node topology detection. Each rank queries its topology with hwloc, and the results are exchanged with an allgather during SHM_post_init. Each rank then calculates its topology distance to each of its local ranks.
  3. Make the SHMEM pool support multiple free queues with different queue types. The number-of-cells-per-proc parameter is split evenly among all free queues. The cell allocation function takes a queue ID that indicates which free queue to allocate from. Cell deallocation needs no change.
  4. Add topology-aware SHM communication. A new boolean environment variable, MPIR_CVAR_CH4_SHM_POSIX_TOPO_ENABLE, controls this feature. It is disabled by default.
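
As a rough illustration of part 3, the sketch below splits a per-process cell budget evenly across several free queues and allocates by queue ID. All names here (`cell_pool_t`, `pool_init`, `cell_alloc`, `cell_free`) are hypothetical and are not the MPICH data structures. The queues are plain single-threaded stacks rather than the MPMC queues the PR refers to, and the assumption that each cell remembers its home queue is only one possible reading of why deallocation needs no change.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch; not the MPICH implementation. */
#define MAX_QUEUES 4
#define MAX_CELLS 64

typedef struct {
    int cells[MAX_CELLS];       /* stack of free cell indices */
    int count;                  /* number of free cells in this queue */
} free_queue_t;

typedef struct {
    free_queue_t queues[MAX_QUEUES];
    int num_queues;
    int owner[MAX_CELLS];       /* cell index -> queue it belongs to */
} cell_pool_t;

/* Split cells_per_proc evenly among num_queues free queues.
 * Any remainder cells are simply left unused in this sketch. */
static void pool_init(cell_pool_t *pool, int cells_per_proc, int num_queues)
{
    assert(num_queues > 0 && num_queues <= MAX_QUEUES);
    assert(cells_per_proc >= 0 && cells_per_proc <= MAX_CELLS);
    pool->num_queues = num_queues;
    int per_queue = cells_per_proc / num_queues;
    int next = 0;
    for (int q = 0; q < num_queues; q++) {
        pool->queues[q].count = 0;
        for (int i = 0; i < per_queue; i++) {
            pool->owner[next] = q;
            pool->queues[q].cells[pool->queues[q].count++] = next++;
        }
    }
}

/* Allocate from the free queue selected by queue_id.
 * Returns a cell index, or -1 if that queue is empty. */
static int cell_alloc(cell_pool_t *pool, int queue_id)
{
    assert(queue_id >= 0 && queue_id < pool->num_queues);
    free_queue_t *q = &pool->queues[queue_id];
    return q->count > 0 ? q->cells[--q->count] : -1;
}

/* Deallocation takes no queue ID: the cell returns to its home queue. */
static void cell_free(cell_pool_t *pool, int cell)
{
    free_queue_t *q = &pool->queues[pool->owner[cell]];
    q->cells[q->count++] = cell;
}
```

With 8 cells and 2 queues, each queue starts with 4 cells; a cell taken with `cell_alloc(&pool, 1)` goes back to queue 1 on `cell_free` without the caller naming the queue.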

The benefit of using a receiver-side free queue and non-temporal (stream) stores is faster inter-NUMA (or inter-L3-cache) communication. However, non-temporal stores hurt ranks that share the same L3 cache. Making the POSIX shared-memory path topology aware lets us choose non-temporal stores only for inter-NUMA and inter-cache communication.
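
The selection logic described above might look roughly like the following sketch. The type and function names (`rank_topo_t`, `topo_distance`, `use_stream_copy`, `stream_copy`, `copy_to_cell`) are invented for illustration and do not come from the PR, and the sketch assumes L3 cache IDs are unique across the node. The stream copy uses the SSE2 intrinsics `_mm_loadu_si128`, `_mm_stream_si128`, and `_mm_sfence`, and falls back to plain `memcpy` when SSE2 is unavailable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical topology description of one local rank. */
typedef struct {
    int numa_id;                /* NUMA node the rank is bound to */
    int l3_id;                  /* L3 cache the rank is bound to (assumed node-unique) */
} rank_topo_t;

typedef enum { DIST_SAME_L3, DIST_SAME_NUMA, DIST_INTER_NUMA } topo_dist_t;

/* Classify how far apart two local ranks are. */
static topo_dist_t topo_distance(rank_topo_t a, rank_topo_t b)
{
    if (a.numa_id != b.numa_id)
        return DIST_INTER_NUMA;
    return (a.l3_id != b.l3_id) ? DIST_SAME_NUMA : DIST_SAME_L3;
}

/* Stream stores only pay off when the peer does not share our L3 cache. */
static int use_stream_copy(int topo_enabled, topo_dist_t dist)
{
    return topo_enabled && dist != DIST_SAME_L3;
}

/* Copy n bytes using non-temporal stores where SSE2 is available. */
static void stream_copy(void *dst, const void *src, size_t n)
{
#if defined(__SSE2__)
    char *d = (char *) dst;
    const char *s = (const char *) src;
    /* _mm_stream_si128 needs a 16-byte aligned destination: copy the head normally. */
    size_t head = (16 - ((uintptr_t) d & 15)) & 15;
    if (head > n)
        head = n;
    memcpy(d, s, head);
    d += head; s += head; n -= head;
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *) s);
        _mm_stream_si128((__m128i *) d, v);
        d += 16; s += 16; n -= 16;
    }
    _mm_sfence();               /* make the streamed stores globally visible */
    memcpy(d, s, n);            /* remaining tail bytes */
#else
    memcpy(dst, src, n);        /* portable fallback */
#endif
}

/* Pick the copy routine from the topology distance to the peer rank. */
static void copy_to_cell(void *dst, const void *src, size_t n,
                         int topo_enabled, topo_dist_t dist)
{
    if (use_stream_copy(topo_enabled, dist))
        stream_copy(dst, src, n);
    else
        memcpy(dst, src, n);
}
```

In this sketch `topo_enabled` stands in for the value of MPIR_CVAR_CH4_SHM_POSIX_TOPO_ENABLE; with it set to 0 every copy takes the regular `memcpy` path regardless of distance.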

Performance

Setup: ALCF Sunspot, 2 x Intel Sapphire Rapids, Intel ICX compiler.

Configuration: --with-device=ch4:ofi --disable-fortran --disable-romio CC=clang CXX=clang++ --enable-fast=all,O3,avx

osu_latency

Latency in microseconds; message size in bytes. The first two data columns are with TOPO_ENABLE=0, the last two with TOPO_ENABLE=1.

Msg Size intraNUMA interNUMA intraNUMA interNUMA
1 0.58 1.09 0.59 1.05
2 0.58 1.09 0.58 1.04
4 0.58 1.08 0.59 1.04
8 0.58 1.09 0.58 1.06
16 0.57 1.09 0.58 1.06
32 0.58 1.1 0.59 1.09
64 0.63 1.18 0.64 1.18
128 0.68 1.3 0.68 1.34
256 0.71 1.4 0.72 1.44
512 0.75 1.47 0.74 1.72
1024 0.85 1.64 0.84 1.79
2048 1.11 2.05 1.08 1.9
4096 1.49 2.56 1.48 2.29
8192 2.24 3.5 2.23 2.86
16384 4.91 8.32 5.05 6.46
32768 6.73 12.62 6.7 8.64
65536 10.88 21.19 10.62 13.5
131072 18.26 37.17 18 23.24
262144 30.99 67.79 30.86 42.08
524288 56.46 123.46 50.95 80.75
1048576 135.11 239.86 104.53 166.44
2097152 261.07 505.44 224.46 348.91
4194304 500.43 1039.73 503 698.85
8388608 1048.04 2120.46 1071.69 1401.71
16777216 2133 4243.53 2186.18 2757.43
33554432 5696.85 8535.5 5632.25 5425.01
67108864 12425.72 17536.66 12387.28 14177.96
134217728 25695.55 36289.39 25594.35 31576.1

osu_bw

Bandwidth in MB/s; message size in bytes. The first two data columns are with TOPO_ENABLE=0, the last two with TOPO_ENABLE=1.

Msg Size intraNUMA interNUMA intraNUMA interNUMA
1 4.39 1.72 4.45 1.99
2 9.07 3.61 8.73 3.95
4 17.93 6.95 17.84 7.75
8 36.08 14.49 35.68 16.81
16 71.85 28.75 71.08 31.76
32 144.03 56.92 142.21 70.94
64 244.67 121.6 240.56 132.9
128 462.41 227.82 471.39 307.58
256 906.91 533.24 897.68 604.37
512 1716.13 892.81 1692.81 476.3
1024 3167.69 1102.69 2966.03 861.42
2048 4342.1 1584.68 4403.79 1689.73
4096 5208.5 2443.06 5273.76 3184.43
8192 6663.38 3523.11 6759.22 5681.12
16384 6260.44 2816 6377.65 5384.48
32768 6464.07 3129.67 6573 6558.38
65536 6791.71 3280.46 6856.04 7306.86
131072 6971.54 3433.65 7052.65 7899.33
262144 7085.24 3511.14 7201.67 8034.9
524288 7493.31 3606.28 7393.26 8237.02
1048576 8012.75 4109.33 8279.48 9055.35
2097152 7353.98 4009.31 7022.39 10202.08
4194304 7094.37 4002.18 5989.3 8275.19
8388608 7358.1 4004.89 6111 8852
16777216 7391.18 4007.08 6509.18 9186.05
33554432 7389.94 4007.51 7239.55 9348.05
67108864 6081.54 4003.38 6206.13 9389.08
134217728 5568.93 3872.73 5611.83 5941.02

Author Checklist

  • [ ] Provide Description: Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • [ ] Commits Follow Good Practice: Commits are self-contained and do not do two things at once. Commit message is of the form "module: short description". Commit message explains what's in the commit.
  • [ ] Passes All Tests: Whitespace checker. Warnings test. Additional tests via comments.
  • [ ] Contribution Agreement: For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.

yfguo avatar Jun 29 '24 05:06 yfguo

test:mpich/ch4/ofi

yfguo avatar Jun 29 '24 05:06 yfguo

test:mpich/ch4/ofi

yfguo avatar Jun 29 '24 15:06 yfguo

test:mpich/ch4/ofi

hzhou avatar Jul 01 '24 16:07 hzhou

Looking at the manyrma2 failure.

yfguo avatar Jul 01 '24 22:07 yfguo

> Looking at the manyrma2 failure.

I think it is just a timeout caused by the active message path being too slow. It is not related to your PR.

Could you update the performance measurements, since the last set used the MPMC free queue in the TOPO_ENABLE=0 case?

Also, please add commit messages in addition to the single-line title. Each commit message should explain the changes: both what and why.

hzhou avatar Jul 01 '24 22:07 hzhou

I have updated the commit messages and rebased on the latest main. The results in the PR are also updated.

yfguo avatar Jul 02 '24 02:07 yfguo

test:mpich/ch4/ofi

yfguo avatar Jul 02 '24 02:07 yfguo

> I have updated the commit messages and rebased on the latest main. The results in the PR are also updated.

Thanks! The intraNUMA differences between TOPO_ENABLE enabled and disabled are attributable to noise, right?

hzhou avatar Jul 02 '24 02:07 hzhou

> > I have updated the commit messages and rebased on the latest main. The results in the PR are also updated.

> Thanks! The intraNUMA differences between TOPO_ENABLE enabled and disabled are attributable to noise, right?

Yes. Those are noise.

yfguo avatar Jul 02 '24 15:07 yfguo

test:mpich/ch4/ofi

yfguo avatar Jul 02 '24 20:07 yfguo

@raffenet I have redone the topology detection code using the MPIR_hwtopo APIs.

yfguo avatar Jul 02 '24 20:07 yfguo

test:mpich/ch4/ofi

yfguo avatar Jul 02 '24 22:07 yfguo

Tests were clean. Rebased on latest main.

yfguo avatar Jul 03 '24 02:07 yfguo