
mca/coll: Add any radix k for alltoall bruck algorithm

Open jiaxiyan opened this issue 1 year ago • 16 comments

This method extends ompi_coll_base_alltoall_intra_bruck to handle any radix k.
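For reviewers who want a quick picture of what "any radix k" means here, below is a minimal sketch of the radix-k Bruck communication pattern. This is my own illustration; the names and direction conventions are not taken from the actual ompi_coll_base_alltoall_intra_bruck code.

```c
/* Illustrative sketch of the radix-k Bruck pattern for P ranks (not the actual
 * ompi_coll_base_alltoall_intra_bruck code).  In the phase where k^i is the
 * current digit weight, the blocks whose i-th base-k digit equals d are
 * exchanged with the rank at distance d * k^i, one send-recv round per digit. */
#include <stdio.h>

static void bruck_radix_k_pattern(int me, int P, int k)
{
    for (long kpow = 1; kpow < P; kpow *= k) {           /* one phase per base-k digit   */
        for (long d = 1; d < k && d * kpow < P; d++) {    /* one round per digit value    */
            int sendto   = (int)((me + d * kpow) % P);
            int recvfrom = (int)((me - d * kpow + P) % P);
            /* blocks b with (b / kpow) % k == d move in this round */
            printf("rank %d: weight %ld, digit %ld -> send to %d, recv from %d\n",
                   me, kpow, d, sendto, recvfrom);
        }
    }
}

int main(void)
{
    bruck_radix_k_pattern(0, 16, 4);   /* e.g. 16 ranks, radix 4: 2 phases x 3 rounds */
    return 0;
}
```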

jiaxiyan avatar Apr 05 '24 23:04 jiaxiyan

I was planning to review it after they figure out the issue with replacing the blocking ompi_coll_base_sendrecv with non-blocking communications.

bosilca avatar Apr 29 '24 19:04 bosilca

@bosilca I don't follow this comment:

when the upper and lower levels of the communication stack both try to generate multi-rail traffic, things can't go well.

Are you suggesting that collectives shouldn't be making PML isend/irecv directly?

lrbison avatar May 07 '24 17:05 lrbison

No, I was wondering how a multi-rail-enabled collective component (which will indeed post multiple isend requests) would interact with a multi-rail-enabled low-level communication library (which will split all large messages across multiple lanes)?

bosilca avatar May 08 '24 08:05 bosilca

@jiaxiyan Please fix the build.

wenduwan avatar May 16 '24 23:05 wenduwan

We decided to keep the blocking sendrecv because its performance is slightly better than the non-blocking version (see test data below). We have to wait for each isend to complete before we can copy the newly received data from recvbuf back to tmpbuf, so the non-blocking version does not show much improvement.
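For context, here is a rough sketch of the per-round dependency described above. It is illustrative only, assuming the usual tmpbuf/sendbuf/recvbuf staging; it is not the literal code in this PR.

```c
/* Rough sketch of the per-round dependency discussed above (not the literal
 * code from this PR).  The received blocks must be copied from recvbuf back
 * into tmpbuf before the next round can pack from tmpbuf again, so an
 * isend/irecv pair still has to be waited on inside the round and little
 * overlap is gained over a blocking sendrecv. */
#include <mpi.h>
#include <string.h>

static void bruck_round_sketch(char *tmpbuf, char *sendbuf, char *recvbuf,
                               int round_bytes, int sendto, int recvfrom,
                               MPI_Comm comm)
{
    /* 1. pack the blocks selected for this round from tmpbuf into sendbuf
     *    (placeholder: a real round packs strided blocks, not a prefix) */
    memcpy(sendbuf, tmpbuf, (size_t)round_bytes);

    /* 2. exchange with this round's peers (blocking variant shown) */
    MPI_Sendrecv(sendbuf, round_bytes, MPI_BYTE, sendto, 0,
                 recvbuf, round_bytes, MPI_BYTE, recvfrom, 0,
                 comm, MPI_STATUS_IGNORE);

    /* 3. copy the received blocks back into tmpbuf; step 1 of the next round
     *    cannot start until this copy (and hence the receive) has completed */
    memcpy(tmpbuf, recvbuf, (size_t)round_bytes);
}
```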

OSU MPI All-to-All Latency (on 16 hpc7g instances, 1024 ranks; message size in bytes, average latency in µs)

ompi_coll_base_sendrecv

| Message size | k = 2 | k = 4 | k = 8 | k = 16 | k = 32 |
|---:|---:|---:|---:|---:|---:|
| 1 | 146.11 | 201.24 | 290.62 | 426.96 | 847.91 |
| 2 | 141.23 | 201.72 | 295.76 | 431.24 | 844.11 |
| 4 | 146.65 | 207.2 | 290.8 | 438.15 | 847.54 |
| 8 | 154.93 | 211.39 | 299.4 | 439.36 | 859.98 |
| 16 | 280.06 | 222.51 | 319.33 | 445.43 | 860.43 |
| 32 | 348.79 | 474.72 | 355.11 | 500.69 | 872.44 |
| 64 | 544.38 | 603.89 | 779.21 | 576.34 | 916.67 |
| 128 | 1059.73 | 932.32 | 1054.84 | 1323.35 | 968.53 |
| 256 | 21981.76 | 1822.93 | 1765.59 | 1875.43 | 2212.38 |
| 512 | 24050.4 | 3830.67 | 3719 | 3146.81 | 3036.46 |
| 1024 | 33873.93 | 8328.25 | 7612.07 | 6758.61 | 5348.24 |
| 2048 | 59072.64 | 18545.69 | 15535.72 | 14898.76 | 10101.14 |
| 4096 | 80879.25 | 37798.9 | 33115.21 | 27867.95 | 19722.63 |
| 8192 | 137532.69 | 80270.04 | 63793.19 | 61073.62 | 38873.77 |

isend/irecv

| Message size | k = 2 | k = 4 | k = 8 | k = 16 | k = 32 |
|---:|---:|---:|---:|---:|---:|
| 1 | 138.96 | 204.74 | 291.22 | 428.89 | 847.15 |
| 2 | 141.24 | 205.62 | 290.97 | 430.1 | 848.62 |
| 4 | 145 | 204.63 | 295.42 | 438.76 | 851.13 |
| 8 | 156.82 | 211.84 | 298.93 | 444.47 | 865.3 |
| 16 | 279.29 | 223.89 | 324.6 | 449.07 | 862.43 |
| 32 | 350.09 | 479.04 | 357.18 | 502.31 | 880.38 |
| 64 | 540.79 | 601.15 | 799.61 | 583.52 | 932.97 |
| 128 | 1032.12 | 958.54 | 1087.27 | 1326.93 | 975.33 |
| 256 | 21958.61 | 1850.11 | 1753.25 | 1856.64 | 2210.36 |
| 512 | 23963.49 | 3837.31 | 3731 | 3241.24 | 3051.9 |
| 1024 | 33949.96 | 8348.78 | 7598.88 | 6802.96 | 5308.44 |
| 2048 | 59115.35 | 18520.23 | 15578.96 | 14939.42 | 10035.18 |
| 4096 | 80790.02 | 37814.46 | 33030 | 27908.15 | 19750.56 |
| 8192 | 137384.46 | 80003.8 | 63834.89 | 61007.52 | 38989.87 |

jiaxiyan avatar May 16 '24 23:05 jiaxiyan

Bruck's method is designed to work in an upper_comm containing only the local/node leaders, where every rank can communicate directly with any other rank (in the upper comm) and every pair of ranks is equally distant (details in this paper: https://ieeexplore.ieee.org/document/642949).

Bruck's method optimizes latency at the cost of bandwidth, so you are not likely to see a performance improvement for mid/large messages by using a larger k with ranks_per_node limited to 1. It should instead be used as the internode step of a node-aware approach to the all-to-all communication. The procedure would be: gather and combine the data on each node at the local/node leader; run the internode all-to-all among the leaders using Bruck's method; then scatter the data from the local/node leaders back to the other ranks on the same node.
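A rough skeleton of that three-step structure is sketched below, under strong simplifications: a single leader per node, placeholder counts that ignore the ppn scaling and block reordering a real implementation needs, and node_comm/leader_comm assumed to have been built elsewhere (e.g. with MPI_Comm_split_type and MPI_Comm_split). The names are illustrative, not from any existing component.

```c
/* Skeleton of the node-aware all-to-all described above.  Illustrative only:
 * counts and data layout are placeholders; a real implementation must scale
 * the gathered/exchanged counts by ppn and reorder blocks per destination. */
#include <mpi.h>

static void node_aware_alltoall_skeleton(const void *sbuf, void *rbuf,
                                         int count, MPI_Datatype dtype,
                                         void *gathered, void *exchanged,
                                         MPI_Comm node_comm, MPI_Comm leader_comm)
{
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* 1. gather (and combine) each node's contribution at the node leader */
    MPI_Gather(sbuf, count, dtype, gathered, count, dtype, 0, node_comm);

    /* 2. internode all-to-all among node leaders only, where Bruck's
     *    "equally distant ranks" assumption holds */
    if (0 == node_rank && MPI_COMM_NULL != leader_comm) {
        MPI_Alltoall(gathered, count, dtype, exchanged, count, dtype, leader_comm);
    }

    /* 3. scatter the exchanged data from the leader back to the node's ranks */
    MPI_Scatter(exchanged, count, dtype, rbuf, count, dtype, 0, node_comm);
}
```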

juntangc avatar May 17 '24 01:05 juntangc

@bosilca @lrbison Can you give this another review? Thanks!

jiaxiyan avatar May 21 '24 23:05 jiaxiyan

There should be some additions to the user guide accompanying this change set that explain the algorithm and the MCA tuning parameters; otherwise it will be practically useless to anyone but you. The capability for user tuning is important and should not be neglected.

burlen avatar May 28 '24 15:05 burlen

Note: this recent paper describes a similar algorithm. The authors derive a mathematical model for selecting k and found that a default value of k = sqrt(P), with P the communicator size, works well.
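If such a default were adopted, the selection itself is trivial. A hypothetical helper, not part of this PR, might look like the following; the clamping is my own assumption.

```c
/* Hypothetical default-radix helper based on the k ~ sqrt(P) heuristic above
 * (not part of this PR; the lower bound of 2 is my own assumption). */
#include <math.h>

static int default_bruck_radix(int comm_size)
{
    int k = (int)lround(sqrt((double)comm_size));
    if (k < 2) {
        k = 2;          /* a radix below 2 makes no sense for Bruck */
    }
    return k;
}
```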

burlen avatar May 28 '24 15:05 burlen

@bosilca I experimented with your proposal to pack/unpack data into a contiguous buffer instead of creating a datatype in https://github.com/open-mpi/ompi/commit/1b7e2546332dcceac94ff19beb86134a65d772e9. I profiled the run and saw that the extra local copy actually creates a lot of overhead and causes a regression for small message sizes compared to creating and committing the datatype (see data below). So I think we should use the previous commit.
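For readers following along, here is a hedged sketch of the two per-round strategies being compared. It is illustrative, not the literal code in either commit; the function names and parameters are mine.

```c
/* Sketch of the two per-round send strategies compared above (illustrative,
 * not the literal code from either commit).  The blocks sent in one Bruck
 * round are strided through tmpbuf in a regular pattern, so they can be
 * described either by a derived datatype or by an explicit pack. */
#include <mpi.h>
#include <string.h>

/* Option A: describe the strided blocks with an indexed datatype and send
 * tmpbuf directly -- no extra local copy, but a type create/commit per round. */
static void send_with_datatype(char *tmpbuf, int nblocks, const int displs[],
                               int block_bytes, int peer, MPI_Comm comm)
{
    MPI_Datatype round_type;
    MPI_Type_create_indexed_block(nblocks, block_bytes, displs, MPI_BYTE, &round_type);
    MPI_Type_commit(&round_type);
    MPI_Send(tmpbuf, 1, round_type, peer, 0, comm);
    MPI_Type_free(&round_type);
}

/* Option B: copy the strided blocks into a contiguous staging buffer first --
 * a simpler send path, but one extra local copy of the round's payload. */
static void send_with_pack(char *tmpbuf, char *packbuf, int nblocks,
                           const int displs[], int block_bytes,
                           int peer, MPI_Comm comm)
{
    for (int i = 0; i < nblocks; i++) {
        memcpy(packbuf + (size_t)i * block_bytes,
               tmpbuf + displs[i], (size_t)block_bytes);
    }
    MPI_Send(packbuf, nblocks * block_bytes, MPI_BYTE, peer, 0, comm);
}
```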

OSU MPI All-to-All Latency (on 16 hpc7g instances, 1024 ranks; message size in bytes, average latency in µs)

| Message size | create datatype | pack into contiguous buffer |
|---:|---:|---:|
| 1 | 146.11 | 197.28 |
| 2 | 141.23 | 196.42 |
| 4 | 146.65 | 197.07 |
| 8 | 154.93 | 195.45 |
| 16 | 280.06 | 209.76 |
| 32 | 348.79 | 309.00 |
| 64 | 544.38 | 535.37 |
| 128 | 1059.73 | 879.60 |
| 256 | 21981.76 | 23591.02 |
| 512 | 24050.4 | 3805.94 |
| 1024 | 33873.93 | 32402.23 |
| 2048 | 59072.64 | 53316.38 |
| 4096 | 80879.25 | 119598.57 |
| 8192 | 137532.69 | 158336.34 |

jiaxiyan avatar Jun 18 '24 22:06 jiaxiyan

Thanks for taking a look. I have to say your results are concerning, because the extra overhead is terribly high, so high that I wonder how one small memcpy (for the 1-byte message, for example) could add 30ms. Would you mind sharing the code with me (maybe privately by email)?

bosilca avatar Jun 19 '24 15:06 bosilca

@bosilca Did you have a chance to look at https://github.com/open-mpi/ompi/commit/1b7e2546332dcceac94ff19beb86134a65d772e9?

jiaxiyan avatar Jun 24 '24 17:06 jiaxiyan

@bosilca Since this PR is backward compatible and does not cause a performance regression for the default k=2 case, I wonder if we can merge it and work on improvements as a separate project? Thanks!

wenduwan avatar Jul 09 '24 13:07 wenduwan

From a code perspective this looks OK. But I still question the need for this from a logical perspective. If I look at the results posted earlier here, k=2 behaves great until we hit 128 bytes, when it jumps up by 20x, which is what justifies the use of a larger radix. Clearly that jump in performance is problematic, and it alone leads to tuned's poor performance. Where is that jump coming from?

bosilca avatar Jul 09 '24 22:07 bosilca

@bosilca Makes sense. Jessie has rotated to another high-priority project. I will take a look at the latency jump.

wenduwan avatar Jul 10 '24 22:07 wenduwan

@bosilca I did an analysis of the algorithm with EFA. The weird performance bumps at some message sizes are the result of two factors (see the sketch after this list):

  • The number of send-recv rounds at each radix: a higher radix requires more rounds, because each round carries less data.
  • The datatype size (message size) used in each round.
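To make the first factor concrete, here is a small sketch that reproduces the round counts and full-round volumes in the tables below. This is my own arithmetic derived from the radix-k structure, not code from the patch.

```c
/* Arithmetic behind the tables below (my own sketch from the radix-k Bruck
 * structure, not code from the patch): total send-recv rounds and the payload
 * of a "full" round for P ranks, radix k, and msg_bytes per destination. */
#include <stdio.h>

static void bruck_round_breakdown(long P, long k, long msg_bytes)
{
    long rounds = 0;
    for (long kpow = 1; kpow < P; kpow *= k) {
        long digits = k - 1;                 /* digit values 1..k-1 per phase */
        if (digits * kpow >= P) {
            digits = (P - 1) / kpow;         /* truncated last phase */
        }
        rounds += digits;
    }
    /* a full round moves P/k blocks of msg_bytes; the truncated last phase
     * moves larger block groups, hence the mixed "bytes per round" entries */
    printf("P=%4ld k=%2ld: %2ld rounds, %6ld bytes in a full round\n",
           P, k, rounds, (P / k) * msg_bytes);
}

int main(void)
{
    const long ranks[]   = { 1024, 512, 256 };   /* 16 nodes x 64 / 32 / 16 ppn */
    const long radices[] = { 2, 4, 8, 16 };

    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 4; j++) {
            bruck_round_breakdown(ranks[i], radices[j], 128);
        }
    }
    return 0;
}
```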

Below is an example on 16 nodes with 64 cores each. I chose:

  • ppn: 16, 32, 64
  • radix: 2, 4, 8, 16
128-byte MPI_Alltoall

16 nodes x 64 ppn:

| Radix | Total rounds | Bytes per round |
|---:|---:|---|
| 2 | 10 | 65,536 |
| 4 | 15 | 32,768 |
| 8 | 22 | 16,384 (rounds 1-21), 65,536 (round 22) |
| 16 | 33 | 8,192 (rounds 1-30), 32,768 (rounds 31-33) |

16 nodes x 32 ppn:

| Radix | Total rounds | Bytes per round |
|---:|---:|---|
| 2 | 9 | 32,768 |
| 4 | 13 | 16,384 (rounds 1-12), 32,768 (round 13) |
| 8 | 21 | 8,192 |
| 16 | 31 | 4,096 (rounds 1-30), 32,768 (round 31) |

16 nodes x 16 ppn:

| Radix | Total rounds | Bytes per round |
|---:|---:|---|
| 2 | 8 | 16,384 |
| 4 | 12 | 8,192 |
| 8 | 17 | 4,096 (rounds 1-14), 8,192 (rounds 15-17) |
| 16 | 30 | 2,048 |

The non-linear latency we observed is the result of the network layer, i.e. libfabric + the EFA network. EFA uses a datagram protocol (SRD), which exposes an MTU of 8 KiB. For larger messages, libfabric has to switch to other mechanisms:

  • Segmentation into < 8K messages, sent one by one; or,
  • Using the EFA device's RDMA capability to initiate a one-sided operation. On this platform this can be done with an RDMA read from the receiver side; note that this is a rendezvous process.

This means we cannot apply the theoretical latency formula in the Bruck paper, at least for EFA, due to the non-linear data transfer behavior.
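As a rough illustration of that non-linearity, the sketch below relates the per-round payloads from the tables above to the 8 KiB MTU, assuming plain segmentation only; the actual libfabric/EFA protocol selection also involves eager/rendezvous switch points that are not modeled here.

```c
/* Rough illustration of how the per-round payloads above relate to the 8 KiB
 * SRD MTU.  Assumes plain segmentation; real libfabric/EFA behavior also
 * depends on protocol switch points (eager vs. rendezvous / RDMA read). */
#include <stdio.h>

int main(void)
{
    const long mtu = 8 * 1024;
    const long round_bytes[] = { 2048, 4096, 8192, 16384, 32768, 65536 };
    const int  n = (int)(sizeof(round_bytes) / sizeof(round_bytes[0]));

    for (int i = 0; i < n; i++) {
        long segs = (round_bytes[i] + mtu - 1) / mtu;
        printf("%6ld bytes per round -> %ld SRD segment(s) if segmented\n",
               round_bytes[i], segs);
    }
    return 0;
}
```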

wenduwan avatar Jul 12 '24 22:07 wenduwan