mca/coll: Add any radix k for alltoall bruck algorithm
This method extends ompi_coll_base_alltoall_intra_bruck to handle any radix k.
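For readers unfamiliar with the generalization, here is a rough standalone sketch of the communication schedule a radix-k Bruck exchange follows (illustrative only; the variable names and structure are mine and do not mirror the actual ompi_coll_base_alltoall_intra_bruck code):

```c
/*
 * Illustrative standalone sketch of the radix-k Bruck communication
 * schedule (NOT the actual OMPI implementation).
 *
 * After the initial local rotation, block indices are read in base k:
 * in the round for digit position j (weight k^j) and digit value d,
 * rank r sends every block whose j-th base-k digit equals d to rank
 * (r + d*k^j) mod P and receives the matching blocks from
 * (r - d*k^j) mod P.  Radix 2 reduces to the classic Bruck algorithm.
 */
#include <stdio.h>
#include <stdlib.h>

static void print_bruck_schedule(int rank, int P, int k)
{
    int round = 0;
    for (long weight = 1; weight < P; weight *= k) {      /* digit position */
        for (long d = 1; d < k && d * weight < P; d++) {  /* digit value */
            int sendto   = (int)((rank + d * weight) % P);
            int recvfrom = (int)(((rank - d * weight) % P + P) % P);
            int nblocks  = 0;
            for (int b = 0; b < P; b++) {                 /* blocks carried */
                if ((b / weight) % k == d) nblocks++;
            }
            round++;
            printf("rank %d round %2d: send %d blocks to %d, recv from %d\n",
                   rank, round, nblocks, sendto, recvfrom);
        }
    }
    printf("rank %d: %d rounds in total\n", rank, round);
}

int main(int argc, char **argv)
{
    int P = (argc > 1) ? atoi(argv[1]) : 16;  /* communicator size */
    int k = (argc > 2) ? atoi(argv[2]) : 4;   /* radix */
    print_bruck_schedule(0, P, k);
    return 0;
}
```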
I was planning to review it after they figure out the issue with replacing the blocking ompi_coll_base_sendrecv with non-blocking communications.
@bosilca I don't follow this comment:
> when the upper and lower levels of the communication stack both try to generate multi-rail traffic, things can't go well.
Are you suggesting that collectives shouldn't be making PML isend/irecv directly?
No, I was wondering how a multi-rail-enabled collective component (which will indeed post multiple isend requests) would interact with a multi-rail-enabled low-level communication library (which will split all large messages across multiple lanes).
@jiaxiyan Please fix the build.
We decided to keep the blocking sendrecv because the performance is slightly better than the non-blocking version (see the test data below). We need to wait for each isend to complete before we can copy the new data back from recvbuf to tmpbuf, so the non-blocking version does not show much improvement.
OSU MPI All-to-All Latency in µs (on 16 hpc7g nodes, 1024 ranks)
ompi_coll_base_sendrecv
| Message size (bytes) | k = 2 | k = 4 | k = 8 | k = 16 | k = 32 |
|---|---|---|---|---|---|
| 1 | 146.11 | 201.24 | 290.62 | 426.96 | 847.91 |
| 2 | 141.23 | 201.72 | 295.76 | 431.24 | 844.11 |
| 4 | 146.65 | 207.2 | 290.8 | 438.15 | 847.54 |
| 8 | 154.93 | 211.39 | 299.4 | 439.36 | 859.98 |
| 16 | 280.06 | 222.51 | 319.33 | 445.43 | 860.43 |
| 32 | 348.79 | 474.72 | 355.11 | 500.69 | 872.44 |
| 64 | 544.38 | 603.89 | 779.21 | 576.34 | 916.67 |
| 128 | 1059.73 | 932.32 | 1054.84 | 1323.35 | 968.53 |
| 256 | 21981.76 | 1822.93 | 1765.59 | 1875.43 | 2212.38 |
| 512 | 24050.4 | 3830.67 | 3719 | 3146.81 | 3036.46 |
| 1024 | 33873.93 | 8328.25 | 7612.07 | 6758.61 | 5348.24 |
| 2048 | 59072.64 | 18545.69 | 15535.72 | 14898.76 | 10101.14 |
| 4096 | 80879.25 | 37798.9 | 33115.21 | 27867.95 | 19722.63 |
| 8192 | 137532.69 | 80270.04 | 63793.19 | 61073.62 | 38873.77 |
isend/irecv
| Message size (bytes) | k = 2 | k = 4 | k = 8 | k = 16 | k = 32 |
|---|---|---|---|---|---|
| 1 | 138.96 | 204.74 | 291.22 | 428.89 | 847.15 |
| 2 | 141.24 | 205.62 | 290.97 | 430.1 | 848.62 |
| 4 | 145 | 204.63 | 295.42 | 438.76 | 851.13 |
| 8 | 156.82 | 211.84 | 298.93 | 444.47 | 865.3 |
| 16 | 279.29 | 223.89 | 324.6 | 449.07 | 862.43 |
| 32 | 350.09 | 479.04 | 357.18 | 502.31 | 880.38 |
| 64 | 540.79 | 601.15 | 799.61 | 583.52 | 932.97 |
| 128 | 1032.12 | 958.54 | 1087.27 | 1326.93 | 975.33 |
| 256 | 21958.61 | 1850.11 | 1753.25 | 1856.64 | 2210.36 |
| 512 | 23963.49 | 3837.31 | 3731 | 3241.24 | 3051.9 |
| 1024 | 33949.96 | 8348.78 | 7598.88 | 6802.96 | 5308.44 |
| 2048 | 59115.35 | 18520.23 | 15578.96 | 14939.42 | 10035.18 |
| 4096 | 80790.02 | 37814.46 | 33030 | 27908.15 | 19750.56 |
| 8192 | 137384.46 | 80003.8 | 63834.89 | 61007.52 | 38989.87 |
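To illustrate the dependency described above, here is a toy sketch of one exchange round (my own minimal example with ring peers and MPI_BYTE payloads, not the PR's code): the copy back into tmpbuf has to wait for both requests, so the non-blocking pair plus the copy is serialized much like the blocking sendrecv.

```c
/*
 * Toy sketch of a single exchange round (ring peers, MPI_BYTE payloads;
 * NOT the PR's actual code).  The memcpy back into tmpbuf cannot start
 * until both the receive and the send of the round have completed.
 */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1024;
    char *tmpbuf  = malloc(count);
    char *recvbuf = malloc(count);
    memset(tmpbuf, rank, count);

    int sendto   = (rank + 1) % size;
    int recvfrom = (rank - 1 + size) % size;

    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, count, MPI_BYTE, recvfrom, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(tmpbuf,  count, MPI_BYTE, sendto,   0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    /* only now can the received blocks be staged for the next round */
    memcpy(tmpbuf, recvbuf, count);

    free(tmpbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```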
Bruck's method is designed to work in an upper communicator (upper_comm) that contains only the local/node leaders, where every rank can communicate directly with any other rank in that communicator and every pair of ranks is equally distant (details in this paper: https://ieeexplore.ieee.org/document/642949).
Bruck's method optimizes latency by sacrificing bandwidth, so you are not likely to see a performance improvement for mid/large messages by using a larger k and limiting ranks_per_node to 1. It should instead be used inside a node-aware approach to the all-to-all, for the internode exchange only. The procedure looks like this: the data on each node is gathered and combined by the local/node leader; the leaders run the internode all-to-all using Bruck's method; the leaders then scatter the data back to the other ranks on the same node.
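To make the upper_comm idea concrete, here is a minimal sketch (illustrative only, not existing OMPI code; the gather/scatter phases are left as comments) of how the node leaders and the leader-only communicator could be formed:

```c
/*
 * Minimal sketch of forming node-local and leader-only communicators for
 * the node-aware scheme described above (NOT existing OMPI code).
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* ranks sharing a node */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* local rank 0 on each node becomes the node leader; only leaders get a
     * valid upper communicator, everyone else passes MPI_UNDEFINED */
    MPI_Comm upper_comm;
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &upper_comm);

    if (MPI_COMM_NULL != upper_comm) {
        int nleaders;
        MPI_Comm_size(upper_comm, &nleaders);
        printf("rank %d is a node leader (upper comm size %d)\n",
               world_rank, nleaders);
        /* phase 1: gather + combine local data on node_comm
         * phase 2: internode all-to-all (Bruck) on upper_comm
         * phase 3: scatter results back on node_comm */
        MPI_Comm_free(&upper_comm);
    }

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```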
@bosilca @lrbison Can you give this another review? Thanks!
There should be some additions to the user guide accompanying this change set that explain the algorithm and the MCA tuning parameters. Otherwise it will be practically useless to anyone but you. The capability for user tuning is important and should not be neglected.
Note: this recent paper described a similar algorithm. They derive a mathematical model for the selection of k, and found that a default value of k = sqrt(P), with P the communicator size, works well.
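For reference, the trade-off behind picking k can also be seen in the usual alpha-beta estimate for a generalized Bruck exchange (a standard back-of-the-envelope model, assuming P is a power of k; this is not the model derived in that paper):

$$
T(k) \;\approx\; (k-1)\,\log_k P \cdot \alpha \;+\; \frac{k-1}{k}\,P\,m\,\log_k P \cdot \beta
$$

where m is the per-pair message size, alpha the per-message startup cost, and beta the per-byte transfer time. A larger k shrinks the log_k P factor in the bandwidth term while inflating the (k-1) log_k P round count in the latency term, which is why an intermediate value such as k = sqrt(P) can be a reasonable default.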
@bosilca I experimented with your proposal to pack/unpack data into a contiguous buffer instead of creating a datatype in https://github.com/open-mpi/ompi/commit/1b7e2546332dcceac94ff19beb86134a65d772e9. I profiled the run and saw that the extra local copy actually creates a lot of overhead and causes a regression for small message sizes compared to creating and committing the datatype (see the data below). So I think we should use the previous commit.
OSU MPI All-to-All Latency in µs (on 16 hpc7g nodes, 1024 ranks)
| Message size (bytes) | create datatype | pack into contiguous buffer |
|---|---|---|
| 1 | 146.11 | 197.28 |
| 2 | 141.23 | 196.42 |
| 4 | 146.65 | 197.07 |
| 8 | 154.93 | 195.45 |
| 16 | 280.06 | 209.76 |
| 32 | 348.79 | 309.00 |
| 64 | 544.38 | 535.37 |
| 128 | 1059.73 | 879.60 |
| 256 | 21981.76 | 23591.02 |
| 512 | 24050.4 | 3805.94 |
| 1024 | 33873.93 | 32402.23 |
| 2048 | 59072.64 | 53316.38 |
| 4096 | 80879.25 | 119598.57 |
| 8192 | 137532.69 | 158336.34 |
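For context, here is a toy sketch of the two approaches being compared (block sizes and displacements are made up, and the send/recv calls are left as comments; this is not the code from the commit above): (a) describe the round's non-contiguous blocks with a committed indexed datatype, versus (b) pack them into a contiguous staging buffer, which adds one memcpy per block on each side.

```c
/*
 * Toy sketch of the two options being compared (made-up sizes/offsets,
 * send calls left as comments; NOT the code from the commit).
 */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const int blocklen = 128;                 /* bytes per block */
    const int nblocks  = 4;                   /* blocks selected this round */
    int displs[4] = { 1 * 128, 3 * 128, 5 * 128, 7 * 128 };  /* byte offsets */
    char *tmpbuf = calloc(8, blocklen);

    /* (a) indexed datatype: no local copy, the PML walks the layout */
    MPI_Datatype new_ddt;
    MPI_Type_create_indexed_block(nblocks, blocklen, displs, MPI_BYTE, &new_ddt);
    MPI_Type_commit(&new_ddt);
    /* ...ompi_coll_base_sendrecv(tmpbuf, 1, new_ddt, sendto, ...) */
    MPI_Type_free(&new_ddt);

    /* (b) pack into a contiguous staging buffer: one extra memcpy per
     * block on the send side (plus the mirror unpack on the receiver) */
    char *packbuf = malloc((size_t)nblocks * blocklen);
    for (int i = 0; i < nblocks; i++) {
        memcpy(packbuf + (size_t)i * blocklen, tmpbuf + displs[i], blocklen);
    }
    /* ...ompi_coll_base_sendrecv(packbuf, nblocks * blocklen, MPI_BYTE, sendto, ...) */

    free(packbuf);
    free(tmpbuf);
    MPI_Finalize();
    return 0;
}
```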
Thanks for taking a look. I have to say your results are concerning, because the extra overhead is terribly high, so high that I wonder how one small memcpy (for the 1-byte message, as an example) could add 30 ms. Would you mind sharing the code with me (maybe privately by email)?
@bosilca Did you have a chance to look at https://github.com/open-mpi/ompi/commit/1b7e2546332dcceac94ff19beb86134a65d772e9?
@bosilca Since this PR is backward compatible and does not cause a performance regression for the default k=2 case, I wonder if we can merge it and work on improvements as a separate project? Thanks!
From a code perspective this looks OK. But I still question the need for this from a logical perspective. If I look at the results posted earlier here, the k=2 case behaves great until we hit 128 bytes, when it jumps up about 20x, justifying the use of a larger radix. Clearly that jump in performance is problematic, and it alone leads to the poor tuned performance. Where is that jump coming from?
@bosilca Makes sense. Jessie has rotated to another high-priority project. I will take a look at the latency jump.
@bosilca I did an analysis of the algorithm with EFA. The weird performance bumps at some message sizes are the result of two factors:
- The number of send-recv rounds at a given radix: a higher radix requires more rounds, because each round carries less data.
- The datatype size (message size) sent in each round.
Below is an example on 16 nodes with 64 cores each, for a 128-byte MPI_Alltoall. I chose:
- ppn: 16, 32, 64
- radix: 2, 4, 8, 16
128-byte MPI_Alltoall, 16 nodes x 64 ppn (1024 ranks)
| Radix | Total rounds | Bytes per round |
|---|---|---|
| 2 | 10 | 65,536 |
| 4 | 15 | 32,768 |
| 8 | 22 | 16,384 (rounds 1-21), 65,536 (round 22) |
| 16 | 33 | 8,192 (rounds 1-30), 32,768 (rounds 31-33) |

128-byte MPI_Alltoall, 16 nodes x 32 ppn (512 ranks)
| Radix | Total rounds | Bytes per round |
|---|---|---|
| 2 | 9 | 32,768 |
| 4 | 13 | 16,384 (rounds 1-12), 32,768 (round 13) |
| 8 | 21 | 8,192 |
| 16 | 31 | 4,096 (rounds 1-30), 32,768 (round 31) |

128-byte MPI_Alltoall, 16 nodes x 16 ppn (256 ranks)
| Radix | Total rounds | Bytes per round |
|---|---|---|
| 2 | 8 | 16,384 |
| 4 | 12 | 8,192 |
| 8 | 17 | 4,096 (rounds 1-14), 8,192 (rounds 15-17) |
| 16 | 30 | 2,048 |
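For anyone who wants to re-derive the numbers above, here is a small standalone helper (my own sketch, not code from the PR) that enumerates the (digit position, digit value) rounds of the radix-k schedule and the bytes each one carries:

```c
/*
 * Standalone helper (illustrative sketch) that reproduces the round counts
 * and per-round byte volumes in the tables above: one round per (base-k
 * digit position, digit value) pair, carrying all 128-byte blocks whose
 * digit at that position equals that value.
 */
#include <stdio.h>

/* number of blocks b in [0, P) whose base-k digit at weight w equals d */
static long blocks_with_digit(long P, long w, long k, long d)
{
    long cycle = w * k;
    long full  = (P / cycle) * w;       /* complete cycles: w blocks each */
    long rem   = P % cycle - d * w;     /* partial trailing cycle */
    if (rem < 0) rem = 0;
    if (rem > w) rem = w;
    return full + rem;
}

int main(void)
{
    const long msg     = 128;           /* bytes per rank pair */
    const long nodes   = 16;
    const long ppns[]  = { 64, 32, 16 };
    const long radix[] = { 2, 4, 8, 16 };

    for (int i = 0; i < 3; i++) {
        long P = nodes * ppns[i];       /* communicator size */
        printf("--- %ld ranks (16 nodes x %ld ppn) ---\n", P, ppns[i]);
        for (int j = 0; j < 4; j++) {
            long k = radix[j], rounds = 0;
            printf("k=%-2ld bytes per round:", k);
            for (long w = 1; w < P; w *= k) {
                for (long d = 1; d < k && d * w < P; d++) {
                    rounds++;
                    printf(" %ld", blocks_with_digit(P, w, k, d) * msg);
                }
            }
            printf("  (%ld rounds)\n", rounds);
        }
        printf("\n");
    }
    return 0;
}
```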
The non-linear latency we observed is the result of the network layer, i.e. libfabric + the EFA network. EFA uses a datagram protocol (SRD), which exposes an MTU of 8 KiB. For larger messages, libfabric has to switch to other mechanisms:
- Segment the message into < 8 KiB chunks and send them one by one; or,
- Use the EFA device's RDMA capability to initiate a one-sided operation. On this platform this can be done by an RDMA read from the receiver side; note that this is a rendezvous process.
This means we cannot apply the theoretical latency formula in the Bruck paper, at least for EFA, due to the non-linear data transfer behavior.