mca/coll: Add any radix k for alltoall bruck algorithm
This method extends ompi_coll_base_alltoall_intra_bruck to handle any radix k.
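For readers unfamiliar with the generalization, here is a rough standalone sketch of the communication schedule a radix-k Bruck exchange follows (illustrative only; the variable names and structure are mine and do not mirror the actual ompi_coll_base_alltoall_intra_bruck code):

```c
/*
 * Illustrative standalone sketch of the radix-k Bruck communication
 * schedule (NOT the actual OMPI implementation).
 *
 * After the initial local rotation, block indices are read in base k:
 * in the round for digit position j (weight k^j) and digit value d,
 * rank r sends every block whose j-th base-k digit equals d to rank
 * (r + d*k^j) mod P and receives the matching blocks from
 * (r - d*k^j) mod P.  Radix 2 reduces to the classic Bruck algorithm.
 */
#include <stdio.h>
#include <stdlib.h>

static void print_bruck_schedule(int rank, int P, int k)
{
    int round = 0;
    for (long weight = 1; weight < P; weight *= k) {      /* digit position */
        for (long d = 1; d < k && d * weight < P; d++) {  /* digit value */
            int sendto   = (int)((rank + d * weight) % P);
            int recvfrom = (int)(((rank - d * weight) % P + P) % P);
            int nblocks  = 0;
            for (int b = 0; b < P; b++) {                 /* blocks carried */
                if ((b / weight) % k == d) nblocks++;
            }
            round++;
            printf("rank %d round %2d: send %d blocks to %d, recv from %d\n",
                   rank, round, nblocks, sendto, recvfrom);
        }
    }
    printf("rank %d: %d rounds in total\n", rank, round);
}

int main(int argc, char **argv)
{
    int P = (argc > 1) ? atoi(argv[1]) : 16;  /* communicator size */
    int k = (argc > 2) ? atoi(argv[2]) : 4;   /* radix */
    print_bruck_schedule(0, P, k);
    return 0;
}
```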
I was planning to review it after they figure out the issue with replacing the blocking ompi_coll_base_sendrecv with non-blocking communications.
@bosilca I don't follow this comment:
> when the upper and lower levels of the communication stack both try to generate multi-rail traffic, things can't go well.
Are you suggesting that collectives shouldn't be making PML isend/irecv directly?
No, I was wondering how a multi-rail-enabled collective component (which will indeed post multiple isend requests) would interact with a multi-rail-enabled low-level communication library (which will split all large messages across multiple lanes).
@jiaxiyan Please fix the build.
We decided to keep the blocking sendrecv because the performance is slightly better than the non-blocking version (see the test data below). We need to wait for each isend to complete before we can copy the new data back from recvbuf to tmpbuf, so the non-blocking version does not show much improvement.
OSU MPI All-to-All Latency in µs (on 16 hpc7g nodes, 1024 ranks)
ompi_coll_base_sendrecv
| Message size (bytes) | k = 2 | k = 4 | k = 8 | k = 16 | k = 32 |
|---|---|---|---|---|---|
| 1 | 146.11 | 201.24 | 290.62 | 426.96 | 847.91 |
| 2 | 141.23 | 201.72 | 295.76 | 431.24 | 844.11 |
| 4 | 146.65 | 207.2 | 290.8 | 438.15 | 847.54 |
| 8 | 154.93 | 211.39 | 299.4 | 439.36 | 859.98 |
| 16 | 280.06 | 222.51 | 319.33 | 445.43 | 860.43 |
| 32 | 348.79 | 474.72 | 355.11 | 500.69 | 872.44 |
| 64 | 544.38 | 603.89 | 779.21 | 576.34 | 916.67 |
| 128 | 1059.73 | 932.32 | 1054.84 | 1323.35 | 968.53 |
| 256 | 21981.76 | 1822.93 | 1765.59 | 1875.43 | 2212.38 |
| 512 | 24050.4 | 3830.67 | 3719 | 3146.81 | 3036.46 |
| 1024 | 33873.93 | 8328.25 | 7612.07 | 6758.61 | 5348.24 |
| 2048 | 59072.64 | 18545.69 | 15535.72 | 14898.76 | 10101.14 |
| 4096 | 80879.25 | 37798.9 | 33115.21 | 27867.95 | 19722.63 |
| 8192 | 137532.69 | 80270.04 | 63793.19 | 61073.62 | 38873.77 |
isend/irecv
| Message size (bytes) | k = 2 | k = 4 | k = 8 | k = 16 | k = 32 |
|---|---|---|---|---|---|
| 1 | 138.96 | 204.74 | 291.22 | 428.89 | 847.15 |
| 2 | 141.24 | 205.62 | 290.97 | 430.1 | 848.62 |
| 4 | 145 | 204.63 | 295.42 | 438.76 | 851.13 |
| 8 | 156.82 | 211.84 | 298.93 | 444.47 | 865.3 |
| 16 | 279.29 | 223.89 | 324.6 | 449.07 | 862.43 |
| 32 | 350.09 | 479.04 | 357.18 | 502.31 | 880.38 |
| 64 | 540.79 | 601.15 | 799.61 | 583.52 | 932.97 |
| 128 | 1032.12 | 958.54 | 1087.27 | 1326.93 | 975.33 |
| 256 | 21958.61 | 1850.11 | 1753.25 | 1856.64 | 2210.36 |
| 512 | 23963.49 | 3837.31 | 3731 | 3241.24 | 3051.9 |
| 1024 | 33949.96 | 8348.78 | 7598.88 | 6802.96 | 5308.44 |
| 2048 | 59115.35 | 18520.23 | 15578.96 | 14939.42 | 10035.18 |
| 4096 | 80790.02 | 37814.46 | 33030 | 27908.15 | 19750.56 |
| 8192 | 137384.46 | 80003.8 | 63834.89 | 61007.52 | 38989.87 |
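To illustrate the dependency described above, here is a toy sketch of one exchange round (my own minimal example with ring peers and MPI_BYTE payloads, not the PR's code): the copy back into tmpbuf has to wait for both requests, so the non-blocking pair plus the copy is serialized much like the blocking sendrecv.

```c
/*
 * Toy sketch of a single exchange round (ring peers, MPI_BYTE payloads;
 * NOT the PR's actual code).  The memcpy back into tmpbuf cannot start
 * until both the receive and the send of the round have completed.
 */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1024;
    char *tmpbuf  = malloc(count);
    char *recvbuf = malloc(count);
    memset(tmpbuf, rank, count);

    int sendto   = (rank + 1) % size;
    int recvfrom = (rank - 1 + size) % size;

    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, count, MPI_BYTE, recvfrom, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(tmpbuf,  count, MPI_BYTE, sendto,   0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    /* only now can the received blocks be staged for the next round */
    memcpy(tmpbuf, recvbuf, count);

    free(tmpbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```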
Bruck's method is designed to work in an upper communicator (upper_comm) that contains only the local/node leaders, where every rank can communicate directly with any other rank in that communicator and every pair of ranks is equally distant (details in this paper: https://ieeexplore.ieee.org/document/642949).
Bruck's method optimizes latency by sacrificing bandwidth, so you are not likely to see a performance improvement for mid/large messages by using a larger k and limiting ranks_per_node to 1. It should instead be used inside a node-aware approach to the all-to-all, for the internode exchange only. The procedure looks like this: the data on each node is gathered and combined by the local/node leader; the leaders run the internode all-to-all using Bruck's method; the leaders then scatter the data back to the other ranks on the same node.
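To make the upper_comm idea concrete, here is a minimal sketch (illustrative only, not existing OMPI code; the gather/scatter phases are left as comments) of how the node leaders and the leader-only communicator could be formed:

```c
/*
 * Minimal sketch of forming node-local and leader-only communicators for
 * the node-aware scheme described above (NOT existing OMPI code).
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* ranks sharing a node */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* local rank 0 on each node becomes the node leader; only leaders get a
     * valid upper communicator, everyone else passes MPI_UNDEFINED */
    MPI_Comm upper_comm;
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &upper_comm);

    if (MPI_COMM_NULL != upper_comm) {
        int nleaders;
        MPI_Comm_size(upper_comm, &nleaders);
        printf("rank %d is a node leader (upper comm size %d)\n",
               world_rank, nleaders);
        /* phase 1: gather + combine local data on node_comm
         * phase 2: internode all-to-all (Bruck) on upper_comm
         * phase 3: scatter results back on node_comm */
        MPI_Comm_free(&upper_comm);
    }

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```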
@bosilca @lrbison Can you give this another review? Thanks!
There should be some additions to the user guide accompanying this change set that explain the algorithm and the MCA tuning parameters. Otherwise it will be practically useless to anyone but you. The capability for user tuning is important and should not be neglected.
Note: this recent paper described a similar algorithm. They derive a mathematical model for the selection of k, and found that a default value of k = sqrt(P), with P the communicator size, works well.
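For reference, the trade-off behind picking k can also be seen in the usual alpha-beta estimate for a generalized Bruck exchange (a standard back-of-the-envelope model, assuming P is a power of k; this is not the model derived in that paper):

$$
T(k) \;\approx\; (k-1)\,\log_k P \cdot \alpha \;+\; \frac{k-1}{k}\,P\,m\,\log_k P \cdot \beta
$$

where m is the per-pair message size, alpha the per-message startup cost, and beta the per-byte transfer time. A larger k shrinks the log_k P factor in the bandwidth term while inflating the (k-1) log_k P round count in the latency term, which is why an intermediate value such as k = sqrt(P) can be a reasonable default.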
@bosilca I experimented with your proposal to pack/unpack data into a contiguous buffer instead of creating a datatype in https://github.com/open-mpi/ompi/commit/1b7e2546332dcceac94ff19beb86134a65d772e9. I profiled the run and saw that the extra local copy actually creates a lot of overhead and causes a regression for small message sizes compared to creating and committing the datatype (see the data below). So I think we should use the previous commit.
OSU MPI All-to-All Latency in µs (on 16 hpc7g nodes, 1024 ranks)
| Message size (bytes) | create datatype | pack into contiguous buffer |
|---|---|---|
| 1 | 146.11 | 197.28 |
| 2 | 141.23 | 196.42 |
| 4 | 146.65 | 197.07 |
| 8 | 154.93 | 195.45 |
| 16 | 280.06 | 209.76 |
| 32 | 348.79 | 309.00 |
| 64 | 544.38 | 535.37 |
| 128 | 1059.73 | 879.60 |
| 256 | 21981.76 | 23591.02 |
| 512 | 24050.4 | 3805.94 |
| 1024 | 33873.93 | 32402.23 |
| 2048 | 59072.64 | 53316.38 |
| 4096 | 80879.25 | 119598.57 |
| 8192 | 137532.69 | 158336.34 |
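For context, here is a toy sketch of the two approaches being compared (block sizes and displacements are made up, and the send/recv calls are left as comments; this is not the code from the commit above): (a) describe the round's non-contiguous blocks with a committed indexed datatype, versus (b) pack them into a contiguous staging buffer, which adds one memcpy per block on each side.

```c
/*
 * Toy sketch of the two options being compared (made-up sizes/offsets,
 * send calls left as comments; NOT the code from the commit).
 */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const int blocklen = 128;                 /* bytes per block */
    const int nblocks  = 4;                   /* blocks selected this round */
    int displs[4] = { 1 * 128, 3 * 128, 5 * 128, 7 * 128 };  /* byte offsets */
    char *tmpbuf = calloc(8, blocklen);

    /* (a) indexed datatype: no local copy, the PML walks the layout */
    MPI_Datatype new_ddt;
    MPI_Type_create_indexed_block(nblocks, blocklen, displs, MPI_BYTE, &new_ddt);
    MPI_Type_commit(&new_ddt);
    /* ...ompi_coll_base_sendrecv(tmpbuf, 1, new_ddt, sendto, ...) */
    MPI_Type_free(&new_ddt);

    /* (b) pack into a contiguous staging buffer: one extra memcpy per
     * block on the send side (plus the mirror unpack on the receiver) */
    char *packbuf = malloc((size_t)nblocks * blocklen);
    for (int i = 0; i < nblocks; i++) {
        memcpy(packbuf + (size_t)i * blocklen, tmpbuf + displs[i], blocklen);
    }
    /* ...ompi_coll_base_sendrecv(packbuf, nblocks * blocklen, MPI_BYTE, sendto, ...) */

    free(packbuf);
    free(tmpbuf);
    MPI_Finalize();
    return 0;
}
```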
Thanks for taking a look. I have to say your results are concerning, because the extra overhead is terribly high, so high that I wonder how one small memcpy (for the 1-byte message, as an example) could add 30 ms. Would you mind sharing the code with me (maybe privately by email)?
@bosilca Did you have a chance to look at https://github.com/open-mpi/ompi/commit/1b7e2546332dcceac94ff19beb86134a65d772e9?
@bosilca Since this PR is backward compatible and does not cause a performance regression for the default k=2 case, I wonder if we can merge it and work on improvements as a separate project? Thanks!
From a code perspective this looks OK. But I still question the need for this from a logical perspective. If I look at the results posted earlier here, the k=2 case behaves great until we hit 128 bytes, when it jumps up about 20x, justifying the use of a larger radix. Clearly that jump in performance is problematic, and it alone leads to the poor tuned performance. Where is that jump coming from?
@bosilca Makes sense. Jessie has rotated to another high-priority project. I will take a look at the latency jump.
@bosilca I did an analysis of the algorithm with EFA. The weird performance bumps at some message sizes are the result of two factors:
- The number of send-recv rounds at a given radix: a higher radix requires more rounds, because each round carries less data.
- The datatype size (message size) sent in each round.
Below is an example on 16 nodes with 64 cores each, for a 128-byte MPI_Alltoall. I chose:
- ppn: 16, 32, 64
- radix: 2, 4, 8, 16
128-byte MPI_Alltoall, 16 nodes x 64 ppn (1024 ranks)
| Radix | Total rounds | Bytes per round |
|---|---|---|
| 2 | 10 | 65,536 |
| 4 | 15 | 32,768 |
| 8 | 22 | 16,384 (rounds 1-21), 65,536 (round 22) |
| 16 | 33 | 8,192 (rounds 1-30), 32,768 (rounds 31-33) |

128-byte MPI_Alltoall, 16 nodes x 32 ppn (512 ranks)
| Radix | Total rounds | Bytes per round |
|---|---|---|
| 2 | 9 | 32,768 |
| 4 | 13 | 16,384 (rounds 1-12), 32,768 (round 13) |
| 8 | 21 | 8,192 |
| 16 | 31 | 4,096 (rounds 1-30), 32,768 (round 31) |

128-byte MPI_Alltoall, 16 nodes x 16 ppn (256 ranks)
| Radix | Total rounds | Bytes per round |
|---|---|---|
| 2 | 8 | 16,384 |
| 4 | 12 | 8,192 |
| 8 | 17 | 4,096 (rounds 1-14), 8,192 (rounds 15-17) |
| 16 | 30 | 2,048 |
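For anyone who wants to re-derive the numbers above, here is a small standalone helper (my own sketch, not code from the PR) that enumerates the (digit position, digit value) rounds of the radix-k schedule and the bytes each one carries:

```c
/*
 * Standalone helper (illustrative sketch) that reproduces the round counts
 * and per-round byte volumes in the tables above: one round per (base-k
 * digit position, digit value) pair, carrying all 128-byte blocks whose
 * digit at that position equals that value.
 */
#include <stdio.h>

/* number of blocks b in [0, P) whose base-k digit at weight w equals d */
static long blocks_with_digit(long P, long w, long k, long d)
{
    long cycle = w * k;
    long full  = (P / cycle) * w;       /* complete cycles: w blocks each */
    long rem   = P % cycle - d * w;     /* partial trailing cycle */
    if (rem < 0) rem = 0;
    if (rem > w) rem = w;
    return full + rem;
}

int main(void)
{
    const long msg     = 128;           /* bytes per rank pair */
    const long nodes   = 16;
    const long ppns[]  = { 64, 32, 16 };
    const long radix[] = { 2, 4, 8, 16 };

    for (int i = 0; i < 3; i++) {
        long P = nodes * ppns[i];       /* communicator size */
        printf("--- %ld ranks (16 nodes x %ld ppn) ---\n", P, ppns[i]);
        for (int j = 0; j < 4; j++) {
            long k = radix[j], rounds = 0;
            printf("k=%-2ld bytes per round:", k);
            for (long w = 1; w < P; w *= k) {
                for (long d = 1; d < k && d * w < P; d++) {
                    rounds++;
                    printf(" %ld", blocks_with_digit(P, w, k, d) * msg);
                }
            }
            printf("  (%ld rounds)\n", rounds);
        }
        printf("\n");
    }
    return 0;
}
```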
The non-linear latency we observed is the result of the network layer, i.e. libfabric + the EFA network. EFA uses a datagram protocol (SRD), which exposes an MTU of 8 KiB. For larger messages, libfabric has to switch to other mechanisms:
- Segment the message into < 8 KiB chunks and send them one by one; or,
- Use the EFA device's RDMA capability to initiate a one-sided operation. On this platform this can be done by an RDMA read from the receiver side; note that this is a rendezvous process.
This means we cannot apply the theoretical latency formula in the Bruck paper, at least for EFA, due to the non-linear data transfer behavior.