mpich bug: ch4/ofi/psm2: poor put/get performance when both origin and target datatypes are noncontig

When using RMA put or get to implement the halo exchange in a 2D stencil, the east/west exchange performs much worse than the same exchange implemented with send/recv.

Below are the performance numbers on 4 inter-connected processes on Argonne Bebop (Broadwell + OmniPath).

Problem Size	PUT contig (north+south)	PUT noncontig (east+west)	PT2PT contig	PT2PT noncontig
64	2.7076	24.7242	4.5245	3.5107
128	2.7854	54.9428	4.4331	2.2331
256	3.006	109.952	4.4955	2.387
512	3.3544	233.4858	2.7368	41.7567
1024	4.0334	501.6856	3.425	81.9862
2048	5.393	1003.2947	5.0462	162.1536
4096	13.8641	2005.5851	11.5831	336.516
8192	15.0235	4758.1552	16.3524	668.5294
16384	27.9602	9287.5796	24.4589	1343.4111
32768	46.1634	20758.3982	44.3566	2651.3244

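In summary, the noncontig exchange is much slower with PUT than with PT2PT (the same holds for GET): about 6x to 8x at problem sizes of 512 and above in the table above, and more at sizes 128 and 256. This might be a performance issue of PUT/GET when both origin and target datatypes are noncontiguous.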

Further investigation might also be needed for RMA over SHM.

Jul 28 '18 18:07 minsii

Tagging @shawnccx @nusislam

Jul 30 '18 14:07 hajimefu

@minsii how can I re-run the benchmarks used in the issue description? I would like to compare the performance of the current master, which includes the RMA non-contig changes.

Jun 03 '20 16:06 raffenet

I think I used the MPI tutorial code. Can you try this: stencil_lock_put.c

Jun 04 '20 03:06 minsii