bug: ch4/ofi/psm2: poor put/get performance when both origin and target datatypes are noncontig
When using RMA put or get to implement the halo exchange in a 2D stencil, the east/west exchange performs much worse than the same exchange implemented with send/recv.
Below are the performance numbers for 4 inter-connected processes on Argonne Bebop (Broadwell + OmniPath).
| Problem Size | PUT contig (north+south) | PUT noncontig (east+west) | PT2PT contig | PT2PT noncontig |
|---|---|---|---|---|
| 64 | 2.7076 | 24.7242 | 4.5245 | 3.5107 |
| 128 | 2.7854 | 54.9428 | 4.4331 | 2.2331 |
| 256 | 3.006 | 109.952 | 4.4955 | 2.387 |
| 512 | 3.3544 | 233.4858 | 2.7368 | 41.7567 |
| 1024 | 4.0334 | 501.6856 | 3.425 | 81.9862 |
| 2048 | 5.393 | 1003.2947 | 5.0462 | 162.1536 |
| 4096 | 13.8641 | 2005.5851 | 11.5831 | 336.516 |
| 8192 | 15.0235 | 4758.1552 | 16.3524 | 668.5294 |
| 16384 | 27.9602 | 9287.5796 | 24.4589 | 1343.4111 |
| 32768 | 46.1634 | 20758.3982 | 44.3566 | 2651.3244 |
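In summary, the noncontig exchange is much slower with PUT than with PT2PT (the same holds for GET): roughly 6x to 8x at most problem sizes in the table above, and more than 20x at sizes 128 and 256. This might be a performance issue in PUT/GET when both origin and target datatypes are noncontiguous.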
RMA over SHM might also need further investigation.
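For readers unfamiliar with why only the east/west exchange is noncontiguous, here is a minimal sketch. It assumes a row-major grid of `(bx+2)` columns with a one-cell halo; that layout, and the names `halo_layout`, `row_layout`, `col_layout`, and `pack_halo`, are hypothetical and not taken from the benchmark code. A north/south halo is one contiguous row, while an east/west halo is one element per row separated by a stride of `bx+2`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: the (count, blocklength, stride) triple,
 * in units of elements, that describes a strided region. */
typedef struct {
    size_t count;     /* number of blocks */
    size_t blocklen;  /* elements per block */
    size_t stride;    /* elements between the starts of two blocks */
} halo_layout;

/* North/south halo: a single contiguous block of bx elements. */
static halo_layout row_layout(size_t bx, size_t by)
{
    (void)by; /* the row length does not depend on the grid height */
    halo_layout l = { 1, bx, bx + 2 };
    return l;
}

/* East/west halo: by single elements, one per row (noncontiguous). */
static halo_layout col_layout(size_t bx, size_t by)
{
    halo_layout l = { by, 1, bx + 2 };
    return l;
}

/* Copy the described region, starting at src, into the contiguous
 * buffer dst. Returns the number of elements copied. */
static size_t pack_halo(const double *src, halo_layout l, double *dst)
{
    assert(src != NULL && dst != NULL);
    size_t n = 0;
    for (size_t i = 0; i < l.count; i++)
        for (size_t j = 0; j < l.blocklen; j++)
            dst[n++] = src[i * l.stride + j];
    return n;
}
```

The same triple maps onto the arguments of `MPI_Type_vector(count, blocklength, stride, MPI_DOUBLE, &newtype)`. In the case reported here, a strided type like the column layout is used on both the origin and the target side of the PUT/GET.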
Tagging @shawnccx @nusislam
@minsii how can I re-run the benchmarks used in the issue description? I would like to compare these numbers against the current master, which includes the RMA non-contig changes.
I think I used the MPI tutorial code. Can you try this: stencil_lock_put.c