
CMA support for passing data between processes on the same node

Open ericjbohm opened this issue 7 years ago • 4 comments

Original issue: https://charm.cs.illinois.edu/redmine/issues/1497


PXSHM exists for this purpose on the net layers. However, it is not generally used in SMP mode to exchange data when multiple comm threads share the same node.

The shared-memory safety and portability of our PXSHM implementation are concerns, so if the implementation choice were PXSHM, we would need to fail over smoothly to not using it on nodes that do not support it. Note that this is a runtime property: it can depend on kernel module load choices on the compute nodes, which can differ from the head node on which Charm++ was compiled.
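
For reference, a minimal sketch of what such a runtime check could look like, assuming we probe POSIX shared memory support by actually creating and touching a small segment at startup (the probe name and function name here are illustrative, not the actual RTS code):

```c
/* Probe whether POSIX shared memory is usable on this node at runtime,
   so the RTS can fall back to the regular message path if it is not.
   Compile with -lrt on older glibc. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static int pxshm_usable(void) {
  char name[64];
  /* Per-PID probe name to avoid colliding with leftovers from other runs */
  snprintf(name, sizeof name, "/charm_pxshm_probe_%d", (int)getpid());
  int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
  if (fd < 0) return 0;                 /* no POSIX shm support on this node */
  int ok = 0;
  if (ftruncate(fd, 4096) == 0) {
    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p != MAP_FAILED) {
      memset(p, 0, 4096);               /* touch the pages to force backing */
      munmap(p, 4096);
      ok = 1;
    }
  }
  close(fd);
  shm_unlink(name);                     /* clean up the probe segment */
  return ok;
}
```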

Some experimentation should be undertaken to determine where (if anywhere) this provides any benefit.

ericjbohm avatar Apr 11 '17 21:04 ericjbohm

Original date: 2017-04-11 23:55:55


The main things to investigate here appear to be xpmem, pxshm, knem, limic, and cma. LiMIC appears to be tied to MPI's pt2pt semantics. OpenMPI's "vader" shared-memory BTL can use any of those implementations that are available (http://blogs.cisco.com/performance/the-vader-shared-memory-transport-in-open-mpi-now-featuring-3-flavors-of-zero-copy), and I've seen results from several papers (look at Nathan Hjelm's publications) showing that xpmem is the best performing of them, with knem second and perhaps more portable than xpmem.

OpenMPI can get to <0.3us 1-byte message latency within a node using xpmem on Cray XE6 and XC40. I've also seen Intel MPI achieve ~0.5us shared-memory latency for 1-byte messages on KNLs and Haswells. Judging from the Charm++ SMP pingpong benchmark, we're usually 5-10x worse than those numbers.

The main benefit will be for large messages, so a good first target would be the zero copy send API.
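
To make the mechanism concrete, here is a minimal sketch of a CMA pull, assuming the sender has already shipped its (pid, address, length) triple to the receiver in a small control message. The function name and error handling are illustrative, not the Charm++ zero copy send API:

```c
/* Copy a large payload directly out of the sender's address space with
   process_vm_readv(2), avoiding any intermediate copy through a shared
   segment or the kernel socket path. Requires Linux >= 3.2. */
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Returns 0 on success, -1 on failure (e.g. EPERM when a ptrace policy
   such as Yama forbids attaching to the source process). */
static int cma_pull(pid_t src_pid, const void *src_addr, void *dst, size_t len) {
  struct iovec local  = { .iov_base = dst,              .iov_len = len };
  struct iovec remote = { .iov_base = (void *)src_addr, .iov_len = len };
  ssize_t n = process_vm_readv(src_pid, &local, 1, &remote, 1, 0);
  return (n == (ssize_t)len) ? 0 : -1;
}
```

The single-syscall copy is why the win is largest for big messages: for small ones the syscall overhead dominates and a pipelined shared-memory copy tends to be faster.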

stwhite91 avatar Apr 25 '19 02:04 stwhite91

Original date: 2017-10-26 21:54:42


Nitin is working on adding support for using Cross Memory Attach (CMA) for this. We already have an implementation working for the zero copy send API, which shows good performance. CMA is available by default on Linux kernels v3.2+, so it is portable across most of the systems we care about.
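
Since the same caveat applies as with PXSHM (the feature can be present at compile time but absent or blocked on the compute nodes), a runtime availability check could look like the following sketch, which reads one byte from our own address space and distinguishes success from ENOSYS (kernel too old) or EPERM (blocked by ptrace policy). Names are illustrative:

```c
/* Self-read probe: process_vm_readv on our own pid succeeds iff the
   syscall exists and is not blocked for this process. */
#define _GNU_SOURCE
#include <sys/uio.h>
#include <unistd.h>

static int cma_available(void) {
  char src = 42, dst = 0;
  struct iovec local  = { .iov_base = &dst, .iov_len = 1 };
  struct iovec remote = { .iov_base = &src, .iov_len = 1 };
  return process_vm_readv(getpid(), &local, 1, &remote, 1, 0) == 1 && dst == 42;
}
```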

stwhite91 avatar Apr 25 '19 02:04 stwhite91

Did this get integrated?

ericjbohm avatar Jun 27 '19 18:06 ericjbohm

Yes, it has been integrated into the RTS code. (https://github.com/UIUC-PPL/charm/commit/adbc4700fd7a4462347912d4c7a408988c57b3a9)

However, since it hasn't been benchmarked, it hasn't been enabled. We need to benchmark CMA regular-message performance, compare it with non-CMA performance on each layer, and determine the threshold sizes between which CMA can be used for regular messages.
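
Once the benchmarking is done, the gating logic itself can be simple; a sketch, assuming per-layer lower and upper bounds come out of the benchmarks (the variable names and default values below are placeholders, not the actual RTS tunables):

```c
#include <stddef.h>

/* Hypothetical per-layer thresholds, to be filled in from benchmark data. */
static size_t cma_min_bytes = 4096;             /* below: shm copy wins     */
static size_t cma_max_bytes = (size_t)1 << 30;  /* above: may need chunking */

static int use_cma_for_message(size_t len) {
  return len >= cma_min_bytes && len <= cma_max_bytes;
}
```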

nitbhat avatar Jun 27 '19 18:06 nitbhat