
jenkins: async pingping tests

Open · hzhou opened this issue 5 years ago · 1 comment

In the pingping tests, one process sends blindly while the other receives. Without a barrier between batches, and when the receiving side can't keep up posting MPI_Recv (as is the case when the async progress thread is enabled and MPI_Recv is constantly interrupted by it), a huge message queue may accumulate on the receiving side. Currently, both libfabric and ucx have trouble dealing with a huge message queue. libfabric tries to match the entire message queue against the entire posted recv queue in every progress_test; a fix for the sockets provider is in PR #4466. For ucx, the current design requires a ucp_tag_probe_nb call in every progress_test, which can be very expensive when the message queue is large. The solution is to redesign how the ucx netmod handles active messages.
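
To make the queue build-up concrete, here is a rough toy model in Python. It is not MPICH code. The function name, the step-based timing, the `recv_every` parameter (a stand-in for MPI_Recv being interrupted by the progress thread), and the assumption that every progress step scans the whole unexpected queue are all made up for illustration.

```python
def simulate_pingping(total_msgs, batch_size, recv_every, barrier=False):
    """Toy model: return (peak queue depth, total entries scanned).

    Each step the sender injects one message. The receiver matches one
    message only on every `recv_every`-th step. With barrier=True the
    sender pauses after each full batch until the queue has drained.
    """
    queue = 0      # messages that arrived but are not yet matched
    peak = 0       # deepest unexpected queue seen so far
    scan_work = 0  # assumed cost: one full queue scan per progress step
    sent = 0
    step = 0
    while sent < total_msgs or queue > 0:
        step += 1
        # Hypothetical barrier: hold the sender at a batch boundary
        # while unmatched messages remain on the receiver.
        waiting = barrier and sent > 0 and sent % batch_size == 0 and queue > 0
        if sent < total_msgs and not waiting:
            sent += 1
            queue += 1
        peak = max(peak, queue)
        scan_work += queue
        if step % recv_every == 0 and queue > 0:
            queue -= 1
    return peak, scan_work


if __name__ == "__main__":
    for use_barrier in (False, True):
        peak, work = simulate_pingping(1000, 10, 4, barrier=use_barrier)
        print(f"barrier={use_barrier}: peak queue={peak}, scanned entries={work}")
```

In this model the barrier caps the queue at one batch, while the blind sender lets it grow with the total message count, and the per-step scan cost grows with it.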

Meanwhile, we are xfailing some of the async pingping tests (4 of them, all with sendcnt=1 and testsize=32).

hzhou · Apr 28 '20 18:04

Tests such as the pingping tests may flood the receiver with unexpected messages. This is intrinsically a performance issue: no design can deal with an unlimited number of unexpected messages. So I think the basic tests should limit the number of unexpected messages, and the stress level should be increased only for performance testing.

hzhou · May 29 '24 12:05