test: rma_contig timeout under multiple-vci in ch4:ofi

Open hzhou opened this issue 4 years ago • 4 comments

The test rma/rma_contig 2 runs roughly the following:

if (rank == 0) {
    for (i = 0; i < num_iter; i++) {
        MPI_Win_lock(...);
        MPI_Get/Put/Accumulate(...);
        MPI_Win_unlock(...);
    }
}
MPI_Barrier(MPI_COMM_WORLD);

The barrier runs over VCI 0, while the RMA operations and window synchronization happen on VCI 1. The test runs very slowly because the target process makes no progress on VCI 1. It advances at all only because we occasionally poll global progress, so the test's speed is proportional to how often we poll global progress.
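To make the proportionality concrete, here is a toy model in C. It is not MPICH code: `polls_to_finish`, `steps_per_epoch`, and `interval` are hypothetical names, and the model assumes each global-progress pass services exactly one pending step on VCI 1.

```c
#include <assert.h>

/* Toy model, not MPICH code. The target sits in MPI_Barrier polling VCI 0.
 * Every `interval`-th poll it also runs global progress, which is the only
 * time it services VCI 1. Each lock/op/unlock epoch from the origin is
 * assumed to need `steps_per_epoch` target-side services of VCI 1. */
long polls_to_finish(long num_iter, int steps_per_epoch, int interval)
{
    assert(num_iter >= 0 && steps_per_epoch > 0 && interval > 0);

    long remaining = num_iter * steps_per_epoch; /* services still needed */
    long polls = 0;

    while (remaining > 0) {
        polls++;
        if (polls % interval == 0) {
            remaining--; /* global progress reached VCI 1 */
        }
    }
    return polls;
}
```

Under these assumptions the number of barrier polls needed grows linearly with `interval`, which matches the slowdown reported for the test.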

hzhou avatar Sep 22 '21 13:09 hzhou

One thing ch3 had was a progress hook for RMA. It would iterate through the active windows and try to make progress on any outstanding operations. Adding something similar to ch4 might resolve this issue.

https://github.com/pmodels/mpich/blob/cd58c4274fb13f341d33047d6bb982622ed267c1/src/mpid/ch3/src/ch3u_rma_progress.c#L612-L642
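A rough sketch of what such a hook could look like, written in C. The types and names here (`toy_win`, `toy_rma_progress_hook`) are hypothetical stand-ins and are not the ch3 or ch4 data structures; the linked ch3 source is the reference for the real logic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a window object; not the MPICH window type. */
typedef struct toy_win {
    int pending_ops;      /* outstanding operations on this window */
    struct toy_win *next; /* next entry in the active-window list */
} toy_win;

/* Hypothetical progress hook: visit every active window and retire at most
 * one pending operation on each. Returns how many operations advanced, so
 * the caller can tell whether the hook did any useful work. */
int toy_rma_progress_hook(toy_win *active_list)
{
    int made_progress = 0;

    for (toy_win *w = active_list; w != NULL; w = w->next) {
        assert(w->pending_ops >= 0);
        if (w->pending_ops > 0) {
            w->pending_ops--;
            made_progress++;
        }
    }
    return made_progress;
}
```

A progress engine would call a hook like this from its main polling loop, so any call that drives progress (such as a barrier) would also advance outstanding RMA work.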

raffenet avatar Oct 13 '21 17:10 raffenet

The idea of VCI is to reduce thread contention. Having a thread poll other VCIs, even when another VCI needs progress, may bring the thread contention back and defeat the performance benefit.

hzhou avatar Oct 27 '21 02:10 hzhou

Yeah, that's a good point. I guess the issue for users is that enabling multi-VCI for an existing app could cause a slowdown, since previously useful progress methods (e.g. MPI_BARRIER) may no longer work.

raffenet avatar Oct 27 '21 14:10 raffenet

It's not clear what we should do here. The lack of progress in passive RMA is more of a design issue, and it's unlikely to have a real fix in ch4. In fact, the way MPI_Barrier helps RMA progress with a single VCI is more accidental than intentional.

We should probably give users a CVAR to control how frequently we invoke global progress. The current default is once every 256 polls, set internally. Setting MPIR_CVAR_CH4_GLOBAL_PROGRESS=0 turns global progress off completely. We may want to extend this CVAR into an enum, e.g. "high/default/low/off".
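A minimal sketch of how such an enum could map to a polling interval, in C. Everything here is hypothetical (`gp_level`, `gp_parse`, `gp_interval`, `gp_should_poll`); only the default of 256 comes from the description above, and the values for "high" and "low" are placeholders picked for illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical levels for an extended global-progress CVAR. */
typedef enum { GP_OFF, GP_LOW, GP_DEFAULT, GP_HIGH } gp_level;

/* Parse the user-facing string; NULL or unknown input falls back to default. */
gp_level gp_parse(const char *s)
{
    if (s == NULL) return GP_DEFAULT;
    if (strcmp(s, "off") == 0) return GP_OFF;
    if (strcmp(s, "low") == 0) return GP_LOW;
    if (strcmp(s, "high") == 0) return GP_HIGH;
    return GP_DEFAULT;
}

/* Map a level to "run global progress once every N polls"; 0 means never.
 * Only 256 is taken from the discussion; 16 and 4096 are placeholders. */
int gp_interval(gp_level level)
{
    switch (level) {
    case GP_OFF:     return 0;
    case GP_LOW:     return 4096;
    case GP_DEFAULT: return 256;
    case GP_HIGH:    return 16;
    }
    assert(!"unknown gp_level");
    return 256;
}

/* Decide whether this poll should also run global progress. */
int gp_should_poll(gp_level level, unsigned long poll_count)
{
    int interval = gp_interval(level);
    return interval != 0 && poll_count % (unsigned long) interval == 0;
}
```

With a mapping like this, a user who relies on barrier-driven RMA progress could pick "high" and accept more cross-VCI contention, while a heavily threaded app could pick "low" or "off".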

hzhou avatar Nov 01 '21 19:11 hzhou