test: rma_contig timeout under multiple-vci in ch4:ofi
The test rma/rma_contig 2 runs roughly the following:
if (rank == 0) {
    for (i = 0; i < num_iter; i++) {
        MPI_Win_lock(...);
        MPI_Get/Put/Accumulate(...);
        MPI_Win_unlock(...);
    }
}
MPI_Barrier(MPI_COMM_WORLD);
The barrier runs over VCI 0, while the RMA operations and window synchronization happen on VCI 1. The test runs very slowly because the target process makes no progress on VCI 1 while it waits in the barrier. It progresses at all only because we occasionally poll global progress, so the test's rate is proportional to how often global progress is polled.
One thing ch3 had was a progress hook for RMA. It would iterate through the active windows and try to make progress on any outstanding operations. Adding something similar to ch4 might resolve this issue.
https://github.com/pmodels/mpich/blob/cd58c4274fb13f341d33047d6bb982622ed267c1/src/mpid/ch3/src/ch3u_rma_progress.c#L612-L642
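To make the idea concrete, here is a minimal sketch of what such a hook does: walk the list of active windows and push along any outstanding operations. This is not the ch3 code linked above; `toy_win` and `toy_rma_progress_hook` are hypothetical names, and "retiring" an operation is just a counter decrement standing in for real network work.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a window object; not an MPICH type. */
typedef struct toy_win {
    int pending_ops;        /* outstanding RMA operations on this window */
    struct toy_win *next;   /* next window in the active list */
} toy_win;

/* Hypothetical progress hook: visit every active window and retire at
 * most one outstanding operation per window per call. Returns the number
 * of operations retired, so the caller can tell whether progress was made. */
static int toy_rma_progress_hook(toy_win *active_list)
{
    int retired = 0;
    for (toy_win *w = active_list; w != NULL; w = w->next) {
        assert(w->pending_ops >= 0);
        if (w->pending_ops > 0) {
            w->pending_ops--;   /* stand-in for issuing/completing one op */
            retired++;
        }
    }
    return retired;
}
```

In a real device the hook would be registered with the progress engine and would need to take whatever lock protects each window's VCI, which is exactly where the contention concern comes in.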
The idea of VCIs is to reduce thread contention. Having a thread poll other VCIs, even when another VCI needs progress, may bring the thread contention back and hurt performance.
Yeah, that's a good point. I guess the issue for users is that enabling multi-VCI for an existing app could cause a slowdown, since calls that previously provided useful progress (e.g. MPI_BARRIER) may no longer do so.
It's not clear what we should do here. The lack of progress in passive RMA is more of a design issue, and it is unlikely to have a real fix in ch4. In fact, MPI_Barrier helping RMA progress with a single VCI is more accidental than intentional.
We should probably give users a CVAR to control how frequently global progress is invoked. The current default is once every 256 progress calls, set internally. MPIR_CVAR_CH4_GLOBAL_PROGRESS=0 can already be used to turn global progress off completely. We may want to extend this CVAR into an enum, e.g. "high/default/low/off".
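A rough sketch of how the enum could map onto a polling interval follows. None of this is existing MPICH code: the type and function names are hypothetical, and only the default interval of 256 comes from the discussion above; the "high" and "low" values are placeholders.

```c
#include <assert.h>

/* Hypothetical levels for the proposed enum CVAR; not an existing MPICH API. */
typedef enum {
    GLOBAL_PROGRESS_OFF,
    GLOBAL_PROGRESS_LOW,
    GLOBAL_PROGRESS_DEFAULT,
    GLOBAL_PROGRESS_HIGH
} global_progress_level;

/* Map a level to a polling interval in progress calls (0 means never).
 * Only 256 for the default is taken from the discussion; 16 and 4096
 * are illustrative guesses. */
static unsigned global_progress_interval(global_progress_level level)
{
    switch (level) {
        case GLOBAL_PROGRESS_HIGH:    return 16;
        case GLOBAL_PROGRESS_DEFAULT: return 256;
        case GLOBAL_PROGRESS_LOW:     return 4096;
        case GLOBAL_PROGRESS_OFF:     return 0;
    }
    return 0;
}

/* Decide whether this progress call should also poll all VCIs.
 * call_count is the caller's running count of progress calls. */
static int should_poll_global(global_progress_level level, unsigned call_count)
{
    unsigned interval = global_progress_interval(level);
    if (interval == 0)
        return 0;
    assert((interval & (interval - 1)) == 0);   /* power of two, so the mask works */
    return (call_count & (interval - 1)) == 0;
}
```

Keeping the intervals as powers of two lets the check stay a single mask operation on the progress fast path. The trade-off is the one raised earlier in the thread: a shorter interval gives passive-target RMA more progress but reintroduces cross-VCI contention.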