Hui Zhou
test:mpich/custom netmod: ch4:ofi
@TApplencourt Add timeline debug to this issue
test:mpich/ch3/most test:mpich/ch4/most
Builds other than `ch4:ofi` failed because `MPIDI_POSIX_mpi_bcast_release_gather` is not defined. Of course :( I need to figure out a way to define a partial `coll_algorithms.txt` in the device layer -- similar to...
If `MPIR_CVAR_CH4_PROGRESS_THROTTLE=1` alleviates the issue, then the cause is CPU/NIC memory contention.
It is a puzzle to me why different sets of nodes behave differently, and why sunspot never had the issue.
That said, I think with effort, it may be possible to tune the progress throttle algorithm to minimize its impact on normal performance and also work around the CPU/NIC contention...
Conclusion from offline discussion: somehow, turning off `no_hz` (https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt) hides the progress contention issue. I think the normal scheduling-clock tick interrupts the process's busy polling on progress, thus reducing...
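To make the throttling idea concrete, here is a minimal sketch of a busy-polling loop that backs off after a run of empty polls. This is not MPICH's actual progress throttle; every name here (`struct throttle`, `throttled_poll`, `throttle_pause`, `countdown_poll`) is hypothetical, and the pause is only counted rather than performed, so the sketch stays deterministic.

```c
#include <stdbool.h>

/* Hypothetical stand-in for one progress poll; returns true if work was found. */
typedef bool (*poll_fn)(void *ctx);

/* Hypothetical throttle state (not an MPICH data structure). */
struct throttle {
    int idle_polls;      /* consecutive polls that found no work */
    int idle_threshold;  /* empty polls tolerated before pausing */
    long pauses;         /* number of pauses taken so far */
};

/* Placeholder for the actual back-off. A real version might yield or sleep
 * briefly here; this sketch only counts the pause. */
static void throttle_pause(struct throttle *t)
{
    t->pauses++;
}

/* Poll once. After idle_threshold consecutive empty polls, pause once and
 * restart the idle count, which is roughly the relief an external timer
 * interrupt would give a tight polling loop. */
static bool throttled_poll(struct throttle *t, poll_fn poll, void *ctx)
{
    if (poll(ctx)) {
        t->idle_polls = 0;
        return true;
    }
    if (++t->idle_polls >= t->idle_threshold) {
        throttle_pause(t);
        t->idle_polls = 0;
    }
    return false;
}

/* Example poll function: ctx points to an int countdown; reports work
 * once the countdown reaches zero. */
static bool countdown_poll(void *ctx)
{
    int *remaining = ctx;
    return --(*remaining) <= 0;
}
```

A caller would loop on `throttled_poll` until it returns true. Tuning `idle_threshold` trades latency on the normal path against how hard an idle rank hammers shared memory.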
test:mpich/custom env: VERBOSE=1 env: MPITEST_IGNORE_OUTPUT=1
test:mpich/custom netmod: ch4:ofi env: FI_LOG_LEVEL=info env: MPITEST_IGNORE_OUTPUT=1