Hui Zhou

695 comments by Hui Zhou

test:mpich/custom netmod: ch4:ofi

Builds other than `ch4:ofi` failed because `MPIDI_POSIX_mpi_bcast_release_gather` is not defined. Of course :( I need to figure out a solution to define partial `coll_algorithms.txt` in the device layer -- similar to...

If `MPIR_CVAR_CH4_PROGRESS_THROTTLE=1` alleviates the issue, then it is the CPU/NIC memory contention issue.

It is a puzzle to me why different sets of nodes behave differently, and why sunspot never had the issue.

That said, I think with effort, it may be possible to tune the progress throttle algorithm to minimize its impact on normal performance and also work around the CPU/NIC contention...
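To make the throttling idea concrete, here is a minimal sketch of a busy-poll loop that inserts a configurable number of idle spins between polls. It is not MPICH's implementation: `poll_fn`, `struct counter`, `poll_counter`, and `wait_with_throttle` are hypothetical names invented for illustration, and the counter stands in for a real progress hook.

```c
#include <assert.h>

/* Hypothetical stand-in for a progress hook: returns 1 once work completes. */
typedef int (*poll_fn)(void *state);

/* Toy state: reports completion on the ready_after-th poll. */
struct counter {
    long polls;
    long ready_after;
};

static int poll_counter(void *state)
{
    struct counter *c = state;
    c->polls++;
    return c->polls >= c->ready_after;
}

/* Poll until completion. After each unsuccessful poll, spin `throttle`
 * iterations without polling, so the poller touches shared memory less
 * often. Returns the total number of idle spins performed. */
static long wait_with_throttle(poll_fn poll, void *state, int throttle)
{
    volatile long spins = 0;   /* volatile keeps the idle loop from being optimized away */

    assert(throttle >= 0);
    while (!poll(state)) {
        for (int i = 0; i < throttle; i++)
            spins++;
    }
    return spins;
}
```

With `throttle` set to 0 this degenerates to a plain busy poll; larger values trade completion latency for fewer polls per unit time, which is the tension any tuning would have to balance.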

Conclusion from offline discussion: somehow, turning off `no_hz` (https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt) hides the progress contention issue. I think the normal scheduling-clock tick interrupts the process's busy polling on progress, thus reducing...

test:mpich/custom env: VERBOSE=1 env: MPITEST_IGNORE_OUTPUT=1

test:mpich/custom netmod: ch4:ofi env: FI_LOG_LEVEL=info env: MPITEST_IGNORE_OUTPUT=1