Hui Zhou
> Finally, just a FWIW: scanning the code underlying this PR, I noticed that you have a "fence" that collects data, and a "barrier" that does not collect data. Unfortunately,...
test:mpich/ch3/most test:mpich/ch4/most
test:mpich/custom env: MPIR_CVAR_ALLTOALLV_PAIRWISE_NEW=1 test:mpich/ch3/most test:mpich/ch4/most
The tests passed
@victor-anisimov I built a debug version of mpich on Aurora at `/home/hzhou/pull_requests/mpich-debug/_inst`. Could you try linking against it and running the reproducer with `MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=1` set? The goal is to generate...
> `module unload mpich`

Make sure to set `PALS_PMI=pmix` if you unload the mpich module.

> while setting `MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=1`

Make it `MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=10`. A 1-second timeout was too short. 10 second seems...
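
For reference, the environment settings mentioned above could be collected roughly as follows. This is only a sketch: the guard around `module` is an assumption so the snippet also runs where environment modules are not installed, and the launch line is a hypothetical placeholder, not the actual reproducer command.

```shell
#!/bin/sh
# Unload the system mpich module only if the `module` command is available
# (assumption: on Aurora it is provided by the environment-modules setup).
if command -v module >/dev/null 2>&1; then
    module unload mpich
fi

# Needed once the mpich module is unloaded, per the comment above.
export PALS_PMI=pmix

# Progress timeout in seconds; 1 was too short, so use 10.
export MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=10

echo "PALS_PMI=$PALS_PMI"
echo "MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=$MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT"

# Hypothetical launch line; replace with the real reproducer invocation:
# mpiexec ./reproducer
```

The reproducer still has to be linked against the debug build separately; the snippet only covers the runtime environment.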
I think I got some timeouts, but I need to confirm with a larger `MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT` - ``` ... Rank 0 conducted 1148200 one-sided messages Rank 0 conducted 1148300 one-sided messages 3 pending requests in...
``` grep 'All Done' job*.out job10.out:All Done job11.out:All Done job12.out:All Done job13.out:All Done job14.out:All Done job16.out:All Done job17.out:All Done job18.out:All Done job19.out:All Done job1.out:All Done job21.out:All Done job22.out:All Done job25.out:All...
Thanks @raffenet for the potential clue. I can see how that (a single process getting overwhelmed due to a lower-layer failure or congestion) may cause the issue.