Gengbin Zheng
> @zhenggb72 Please add pull request descriptions.

Done.
> The first two commits are independent. If you split, we can merge them right away. > > I'll need more time to go over the algorithms to determine the...
Some documentation on the algorithms: 1. Single receive buffer with data copy: (figure 1) A single receive buffer is used to receive the data from all (k-1) neighbors across all...
Outdated. Closing for now.
@zippylab Alex is working with me on this. We think we have a fix for this issue, and we would like a reproducer to test it.
> @zhenggb72 What is your suggested path forward? I don't have much to add. As long as this PR does not change the behavior of the common scenarios, whatever you...
Thanks for the reproducer. It appears that in GPU pipelining there are scenarios in which chunks can be written into the receive buffers out of order. I created a PR, https://github.com/pmodels/mpich/pull/7182, to fix it.
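
To illustrate the general hazard (this is only a toy sketch, not the MPICH pipelining code and not the change made in the PR): if a receiver places chunks by arrival order, any reordering corrupts the message, whereas placing each chunk at an offset derived from its index is order-independent. The names `reassemble_by_arrival`, `reassemble_by_index`, and `CHUNK` are hypothetical and exist only for this example.

```python
# Toy model of a chunked (pipelined) transfer. Each chunk is (index, payload).
# All names here are hypothetical; nothing below is taken from MPICH.

CHUNK = 4  # assumed fixed chunk size in bytes


def reassemble_by_arrival(chunks, total):
    """Append chunks in the order they arrive (breaks if they are reordered)."""
    buf = bytearray(total)
    pos = 0
    for _idx, data in chunks:
        buf[pos:pos + len(data)] = data
        pos += len(data)
    return bytes(buf)


def reassemble_by_index(chunks, total):
    """Write each chunk at the offset implied by its index (order-independent)."""
    buf = bytearray(total)
    for idx, data in chunks:
        off = idx * CHUNK
        buf[off:off + len(data)] = data
    return bytes(buf)


message = b"AAAABBBBCCCCDD"
n_chunks = (len(message) + CHUNK - 1) // CHUNK
chunks = [(i, message[i * CHUNK:(i + 1) * CHUNK]) for i in range(n_chunks)]

# Simulate chunks completing out of order.
out_of_order = [chunks[2], chunks[0], chunks[3], chunks[1]]

broken = reassemble_by_arrival(out_of_order, len(message))
correct = reassemble_by_index(out_of_order, len(message))
```

With in-order delivery both functions give the same result; only the index-based version still reproduces `message` once the chunks are reordered.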
I can reproduce the hang with the main branch; however, our latest drop version works. Could you try the drop version that was installed on Aurora?
@jcosborn please try the Intel-provided drop: `module load mpich/opt/4.2.3-intel`. It seems to run with that version. I don't know what has changed in the default build.
> We should consider if setting `MPIR_CVAR_GPU_USE_IMMEDIATE_COMMAND_LIST=1` by default is the right thing to do rather than require users (or modules) to set it on Aurora and other systems. It...