Gengbin Zheng comments

Results: 11 comments of Gengbin Zheng

coll: Iallreduce TSP recursive exchange with no dtcopy

> @zhenggb72 Please add pull request descriptions.

Done.

coll: Iallreduce TSP recursive exchange with no dtcopy

> The first two commits are independent. If you split, we can merge them right away. > > I'll need more time to go over the algorithms to determine the...

coll: Iallreduce TSP recursive exchange with no dtcopy

Some documentation about the algorithms: 1. Single receive buffer with data copy: (figure 1) A single receive buffer is used to receive the data from all (k-1) neighbors across all...

coll: Iallreduce TSP recursive exchange with no dtcopy

Outdated. Closing for now.

Memory growth with GPU-aware MPICH on Intel PVC GPUs

@zippylab Alex is working with me on this. We think we have a fix for this issue, and we would like a reproducer to test it.

coll: add coll_group to collective interfaces

> @zhenggb72 What is your suggested path forward?

I don't have much to add. As long as this PR does not change the behavior of the common scenarios, whatever you...

Wrong data with MPI send/recv and pipelining on Intel GPUs

Thanks for the reproducer. It appears that in GPU pipelining there are potentially scenarios in which chunks are written into receive buffers out of order. I created a PR, https://github.com/pmodels/mpich/pull/7182, to fix it.

Wrong data with MPI send/recv and pipelining on Intel GPUs

I can reproduce the hang with the main branch; however, our latest drop version works. Could you try the drop version that was installed on Aurora?

Wrong data with MPI send/recv and pipelining on Intel GPUs

@jcosborn Please try the Intel-provided drop: `module load mpich/opt/4.2.3-intel`. It seems to run with that version. I don't know what has changed in the default build.

[Aurora] GPU pipelining failure/performance degradation

> We should consider if setting `MPIR_CVAR_GPU_USE_IMMEDIATE_COMMAND_LIST=1` by default is the right thing to do rather than require users (or modules) to set it on Aurora and other systems. It...