One-sided communications in MPICH 4.3.0rc2 are considerably slower than in Aurora MPICH
Tests on 144-node runs in the alcf_kmd_val queue show that one-sided communications in MPICH 4.3.0rc2 are 18% slower than in the default Aurora MPICH. A single-file reproducer, /tmp/reproducer7-alcf_kmd_val.tgz, is available for download from aurora-uan-0010.
For reference, the relevant code in Fortran:
call MPI_Win_create(boxRegExpansion, buffer_size, dcmplx_size, MPI_INFO_NULL, MPI_COMM_RMGROUP, window, mpiError)
call MPI_Win_fence(0, window, mpiError)
nMessagesRank = 0
nFence = 100
if(iam == 0) write(*,'(a,i0,a/)') "Invoke fence after every ", nFence, " messages"
do i = 1, nMessagesTotal
  if(srcRank(i) == rmRank) then
    ! catch errors in the input data
    if(srcAddress(i) >= 0 .and. dstAddress(i) >= 0 .and. &
       srcAddress(i)+dataSize(i)-1 < expansionSize .and. &
       dstAddress(i)+dataSize(i)-1 < bufferSize(dstRank(i)) ) then
      ! rmRank in the sub-communicator requests the data from the destination rank, dstRank(i)
      nElements = dataSize(i)
      targetRank = dstRank(i)
      call MPI_Get(boxRegExpansion(srcAddress(i)), nElements, MPI_DOUBLE_COMPLEX, targetRank, &
                   dstAddress(i), nElements, MPI_DOUBLE_COMPLEX, window, mpiError)
      nMessagesRank = nMessagesRank + 1
    else
      write(*,'(a,i4,a,i8)') "Rank ", iam, " message ", i ! corrupted input data
    endif
  endif
  ! all ranks synchronize after every nFence iterations
  if(mod(i,nFence) == 0) call MPI_Win_fence(0, window, mpiError)
  if(mod(i,nFence) == 0 .and. iam == 0) write(*,'(a,i12,a)') "Rank 0 conducted ",i," one-sided messages"
enddo
!write(*,'(a,i10)') "rank after get before fence: ", iam
!call flush(6)
call MPI_Win_fence(0, window, mpiError)
!write(*,'(a,i10)') "rank after fence before free: ", iam
!call flush(6)
call MPI_Win_free(window, mpiError)
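To compare the two MPICH builds on the same pattern without the application's input data, the fenced MPI_Get loop above could be timed with a small standalone program along the following lines. This is only a minimal sketch, not the attached reproducer: the program name, the message size (nElements), the message count (nMessages), the ring-style choice of target rank, and the use of MPI_COMM_WORLD in place of MPI_COMM_RMGROUP are all assumptions made for illustration, and the buffers here live in host memory.

```fortran
! Hypothetical timing sketch; sizes and names are assumptions, not from the reproducer.
program time_fenced_get
  use mpi
  implicit none
  integer, parameter :: nElements = 1024   ! assumed elements per message
  integer, parameter :: nMessages = 1000   ! assumed number of MPI_Get calls per rank
  integer, parameter :: nFence    = 100    ! fence interval, as in the excerpt above
  complex(kind(1.0d0)), allocatable :: winBuf(:), recvBuf(:)
  integer :: iam, nRanks, targetRank, window, dcmplxSize, i, offset, mpiError
  integer(kind=MPI_ADDRESS_KIND) :: winBytes, targetDisp
  double precision :: t0, tLocal, tMax

  call MPI_Init(mpiError)
  call MPI_Comm_rank(MPI_COMM_WORLD, iam, mpiError)
  call MPI_Comm_size(MPI_COMM_WORLD, nRanks, mpiError)

  ! winBuf is exposed through the window; recvBuf holds one slot per
  ! outstanding get so that gets within one epoch never share origin memory.
  allocate(winBuf(nElements), recvBuf(nElements*nFence))
  winBuf  = cmplx(iam, 0, kind(1.0d0))
  recvBuf = (0.0d0, 0.0d0)

  call MPI_Type_size(MPI_DOUBLE_COMPLEX, dcmplxSize, mpiError)
  winBytes = int(nElements, MPI_ADDRESS_KIND) * dcmplxSize
  call MPI_Win_create(winBuf, winBytes, dcmplxSize, MPI_INFO_NULL, &
                      MPI_COMM_WORLD, window, mpiError)

  targetRank = mod(iam + 1, nRanks)   ! each rank reads from its right neighbour
  targetDisp = 0                      ! displacement in units of dcmplxSize

  call MPI_Win_fence(0, window, mpiError)
  t0 = MPI_Wtime()
  do i = 1, nMessages
    offset = mod(i - 1, nFence) * nElements + 1
    call MPI_Get(recvBuf(offset), nElements, MPI_DOUBLE_COMPLEX, targetRank, &
                 targetDisp, nElements, MPI_DOUBLE_COMPLEX, window, mpiError)
    if (mod(i, nFence) == 0) call MPI_Win_fence(0, window, mpiError)
  enddo
  call MPI_Win_fence(0, window, mpiError)
  tLocal = MPI_Wtime() - t0

  ! Report the slowest rank, since the fences make all ranks wait for it.
  call MPI_Reduce(tLocal, tMax, 1, MPI_DOUBLE_PRECISION, MPI_MAX, 0, &
                  MPI_COMM_WORLD, mpiError)
  if (iam == 0) write(*,'(a,f10.4,a)') "Get+fence loop, slowest rank: ", tMax, " s"

  call MPI_Win_free(window, mpiError)
  deallocate(winBuf, recvBuf)
  call MPI_Finalize(mpiError)
end program time_fenced_get
```

Building this once against each MPI installation and launching it with the same rank count (for example, 12 ranks on one node) should give directly comparable wall-clock times for the Get-plus-fence pattern alone.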
I ran a performance test of one-sided communications between two GPU pointers on a single node (12 MPI ranks per node) using a single-file reproducer. With the default Aurora MPICH in the lustre_scaling queue, the test completes in 2.5 seconds. With MPICH 4.3.0rc2 in the alcf_kmd_val queue, it completes in 2.9 seconds, a 16% slowdown. The data file (data.txt) used by the single-file reproducer on a single node is attached to this message: data.txt.gz
A performance test of host-to-host one-sided communications on a single node shows a 3x slowdown. A reproducer that can run on either the host or the device is attached: reproducer9-host-device.tgz