mpich One-sided communications in MPICH are considerably slower than those in Aurora MPICH

Tests conducted on 144-node runs in the alcf_kmd_val queue show that one-sided communications in MPICH 4.3.0rc2 are 18% slower than those in the default Aurora MPICH. A single-file reproducer, /tmp/reproducer7-alcf_kmd_val.tgz, is available for download from aurora-uan-0010.

Jan 14 '25 22:01 victor-anisimov

For reference, the relevant code in Fortran:

  call MPI_Win_Create(boxRegExpansion, buffer_size, dcmplx_size, MPI_INFO_NULL, MPI_COMM_RMGROUP, window, mpiError)
  call MPI_Win_fence(0, window, mpiError)
  nMessagesRank = 0
  nFence = 100
  if(iam == 0) write(*,'(a,i0,a/)') "Invoke fence after every ", nFence, " messages"
  do i = 1, nMessagesTotal
    if(srcRank(i) == rmRank) then
      ! catch errors in the input data
      if(srcAddress(i) >=0 .and. dstAddress(i) >= 0 .and. &
         srcAddress(i)+dataSize(i)-1 < expansionSize .and. &
         dstAddress(i)+dataSize(i)-1 < bufferSize(dstRank(i)) ) then
        ! rmRank in the sub-communicator requests the data from the destination rank, dstRank(i)
        nElements  = dataSize(i)
        targetRank = dstRank(i)
        call MPI_Get(boxRegExpansion(srcAddress(i)), nElements, MPI_DOUBLE_COMPLEX, targetRank, &
                     dstAddress(i), nElements, MPI_DOUBLE_COMPLEX, window, mpiError)
        nMessagesRank = nMessagesRank + 1
      else
        write(*,'(a,i4,a,i8)') "Rank ", iam, " message ", i   ! corrupted input data
      endif
    endif
    if(mod(i,nFence) == 0) call MPI_Win_fence(0, window, mpiError)
    if(mod(i,nFence) == 0 .and. iam == 0) write(*,'(a,i12,a)') "Rank 0 conducted ",i," one-sided messages"
  enddo
  !write(*,'(a,i10)') "rank after get before fence: ", iam
  !call flush(6)
  call MPI_Win_fence(0, window, mpiError)
  !write(*,'(a,i10)') "rank after fence before free: ", iam
  !call flush(6)
  call MPI_Win_free(window, mpiError)
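
To time a Get/fence loop of this shape in isolation, a minimal self-contained sketch could look like the following. It is not the attached reproducer: the program name, the message size and count, the ring-neighbor target, and the use of MPI_COMM_WORLD are all assumptions chosen for illustration. Only the fence interval of 100 is taken from the code above.

```fortran
! Hypothetical timing sketch, not the original reproducer.
program time_get_fence
  use mpi
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nElements = 1024   ! assumed message size (elements)
  integer, parameter :: nMessages = 1000   ! assumed number of MPI_Get calls
  integer, parameter :: nFence    = 100    ! fence interval, as in the loop above
  complex(dp), allocatable :: exposed(:), local(:)
  integer(kind=MPI_ADDRESS_KIND) :: winBytes, disp
  integer :: window, mpiError, iam, nRanks, targetRank, i, slot, elemBytes
  double precision :: t0, elapsed, slowest

  call MPI_Init(mpiError)
  call MPI_Comm_rank(MPI_COMM_WORLD, iam, mpiError)
  call MPI_Comm_size(MPI_COMM_WORLD, nRanks, mpiError)

  ! Each Get within one fence epoch lands in its own slice of "local",
  ! so no two outstanding Gets write to the same origin memory.
  allocate(exposed(nElements), local(nElements*nFence))
  exposed = cmplx(real(iam, dp), 0.0_dp, kind=dp)
  local   = (0.0_dp, 0.0_dp)

  elemBytes = storage_size(exposed) / 8    ! bytes per double complex element
  winBytes  = int(nElements, MPI_ADDRESS_KIND) * elemBytes
  call MPI_Win_create(exposed, winBytes, elemBytes, MPI_INFO_NULL, &
                      MPI_COMM_WORLD, window, mpiError)

  targetRank = mod(iam + 1, nRanks)        ! assumed pattern: read from the next rank
  disp = 0

  call MPI_Win_fence(0, window, mpiError)
  t0 = MPI_Wtime()
  do i = 1, nMessages
    slot = 1 + mod(i - 1, nFence) * nElements
    call MPI_Get(local(slot), nElements, MPI_DOUBLE_COMPLEX, targetRank, &
                 disp, nElements, MPI_DOUBLE_COMPLEX, window, mpiError)
    if (mod(i, nFence) == 0) call MPI_Win_fence(0, window, mpiError)
  enddo
  call MPI_Win_fence(0, window, mpiError)
  elapsed = MPI_Wtime() - t0

  ! Report the slowest rank, since the fence makes every rank wait for it.
  call MPI_Reduce(elapsed, slowest, 1, MPI_DOUBLE_PRECISION, MPI_MAX, 0, &
                  MPI_COMM_WORLD, mpiError)
  if (iam == 0) write(*,'(a,f10.4,a)') "Slowest rank: ", slowest, " s"

  call MPI_Win_free(window, mpiError)
  deallocate(exposed, local)
  call MPI_Finalize(mpiError)
end program time_get_fence
```

Building the same source against each MPI installation (for example with that installation's mpifort wrapper) and running it with the same rank count should give directly comparable wall-clock numbers, though the absolute values will not match the reproducer's because the message pattern here is synthetic.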

Jan 14 '25 22:01 hzhou

I have conducted a performance test of one-sided communications between two GPU pointers on a single node, using a single-file reproducer (12 MPI ranks per node). With the default Aurora MPICH in the lustre_scaling queue, the test completes in 2.5 seconds. With MPICH 4.3.0rc2 in the alcf_kmd_val queue, the test completes in 2.9 seconds. This difference corresponds to a 16% slowdown. The data file (data.txt) used by the single-file reproducer on a single node is attached to this message. data.txt.gz

Jan 15 '25 19:01 victor-anisimov

The performance test of host-to-host one-sided communications on a single node shows a 3x slowdown. A reproducer that can run on either the host or the device is attached. reproducer9-host-device.tgz

Jan 15 '25 23:01 victor-anisimov