ch4: shm: fix data type for recv_bytes in MPIDI_POSIX_mpi_release_gat…

Open nmorey opened this issue 4 months ago • 1 comments

The number of received bytes in release_gather_release is incorrectly cast between int and MPI_Aint. On most architectures this is harmless, but on 64-bit big-endian architectures (s390x) the actual value is lost. Fix this by writing the whole MPI_Aint into shm_buf instead of just an int.
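
For illustration only, here is a minimal standalone sketch of why this kind of width mismatch is invisible on little-endian machines but loses the value on big-endian ones. It is not the MPICH code: `store64`, `load`, `publish_as_int` and `publish_as_aint` are hypothetical helpers, byte order is simulated so the effect can be reproduced on any host, and it assumes a 4-byte `int` and an 8-byte `MPI_Aint`.

```c
#include <stdint.h>
#include <string.h>

enum byte_order { ORDER_LITTLE, ORDER_BIG };

/* Lay out a 64-bit value in memory the way a CPU of the given byte order would. */
static void store64(unsigned char *dst, int64_t value, enum byte_order order)
{
    uint64_t v = (uint64_t) value;
    for (int i = 0; i < 8; i++) {
        int shift = (order == ORDER_BIG) ? 56 - 8 * i : 8 * i;
        dst[i] = (unsigned char) ((v >> shift) & 0xff);
    }
}

/* Read back an nbytes-wide unsigned integer stored in the given byte order. */
static uint64_t load(const unsigned char *src, int nbytes, enum byte_order order)
{
    uint64_t v = 0;
    for (int i = 0; i < nbytes; i++) {
        int shift = (order == ORDER_BIG) ? 8 * (nbytes - 1 - i) : 8 * i;
        v |= (uint64_t) src[i] << shift;
    }
    return v;
}

/* Hypothetical buggy path: only the first 4 bytes (an assumed sizeof(int))
 * of the 8-byte count reach the shared buffer and are read back as an int. */
int64_t publish_as_int(int64_t recv_bytes, enum byte_order order)
{
    unsigned char local[8], shm_buf[8] = { 0 };
    store64(local, recv_bytes, order);
    memcpy(shm_buf, local, 4);
    return (int64_t) load(shm_buf, 4, order);
}

/* Hypothetical fixed path: all 8 bytes (an assumed sizeof(MPI_Aint))
 * are written to the shared buffer and read back at the same width. */
int64_t publish_as_aint(int64_t recv_bytes, enum byte_order order)
{
    unsigned char local[8], shm_buf[8] = { 0 };
    store64(local, recv_bytes, order);
    memcpy(shm_buf, local, 8);
    return (int64_t) load(shm_buf, 8, order);
}
```

With a count of 4, `publish_as_int(4, ORDER_LITTLE)` still yields 4 because the low-order bytes come first in memory, while `publish_as_int(4, ORDER_BIG)` yields 0 because the first four bytes are the high-order half. `publish_as_aint` returns 4 for either byte order.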

This bug was found on 4.3.2 while debugging on s390x with ch4:ofi:

> mpiexec -np 4     ./file_info -fname test
Abort(476133135) on node 1 (rank 1 in comm 0): Fatal error in internal_Bcast: Other MPI error, error stack:
internal_Bcast(116)........................: MPI_Bcast(buffer=0x1004174, count=1, MPI_INT, 0, MPI_COMM_WORLD) failed
MPID_Bcast(295)............................: 
MPIDI_Bcast_allcomm_composition_json(239)..: 
MPIDI_Bcast_intra_composition_alpha(292)...: 
MPIDI_POSIX_mpi_bcast(278).................: 
MPIDI_POSIX_mpi_bcast_release_gather(127)..: 
MPIDI_POSIX_mpi_release_gather_release(225): message sizes do not match across processes in the collective routine: Received 0 but expected 4

nmorey, Nov 08 '25 23:11

test:mpich/ch3/most test:mpich/ch4/most

hzhou, Nov 19 '25 13:11