mpl/ze: fast_memcpy crash due to mmap in implicit mode
One-sided communication crashes on very small window sizes when using PVC in implicit scaling mode (gpu_dev_compact.sh), with the following error:
    mmap failed fd: 46 size: 969998336 mmap device to host: Invalid argument
    Abort(15) on node 1: Fatal error in internal_Get: Other MPI error
The same test works fine when using PVC in explicit scaling mode (gpu_tile_compact.sh).
The reproducer is a Fortran 90 program that runs on a single node in an interactive session with 6 MPI ranks.
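The Fortran source is not included here; the following C sketch shows the same pattern (Level Zero device allocation, MPI_Win_create, intra-node MPI_Get). Names such as bufferSize are illustrative and error checking is omitted, so treat it as a sketch rather than the actual reproducer.

/*
 * Minimal C sketch of an equivalent reproducer (the actual reproducer is a
 * Fortran 90 program).  Each rank allocates a Level Zero device buffer,
 * exposes it through MPI_Win_create, and rank 1 issues an intra-node MPI_Get
 * from rank 0's window, which is where internal_Get aborts.
 */
#include <mpi.h>
#include <level_zero/ze_api.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Pick the first driver/device visible to this rank (the wrapper
     * script is assumed to set the device affinity per rank). */
    zeInit(ZE_INIT_FLAG_GPU_ONLY);
    uint32_t n = 1;
    ze_driver_handle_t driver;
    zeDriverGet(&n, &driver);
    n = 1;
    ze_device_handle_t device;
    zeDeviceGet(driver, &n, &device);
    ze_context_desc_t cdesc = { ZE_STRUCTURE_TYPE_CONTEXT_DESC, NULL, 0 };
    ze_context_handle_t context;
    zeContextCreate(driver, &cdesc, &context);

    /* One of the sizes observed to crash in implicit scaling mode. */
    size_t bufferSize = 1024 * 65;

    ze_device_mem_alloc_desc_t adesc = { ZE_STRUCTURE_TYPE_DEVICE_MEM_ALLOC_DESC,
                                         NULL, 0, 0 };
    void *devbuf = NULL;
    zeMemAllocDevice(context, &adesc, bufferSize, 64, device, &devbuf);

    /* Expose the device buffer as an RMA window. */
    MPI_Win win;
    MPI_Win_create(devbuf, (MPI_Aint) bufferSize, 1, MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    void *origin = malloc(bufferSize);
    MPI_Win_fence(0, win);
    if (rank == 1) {
        /* Intra-node get from rank 0's device window exercises the
         * fast_memcpy/mmap path noted below. */
        MPI_Get(origin, (int) bufferSize, MPI_BYTE, 0, 0,
                (int) bufferSize, MPI_BYTE, win);
    }
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
    free(origin);
    zeMemFree(context, devbuf);
    zeContextDestroy(context);
    MPI_Finalize();
    return 0;
}

Running this with 6 ranks on one node under the implicit scaling wrapper versus the explicit one should show the same contrast as the Fortran reproducer.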
NOTES:
Call path: MPID_Get -> MPIDI_POSIX_do_get -> MPIR_Ilocalcopy_gpu -> MPL_gpu_fast_memcpy -> MPL_ze_mmap_device_pointer ->
mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fds[0], 0);
The mmap call fails with EINVAL because the requested size is too large.
I believe that in implicit mode the device memory straddles the two tiles.
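If that is the case, the fd being mapped would cover only one tile's portion of the allocation and a mapping of the full size would be rejected. A hedged diagnostic sketch (not MPL code; mmap_with_check and its placement are hypothetical) that could confirm the mismatch before the failing mmap:

/* Hypothetical diagnostic around the failing mmap (fds[0] in the call path
 * above): on Linux a dma-buf fd reports its size via lseek(fd, 0, SEEK_END),
 * so a size smaller than the requested mapping would support the
 * tile-straddling theory. */
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static void *mmap_with_check(int fd, size_t size)
{
    off_t fd_size = lseek(fd, 0, SEEK_END);
    lseek(fd, 0, SEEK_SET);
    if (fd_size >= 0 && (size_t) fd_size < size)
        fprintf(stderr, "fd %d exposes only %lld bytes, %zu requested\n",
                fd, (long long) fd_size, size);

    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ptr == MAP_FAILED)
        perror("mmap device to host");
    return ptr;
}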
Interestingly, in implicit mode the problem is not the total size of the buffer; whether a given size crashes seems to follow a certain rule:
bufferSize = 1024 * 63                                ! works
bufferSize = 1024 * 64                                ! works
bufferSize = 1024 * 65                                ! crashes
bufferSize = 1024 * 96                                ! crashes
bufferSize = 1024 * 128                               ! works
bufferSize = 100352                                   ! crashes
bufferSize = 101041                                   ! crashes
bufferSize = 60624600_8                               ! crashes
bufferSize = 60624600_8 * 60_8                        ! works
bufferSize = 1024_8 * 1024_8 * 1024_8                 ! works
bufferSize = 1024_8 * 1024_8 * 1024_8 / 16_8 * 120_8  ! works (120 GB window)
Setting MPIR_CVAR_CH4_IPC_GPU_RMA_ENGINE_TYPE=yaksa helps with the small reproducer; however, the full application still crashes on 342 nodes in implicit scaling mode (one rank per GPU, 6 ranks per node) with the error "Fatal error in internal_Win_create: Other MPI error". Are there any better workarounds?
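For completeness, the CVAR workaround above was applied through the job environment; since MPICH reads CVARs from environment variables of the same name at MPI_Init, setting it programmatically is equivalent (a small sketch, not the actual application):

/* Sketch: apply the workaround CVAR before MPI_Init; equivalent to
 * exporting MPIR_CVAR_CH4_IPC_GPU_RMA_ENGINE_TYPE=yaksa in the job script. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    setenv("MPIR_CVAR_CH4_IPC_GPU_RMA_ENGINE_TYPE", "yaksa", 1);
    MPI_Init(&argc, &argv);
    /* ... application ... */
    MPI_Finalize();
    return 0;
}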