
mpl/ze: fast_memcpy crash due to mmap in implicit mode

Open · victor-anisimov opened this issue 3 months ago · 4 comments

One-sided communication crashes on very small window sizes when using PVC in implicit scaling mode (gpu_dev_compact.sh), with the following error:

mmap failed fd: 46 size: 969998336 mmap device to host: Invalid argument
Abort(15) on node 1: Fatal error in internal_Get: Other MPI error

The same test works fine when using PVC in explicit scaling mode (gpu_tile_compact.sh).

The reproducer is a Fortran-90 file that runs on a single node in an interactive session using 6 MPI ranks; a rough C analogue of the access pattern is sketched after the attachments.

test.F90.txt

run-test.sh
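For orientation, here is a minimal C sketch of the pattern the error trace points at (an analogue based on the failure in internal_Get, not the attached Fortran file itself): every rank exposes a buffer in an RMA window and reads a neighbor's window with MPI_Get. In the real test the buffers are device-resident on PVC, which is what drives the copy through the fast_memcpy path; plain host malloc stands in here so the sketch compiles anywhere.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int bufferSize = 1024 * 65;   /* one of the sizes reported to crash later in this thread */
    int rank, nranks;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* In the real test these would be GPU allocations; host memory
     * stands in for portability of the sketch. */
    char *winbuf = calloc(bufferSize, 1);
    char *getbuf = malloc(bufferSize);

    MPI_Win win;
    MPI_Win_create(winbuf, bufferSize, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Read the whole window of the next rank; with device buffers and
     * intra-node ranks this goes down the MPIDI_POSIX IPC path shown
     * in the notes below. */
    MPI_Win_fence(0, win);
    MPI_Get(getbuf, bufferSize, MPI_BYTE, (rank + 1) % nranks, 0,
            bufferSize, MPI_BYTE, win);
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
    free(winbuf);
    free(getbuf);
    MPI_Finalize();
    return 0;
}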

victor-anisimov commented Oct 06 '25 22:10

NOTES:

Call path: MPID_Get -> MPIDI_POSIX_do_get -> MPIR_Ilocalcopy_gpu -> MPL_gpu_fast_memcpy -> MPL_ze_mmap_device_pointer -> mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fds[0], 0) -> EINVAL because the size is too large.

I believe that in implicit mode the device memory straddles two tiles.
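A minimal sketch of the failing call shape, assuming only what the trace shows: MPL_ze_mmap_device_pointer mmap's the fd exported for the device allocation, and the driver rejects a length it considers invalid with EINVAL. The toy main below provokes the same errno portably with a zero length, since a real device fd would be needed to reproduce the oversized-length case.

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

/* Same call shape as MPL_ze_mmap_device_pointer in the trace above. */
static void *map_device_fd(int fd, size_t size)
{
    void *ptr = mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ptr == MAP_FAILED) {
        /* Mirrors the reported message:
         * "mmap failed fd: 46 size: 969998336 mmap device to host: Invalid argument" */
        fprintf(stderr, "mmap failed fd: %d size: %zu mmap device to host: %s\n",
                fd, size, strerror(errno));
        return NULL;
    }
    return ptr;
}

int main(void)
{
    int fd = open("/dev/zero", O_RDWR);
    /* A zero length is a portable way to get EINVAL from mmap; an
     * oversized length on a device fd fails the same way. */
    map_device_fd(fd, 0);
    close(fd);
    return 0;
}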

hzhou commented Oct 07 '25 17:10

Interestingly, the problem in implicit mode is not the total size of the buffer; whether a given size crashes follows a pattern:

bufferSize = 1024 * 63                              ! works
bufferSize = 1024 * 64                              ! works
bufferSize = 1024 * 65                              ! crashes
bufferSize = 1024 * 96                              ! crashes
bufferSize = 1024 * 128                             ! works
bufferSize = 100352                                 ! crashes
bufferSize = 101041                                 ! crashes
bufferSize = 60624600_8                             ! crashes
bufferSize = 60624600_8 * 60_8                      ! works
bufferSize = 1024_8 * 1024_8 * 1024_8               ! works
bufferSize = 1024_8 * 1024_8 * 1024_8 / 16_8 * 120_8    ! works, 120 GB window

victor-anisimov commented Oct 07 '25 18:10

Here is a smaller reproducer that uses only 2 ranks:

test-small.F90.txt

run-small-test.sh

victor-anisimov commented Oct 07 '25 19:10

Setting MPIR_CVAR_CH4_IPC_GPU_RMA_ENGINE_TYPE=yaksa helps with the small reproducer; however, the full application still crashes on 342 nodes in implicit scaling mode (one rank per GPU, 6 ranks per node) with the error: Fatal error in internal_Win_create: Other MPI error. Are there any better workarounds?
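In case it helps others hitting this, a hedged sketch of one way to apply that workaround from inside the application rather than the launch script (assuming, as is usual for MPICH CVARs, that the variable is read from the environment during MPI_Init):

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    /* Must be set before MPI_Init; equivalent to exporting
     * MPIR_CVAR_CH4_IPC_GPU_RMA_ENGINE_TYPE=yaksa in the run script. */
    setenv("MPIR_CVAR_CH4_IPC_GPU_RMA_ENGINE_TYPE", "yaksa", 1);
    MPI_Init(&argc, &argv);
    /* ... one-sided code ... */
    MPI_Finalize();
    return 0;
}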

victor-anisimov commented Nov 05 '25 03:11