
analyse intel_transport_recv.h at line 1160: cma_read_nbytes == size assert

lslusarczyk opened this issue · 7 comments

Update: this is a bug in Intel MPI, tracked in Jira: https://jira.devtools.intel.com/browse/IMPI-4619

When running ctest -R mhp-sycl-sort-tests-3 on DevCloud, on branch https://github.com/lslusarczyk/distributed-ranges/tree/mateusz_sort_expose_mpi_assert, we hit:

Assertion failed in file ../../src/mpid/ch4/shm/posix/eager/include/intel_transport_recv.h at line 1160: cma_read_nbytes == size
/opt/intel/oneapi/mpi/2021.10.0//lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x14c1a5a7236c]
/opt/intel/oneapi/mpi/2021.10.0//lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x14c1a5429131]
/opt/intel/oneapi/mpi/2021.10.0//lib/release/libmpi.so.12(+0xb22e38) [0x14c1a5922e38]
/opt/intel/oneapi/mpi/2021.10.0//lib/release/libmpi.so.12(+0xb1fa41) [0x14c1a591fa41]
/opt/intel/oneapi/mpi/2021.10.0//lib/release/libmpi.so.12(+0xb1cd4d) [0x14c1a591cd4d]
/opt/intel/oneapi/mpi/2021.10.0//lib/release/libmpi.so.12(+0x2f58b4) [0x14c1a50f58b4]
/opt/intel/oneapi/mpi/2021.10.0//lib/release/libmpi.so.12(PMPI_Wait+0x41f) [0x14c1a56816af]
./mhp-tests() [0x5c863d]
./mhp-tests() [0x58e124]
./mhp-tests() [0x6cdd0c]
./mhp-tests() [0x75676c]
./mhp-tests() [0x7374c5]
./mhp-tests() [0x738b33]
./mhp-tests() [0x73974f]
./mhp-tests() [0x74df0f]
./mhp-tests() [0x74cfcb]
./mhp-tests() [0x472f7f]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x14c1a3ce3d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x14c1a3ce3e40]
./mhp-tests() [0x46f005]
Abort(1) on node 0: Internal error
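
The backtrace shows a nonblocking receive being completed inside PMPI_Wait over the shared-memory (posix eager) transport, where the CMA read trips the assert. Below is a minimal sketch of that communication pattern only, assuming two ranks on the same node and a message large enough to take the CMA path; it is not a confirmed reproducer, and the buffer size is an assumption.

```cpp
// Sketch of the pattern visible in the backtrace: a nonblocking receive
// completed with MPI_Wait between two ranks on the same node. NOT a
// confirmed reproducer; the message size below is an assumption.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Assumed size: large enough for the shm transport to use CMA reads.
  std::vector<char> buf(64 << 20, rank == 0 ? 1 : 0);
  MPI_Request req;

  if (rank == 0) {
    MPI_Isend(buf.data(), static_cast<int>(buf.size()), MPI_CHAR, 1, 0,
              MPI_COMM_WORLD, &req);
  } else if (rank == 1) {
    MPI_Irecv(buf.data(), static_cast<int>(buf.size()), MPI_CHAR, 0, 0,
              MPI_COMM_WORLD, &req);
  }
  if (rank < 2) {
    // The assert in intel_transport_recv.h fires while completing the
    // request, i.e. inside this MPI_Wait on the receiving rank.
    MPI_Wait(&req, MPI_STATUS_IGNORE);
  }

  if (rank == 1) std::printf("receive completed\n");
  MPI_Finalize();
  return 0;
}
```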

Some links to useful Intel MPI documentation, tips, and hacks:

Intel® MPI for GPU Clusters (article): https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2023-2/intel-mpi-for-gpu-clusters.html

Environment variables influencing the way GPU support works:

https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-10/gpu-support.html
https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-10/gpu-buffers-support.html
https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-10/gpu-pinning.html

Still, I found a tip pointing towards a solution of the problem here: https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/intel-mpi-error-line-1334-cma-read-nbytes-size/m-p/1329220

export I_MPI_SHM_CMA=0 helped in some cases (yet the behaviour does not seem fully deterministic; it may depend on which DevCloud node is assigned for execution).
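
A minimal sketch, assuming Intel MPI reads I_MPI_SHM_CMA during MPI_Init, of applying the same workaround from inside the test binary instead of the shell; exporting the variable before mpirun, as above, remains the documented approach, and this is illustrative only.

```cpp
// Hypothetical sketch: disable CMA from inside the process before MPI_Init.
// Assumption: a variable set here behaves like one exported in the shell
// before launching, because Intel MPI reads it during initialization.
#include <mpi.h>
#include <cstdlib>

int main(int argc, char** argv) {
  // Equivalent to `export I_MPI_SHM_CMA=0`; overwrite=1 replaces any
  // value inherited from the environment.
  setenv("I_MPI_SHM_CMA", "0", /*overwrite=*/1);

  MPI_Init(&argc, &argv);
  // ... run the tests / communication as usual ...
  MPI_Finalize();
  return 0;
}
```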

People have had similar problems in the past: https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/Intel-oneAPI-2021-4-SHM-Issue/m-p/1324805

When setting the env vars to:

export I_MPI_FABRICS=shm
export I_MPI_SHM_CMA=0
export I_MPI_OFFLOAD=1

You may also encounter:

Assertion failed in file ../../src/mpid/ch4/shm/posix/eager/include/intel_transport_send.h at line 2012: FALSE
...

Still, the simple solution of copying memory from the device to the host is counterproductive, as Intel MPI supports direct GPU-to-GPU communication (see https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-10/gpu-buffers-support.html#SECTION_3F5D70BDEFF84E3A84325A319BA53536).
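
For reference, a minimal sketch of what GPU-aware communication with I_MPI_OFFLOAD=1 is meant to look like: a SYCL USM device allocation handed directly to MPI, with no host staging copy. The queue selection and message size are illustrative assumptions, not distributed-ranges code.

```cpp
// Sketch of direct GPU-to-GPU communication with GPU-aware Intel MPI
// (run with I_MPI_OFFLOAD=1): the device pointer is passed straight to
// MPI, avoiding the device-to-host copy dismissed above.
#include <mpi.h>
#include <sycl/sycl.hpp>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  sycl::queue q{sycl::gpu_selector_v};
  constexpr int n = 1 << 20;          // assumed message size
  int* dev = sycl::malloc_device<int>(n, q);

  if (rank == 0) {
    q.fill(dev, 42, n).wait();
    // Device allocation passed directly to MPI; no host buffer involved.
    MPI_Send(dev, n, MPI_INT, 1, 0, MPI_COMM_WORLD);
  } else if (rank == 1) {
    MPI_Recv(dev, n, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }

  sycl::free(dev, q);
  MPI_Finalize();
  return 0;
}
```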

lslusarczyk · Oct 31 '23 09:10