am-only build with UCX occasionally fails osu_bw benchmark in inter-NUMA setup
Just taking some notes.
I ran into this problem while testing on Skylake, using gcc 14.1.0 with the UCX main branch.
UCX v1.16.0 also has this problem; UCX v1.15.0 (embedded in MPICH) works fine.
Failure output:
# Datatype: MPI_CHAR.
# Size Bandwidth (MB/s)
1 5.26
2 10.71
4 18.79
8 41.67
16 95.75
32 59.16
64 91.74
128 144.74
256 269.42
512 527.92
1024 1076.21
2048 1903.21
4096 3145.84
8192 3293.59
16384 3815.50
Assertion failed in file /vast/users/yguo/shm_bench/mpich/main/src/mpid/ch4/src/mpidig_pt2pt_callbacks.c at line 243: data_sz <= MPIR_CVAR_CH4_PACK_BUFFER_SIZE
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(+0x324a62) [0x7f59220aba62]
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(+0x2980a4) [0x7f592201f0a4]
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(+0x2f849f) [0x7f592207f49f]
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(+0x30fca8) [0x7f5922096ca8]
/vast/users/yguo/shm_bench/install/ucx/main-gnu-default/lib/libucp.so.0(ucp_am_long_middle_handler+0x1f3) [0x7f5921724b43]
/vast/users/yguo/shm_bench/install/ucx/main-gnu-default/lib/libuct.so.0(+0x1a2cc) [0x7f59216d32cc]
/vast/users/yguo/shm_bench/install/ucx/main-gnu-default/lib/libucp.so.0(ucp_worker_progress+0x22) [0x7f5921746832]
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(+0x2785c3) [0x7f5921fff5c3]
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(+0x27b28e) [0x7f592200228e]
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(+0x27b4d7) [0x7f59220024d7]
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(+0x237eb1) [0x7f5921fbeeb1]
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(+0x238c41) [0x7f5921fbfc41]
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(+0x1a5c8f) [0x7f5921f2cc8f]
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(+0x20ea46) [0x7f5921f95a46]
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(+0x20eb69) [0x7f5921f95b69]
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(+0x21096f) [0x7f5921f9796f]
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(MPI_Barrier+0x332) [0x7f5921ddd112]
./c/mpi/pt2pt/standard/osu_bw() [0x4024c7]
/lib64/libc.so.6(__libc_start_main+0xef) [0x7f592180e24d]
./c/mpi/pt2pt/standard/osu_bw() [0x40318a]
Abort(1) on node 1: Internal error
@yfguo
I have a few questions to get this to work:
1. What is the exact hardware configuration of the Skylake system you are using?
2. Are there any specific environment variables set for UCX or MPICH that might affect their behavior?
3. Are you using any specific configuration options or flags when building UCX and MPICH from source?
4. Can you provide the exact steps you followed to build and install UCX and MPICH from source?
5. What command are you using to run the OSU benchmark osu_bw?
6. Are there any specific input parameters or configurations used for the benchmark?
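
For the environment-variable question, one possible way to collect the answer is to dump every variable whose name starts with `UCX_`, `MPIR_CVAR_`, or `MPICH_` from the shell that launches the benchmark. This is only a minimal sketch: the helper name `relevant_env` is hypothetical, and the prefix list is an assumption about which variables matter here, not an exhaustive list of what UCX or MPICH read.

```python
import os


def relevant_env(environ=None, prefixes=("UCX_", "MPIR_CVAR_", "MPICH_")):
    """Return the variables whose names start with one of `prefixes`.

    `relevant_env` is a hypothetical helper; the default prefixes are an
    assumption about which settings could influence this failure.
    """
    env = os.environ if environ is None else environ
    # str.startswith accepts a tuple, so one call checks all prefixes.
    return {k: env[k] for k in sorted(env) if k.startswith(tuple(prefixes))}


if __name__ == "__main__":
    # Print in NAME=value form so the output can be pasted into the issue.
    for name, value in relevant_env().items():
        print(f"{name}={value}")
```

Running it in the same shell (or job script) that launches osu_bw shows the settings the processes would inherit; empty output suggests no such variables are set in that environment.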