
am-only build with UCX occasionally fails osu_bw benchmark in inter-NUMA setup

Open yfguo opened this issue 1 year ago • 1 comments

Just taking some notes.

I ran into this problem on Skylake while testing with gcc 14.1.0 and the UCX main branch.

UCX v1.16.0 also has this problem; UCX v1.15.0 (MPICH embedded) works fine.

Failure output:

# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)
1                       5.26
2                      10.71
4                      18.79
8                      41.67
16                     95.75
32                     59.16
64                     91.74
128                   144.74
256                   269.42
512                   527.92
1024                 1076.21
2048                 1903.21
4096                 3145.84
8192                 3293.59
16384                3815.50
Assertion failed in file /vast/users/yguo/shm_bench/mpich/main/src/mpid/ch4/src/mpidig_pt2pt_callbacks.c at line 243: data_sz <= MPIR_CVAR_CH4_PACK_BUFFER_SIZE
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(+0x324a62) [0x7f59220aba62]
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(+0x2980a4) [0x7f592201f0a4]
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(+0x2f849f) [0x7f592207f49f]
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(+0x30fca8) [0x7f5922096ca8]
/vast/users/yguo/shm_bench/install/ucx/main-gnu-default/lib/libucp.so.0(ucp_am_long_middle_handler+0x1f3) [0x7f5921724b43]
/vast/users/yguo/shm_bench/install/ucx/main-gnu-default/lib/libuct.so.0(+0x1a2cc) [0x7f59216d32cc]
/vast/users/yguo/shm_bench/install/ucx/main-gnu-default/lib/libucp.so.0(ucp_worker_progress+0x22) [0x7f5921746832]
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(+0x2785c3) [0x7f5921fff5c3]
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(+0x27b28e) [0x7f592200228e]
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(+0x27b4d7) [0x7f59220024d7]
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(+0x237eb1) [0x7f5921fbeeb1]
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(+0x238c41) [0x7f5921fbfc41]
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(+0x1a5c8f) [0x7f5921f2cc8f]
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(+0x20ea46) [0x7f5921f95a46]
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(+0x20eb69) [0x7f5921f95b69]
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(+0x21096f) [0x7f5921f9796f]
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(MPI_Barrier+0x332) [0x7f5921ddd112]
./c/mpi/pt2pt/standard/osu_bw() [0x4024c7]
/lib64/libc.so.6(__libc_start_main+0xef) [0x7f592180e24d]
./c/mpi/pt2pt/standard/osu_bw() [0x40318a]
Abort(1) on node 1: Internal error
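
The libmpi.so.0 frames in the backtrace above are raw offsets without symbol names. As a minimal sketch, assuming binutils `addr2line` is installed and the libraries were built with debug info (neither is confirmed in this report), the offsets could be collected per library and turned into `addr2line` commands. The helper name `addr2line_commands` is hypothetical, and the script only builds the commands; it does not run them.

```python
import re

# Matches glibc-style backtrace frames with no symbol name, e.g.
#   /path/libmpi.so.0(+0x324a62) [0x7f59220aba62]
# Frames that already carry a symbol (ucp_worker_progress+0x22) or have an
# empty "()" are intentionally skipped.
FRAME_RE = re.compile(
    r"^(?P<lib>\S+)\(\+(?P<off>0x[0-9a-fA-F]+)\)\s+\[0x[0-9a-fA-F]+\]$"
)


def addr2line_commands(backtrace):
    """Return one addr2line argument list per library found in the backtrace."""
    offsets = {}
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            offsets.setdefault(match.group("lib"), []).append(match.group("off"))
    # -f: function names, -C: demangle, -p: one frame per line, -e: target file
    return [
        ["addr2line", "-f", "-C", "-p", "-e", lib, *offs]
        for lib, offs in offsets.items()
    ]


if __name__ == "__main__":
    # Three frames copied from the failure output above.
    sample = """\
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(+0x324a62) [0x7f59220aba62]
/vast/users/yguo/shm_bench/install/ucx/main-gnu-default/lib/libucp.so.0(ucp_am_long_middle_handler+0x1f3) [0x7f5921724b43]
/vast/users/yguo/shm_bench/install/mpich/main-gnu-ucx-am/lib/libmpi.so.0(+0x2980a4) [0x7f592201f0a4]
"""
    for cmd in addr2line_commands(sample):
        print(" ".join(cmd))
```

Whether a bare offset resolves correctly depends on how the shared library was linked, so the output of these commands should be treated as a hint and cross-checked against the assertion location in mpidig_pt2pt_callbacks.c.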

yfguo avatar Jul 25 '24 16:07 yfguo

@yfguo

I have a few questions to get this to work:

1. What is the exact hardware configuration of the Skylake system you are using?
2. Are there any specific environment variables set for UCX or MPICH that might affect their behavior?
3. Are you using any specific configuration options or flags when building UCX and MPICH from source?
4. Can you provide the exact steps you followed to build and install UCX and MPICH from source?
5. What command are you using to run the OSU benchmark osu_bw?
6. Are there any specific input parameters or configurations used for the benchmark?

abeltre1 avatar Feb 05 '25 04:02 abeltre1