Issues on running MPICH on Summit
Problem
The current MPICH does not run in the current Summit environment with jsrun. It causes an error in PMIx_Init().
How to reproduce the issue
I used the latest MPICH with the default MPI stack (IBM's Spectrum MPI and its PMIx) and ran the following configure after loading the CUDA and GCC modules:
./configure --with-device=ch4:ucx --prefix=$HOME/software/ci-build --enable-ch4-am-only \
--with-pm=none --with-pmix=$MPI_ROOT
# if you use CUDA
# ./configure --with-device=ch4:ucx --prefix=$HOME/software/ci-build --enable-ch4-am-only \
# --enable-gpu-tests-only --with-cuda=$CUDAPATH --with-pm=none --with-pmix=$MPI_ROOT
It shows the following error when I run cpi.
bash-4.2$ echo "$MPI_ROOT"
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/xl-16.1.1-5/spectrum-mpi-10.3.1.2-.../
bash-4.2$ jsrun -n 1 -r 1 -a 1 -g 1 --smpiargs="-disable_gpu_hooks" ./cpi
Abort(201935621) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
PMPI_Comm_size(109): MPI_Comm_size(comm=0x18d12a0, size=0x2000000b049c) failed
PMPI_Comm_size(66).: Invalid communicator
When I use GDB on a compute node, the error seems to be in PMIx_Init(): supposedly, an illegal memory access happened during initialization.
Note that libpmix is linked against:
libpmix.so.2 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-9.1.0/spectrum-mpi-10.3.1.2-.../lib/libpmix.so.2
It seems that, with exactly the same configure settings, MPICH works over jsrun without any problem on a Summit-like IBM-based system (Ascent: https://docs.olcf.ornl.gov/systems/ascent_user_guide.html), so this might be a Summit-specific issue.
Workaround (example: running two processes)
The workaround is to set the hosts manually, which at least worked on Summit. The following worked with MPICH commit 404cd8a920f4cda93d3e65dcd4e942948935a8c8.
- Compile MPICH with CUDA+UCX (installed to MPICH_CUDA_PATH).
# You also need to install newer libtool/autotools etc. to compile MPICH.
module load cuda/11.0.3 gcc/9.1.0
./configure --with-device=ch4:ucx --prefix=$MPICH_CUDA_PATH --enable-ch4-am-only --enable-gpu-tests-only --with-cuda="$(realpath $(dirname $(which nvcc))/..)" CC=gcc CXX=gcc
- Compile MPICH without CUDA (installed to MPICH_NOCUDA_PATH) to get an mpiexec that does not need CUDA.
module load gcc/9.1.0
./configure --with-device=ch4:ucx --prefix=$MPICH_NOCUDA_PATH --enable-ch4-am-only CC=gcc CXX=gcc
- Allocate two nodes (you will then be logged into a batch node).
bsub -W 2:00 -nnodes 2 -P csc371 -Is $SHELL
- On a batch node, run the following to get the launch commands after loading all the modules.
# Get LD_LIBRARY_PATH on a compute node
# $ jsrun -n 1 -r 1 echo $LD_LIBRARY_PATH
# Get a list of accessible hosts
# $ jsrun -n 2 -r 1 hostname | paste -d, -s -
echo "# two nodes, one process per node"
echo "LD_LIBRARY_PATH=$(jsrun -n 1 -r 1 echo $LD_LIBRARY_PATH) ${MPICH_NOCUDA_PATH}/bin/mpiexec -host $(jsrun -n 2 -r 1 hostname | paste -d, -s -) -n 2 <APP>"
echo ""
echo "# one node, two processes"
echo "LD_LIBRARY_PATH=$(jsrun -n 1 -r 1 echo $LD_LIBRARY_PATH) ${MPICH_NOCUDA_PATH}/bin/mpiexec -env CUDA_VISIBLE_DEVICES 0 -n 1 <APP> : -env CUDA_VISIBLE_DEVICES 1 -n 1 <APP>"
- On a batch node, log in to one of the compute nodes.
ssh $(jsrun -n 1 -r 1 hostname)
- On the compute node, run a CUDA-compiled version with the mpiexec command above.
Note: on Summit, the home directory is read-only from compute nodes.
- .ssh/known_hosts might cause an ssh-related issue. You can alias ssh / add known_hosts manually to fix it.
- You might need to run examples/.libs/cpi instead of examples/cpi.
bash-4.2$ jsrun -n 1 -r 1 -a 1 -g 1 --smpiargs="-disable_gpu_hooks" ./cpi
Abort(201935621) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
PMPI_Comm_size(109): MPI_Comm_size(comm=0x18d12a0, size=0x2000000b049c) failed
PMPI_Comm_size(66).: Invalid communicator
It was caused by Darshan; use module unload darshan-runtime to skip it.
Thanks. I will check it.
It worked. Thanks, @hzhou! I updated the MPICH wiki: https://wiki.mpich.org/mpich/index.php/Summit
I will close this issue.
Liked the Summit wiki!
After module unload darshan-runtime, I see a different PMI error when running with mpich/main + jsrun.
MPICH/main configure:
module load gcc
module load cuda/10.1.243
module unload darshan-runtime
yaksadir=$HOME/git/yaksa/build-cuda10.1.243/install
ucxdir=/autofs/nccs-svm1_home1/minsi/git/ucx-1.10.0/build-cuda10.1.243/install
../configure --prefix=$installdir CC=gcc CXX=g++ \
--disable-romio --disable-mpe --disable-ft-tests --disable-spawn --disable-fortran \
--disable-fast --enable-g=all \
--with-yaksa=$yaksadir \
--with-device=ch4:ucx --with-ucx=$ucxdir \
--disable-static --with-cuda=$CUDA_DIR \
--with-hwloc=embedded \
--with-pm=none --with-pmix=$MPI_ROOT \
CFLAGS=-std=gnu11
Note: $MPI_ROOT was set by spectrum-mpi/10.3.1.2-20200121
Compile the test program:
$installdir/bin/mpicc -o cpi ./cpi.c
Execution command with an interactive allocation (two ranks running on a single node)
jsrun -n 2 -r 2 -a 1 -g 1 ./cpi
Error
Abort at src/util/mpir_pmi.c line 1105:
static int hex(unsigned char c)
{
if (c >= '0' && c <= '9') {
return c - '0';
} else if (c >= 'a' && c <= 'f') {
return 10 + c - 'a';
} else if (c >= 'A' && c <= 'F') {
return 10 + c - 'A';
} else {
MPIR_Assert(0); <<<< here
return -1;
}
}
Some debugging notes
- Core dump backtrace:
#0 0x000020000080fbf0 in raise () from /lib64/libc.so.6
#1 0x0000200000811f6c in abort () from /lib64/libc.so.6
#2 0x000020000054d39c in hex (c=201 '\311') at ../src/util/mpir_pmi.c:1108
#3 0x000020000054d4c8 in decode (size=514, src=0x42900720 "\311\320\340\360", dest=0x2000ce7d1200 "") at ../src/util/mpir_pmi.c:1126
#4 0x000020000054b700 in get_ex (src=1, key=0x7fffd6b85218 "-allgather-shm-1-1", buf=0x2000ce7d1000, p_size=0x7fffd6b8524c, is_local=0)
at ../src/util/mpir_pmi.c:474
#5 0x000020000054c568 in MPIR_pmi_allgather_shm (sendbuf=0x428ffef0, sendsize=893, shm_buf=0x2000ce7d0000, recvsize=4096, domain=MPIR_PMI_DOMAIN_ALL)
at ../src/util/mpir_pmi.c:701
#6 0x00002000005c7850 in MPIDU_bc_table_create (rank=1, size=2, nodemap=0x4232f430, bc=0x428ffef0, bc_len=893, same_len=0, roots_only=0,
bc_table=0x7fffd6b853b8, ret_bc_len=0x7fffd6b853c0) at ../src/mpid/common/bc/mpidu_bc.c:154
#7 0x00002000005aabec in initial_address_exchange (init_comm=0x0) at ../src/mpid/ch4/netmod/ucx/ucx_init.c:93
#8 0x00002000005abc50 in MPIDI_UCX_mpi_init_hook (rank=1, size=2, appnum=0, tag_bits=0x7fffd6b85504, init_comm=0x0)
at ../src/mpid/ch4/netmod/ucx/ucx_init.c:277
#9 0x00002000005aba7c in MPIDI_UCX_init_world (init_comm=0x0) at ../src/mpid/ch4/netmod/ucx/ucx_init.c:259
#10 0x0000200000558370 in MPID_Init_world () at ../src/mpid/ch4/src/ch4_init.c:624
#11 0x0000200000557614 in MPID_Init (requested=0, provided=0x2000007b4810 <MPIR_ThreadInfo>) at ../src/mpid/ch4/src/ch4_init.c:474
- Printed exchanged key-value pairs:
put_ex: key=-allgather-shm-1-0, bufsize=893, n=1787, strlen=1786, encoded=00DEAC1D5A1824B7F24008E363C902BA22285DA7D377CC2B32004C3E5077CCAB33004F230088420E02800A0000C04108E363C902...
MPIR_pmi_kvs_get: key=-allgather-shm-1-0, strlen=1786, val_size=1024, pvalue->data.string=00DEAC1D5A1824B7F24008E363C902BA22285DA7D377CC2B32004C3E5077CCAB33004F230088420E02800A0000C04108E36...
get_ex: key=-allgather-shm-1-0, size=514, val=1737924512
- Guessed cause: PMIx_Get receives the entire value (1786 bytes) at optimized_get->MPIR_pmi_kvs_get, but the caller copies only 1024 bytes, which is limited by pmi_max_val_size.
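To make that concrete, here is a minimal sketch of the guessed failure mode (not the actual MPICH code; the sizes are taken from the log above):

/* Sketch only: PMIx hands back the full encoded value (e.g. 1786 hex chars
 * in the log above), but the caller copies at most pmi_max_val_size bytes,
 * so a later decode() that needs more than 1024 hex characters walks past
 * the truncated copy and hits the MPIR_Assert(0) in hex() quoted above. */
#include <string.h>

enum { PMI_MAX_VAL_SIZE = 1024 };          /* current pmi_max_val_size */

void copy_out(const char *pmix_string, char *val)
{
    /* hypothetical copy-out step standing in for optimized_get ->
     * MPIR_pmi_kvs_get; the truncation here is silent and val may not
     * even be null-terminated */
    strncpy(val, pmix_string, PMI_MAX_VAL_SIZE);
}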
Naive fix
Increase pmi_max_val_size in MPIR_pmi_init under the USE_PMIX_API branch:
#elif defined USE_PMIX_API
pmi_max_key_size = PMIX_MAX_KEYLEN;
- pmi_max_val_size = 1024; /* this is what PMI2_MAX_VALLEN currently set to */
+ pmi_max_val_size = 1024*16;
@raffenet Can you please suggest the right fix for the above PMIx bug? I did a naive fix (increasing the value of pmi_max_val_size) on Summit and now mpich/main + jsrun finally works.
TODO: Both hydra and jsrun work with mpich/main on Summit now. Going to write a note at https://wiki.mpich.org/mpich/index.php/Summit
[DONE]
Changing pmi_max_val_size will still break if the business card exceeds the new limit, although that seems unlikely today.
In put_ex, we do the segmentation when #if defined(USE_PMI1_API) || defined(USE_PMI2_API), but not when USE_PMIX_API. If you remove the #if switch and always do the segmentation, will it work?
Segmentation does not seem to be the right solution to me. PMI1 and PMI2 took that approach because they have the PMI2_MAX_VALLEN limit and require the user to provide the recv buffer.
pmi_errno = PMI_KVS_Get(pmi_kvs_name, key, val, val_size);
pmi_errno = PMI2_KVS_Get(pmi_jobid, src, key, val, val_size, &out_len);
But such a limit does not exist in PMIx anymore (I have not read the PMIx spec carefully enough, so please correct me if I am wrong). And now the temporary recv buffer is allocated by PMIx internally:
pmix_value_t *pvalue;
PMIx_Get(&proc, key, NULL, 0, &pvalue);
// copy out from pvalue->data.string
An initial thought is that we might need to modify get_ex, so that data can be copied from pvalue->data.string to the user recv buffer directly.
An initial thought is that we might need to modify get_ex, so that data can be copied from pvalue->data.string to the user recv buffer directly.
The user still needs to allocate the recv buffer. I think the reason to have MAX_VALLEN is not so much that PMI can't deliver a huge message; it is mostly an interface thing. Without a reasonable MAX_VALLEN, we'll always need an extra API for the user to work with -- first query the size, then allocate the buffer, then copy the value out. In fact, the very bug here is a recv buffer overflow, right?
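For illustration, that two-step pattern would look something like the sketch below; none of these function names exist in PMI, they are placeholders:

/* Hypothetical sketch of the extra API that would be needed without a
 * MAX_VALLEN bound: first query the length, then allocate, then copy out.
 * get_value_len() and get_value() are placeholders, not PMI functions. */
#include <stdlib.h>

int get_value_len(const char *key, int *len);            /* placeholder */
int get_value(const char *key, char *buf, int maxlen);   /* placeholder */

char *fetch_value(const char *key)
{
    int len = 0;
    if (get_value_len(key, &len) != 0)
        return NULL;
    char *buf = malloc(len + 1);
    if (buf != NULL)
        get_value(key, buf, len + 1);
    return buf;
}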
An initial thought is that we might need to modify get_ex, so that data can be copied from pvalue->data.string to the user recv buffer directly.
Oh, the tricky part is that we are not putting/getting the original message directly; we are transmitting the encoded message, which is bigger than the original message and thus won't fit into the user-allocated buffer. I guess if we can assume the encoded message is double the size of the original, allocate that size for the recv buffer, and modify get_ex, it probably can work. But honestly I don't think it is elegant either. The segmentation code is already there, so why not just use it and keep the code simple?
If you worry about performance, we can always set MAX_VALLEN to a bigger value, e.g. 16k. The segmentation code is a fail-safe, so our code is robust against unforeseen situations.
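For reference, a minimal sketch of what the segmented put could look like; this is not the actual put_ex code, and kvs_put() and the "-seg-N" key suffix are placeholders:

/* Sketch of the segmentation idea: split a value longer than the per-key
 * limit across numbered sub-keys so each piece fits, and concatenate the
 * pieces again on the get side. */
#include <stdio.h>
#include <string.h>

#define SEG_LEN 1024                       /* stands in for pmi_max_val_size */

int kvs_put(const char *key, const char *val);     /* hypothetical primitive */

void put_segmented(const char *key, const char *val, int len)
{
    char seg_key[64], seg[SEG_LEN + 1];
    for (int i = 0, off = 0; off < len; i++, off += SEG_LEN) {
        int n = (len - off < SEG_LEN) ? len - off : SEG_LEN;
        snprintf(seg_key, sizeof(seg_key), "%s-seg-%d", key, i);
        memcpy(seg, val + off, n);
        seg[n] = '\0';
        kvs_put(seg_key, seg);
    }
}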
@hzhou I don't understand the PMI code well enough, so I cannot make a design decision now. I will try to spend more time on it and fix it later. I guess the fix is not super urgent as we can work around it by either increasing the buffer or switching to hydra on Summit.
One thing we can investigate with PMIx is using pmix_byte_object_t rather than the string type for the business cards. We may be able to skip the encode/decode step entirely and just send the raw address+size in a single step.
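If I read the PMIx headers correctly, that would look roughly like the sketch below; the "mpich-bc" key name and the error handling are placeholders, not actual MPICH code.

/* Rough sketch of exchanging a binary business card as a PMIX_BYTE_OBJECT,
 * avoiding the hex encode/decode step. */
#include <pmix.h>

static int put_bc(char *addr, size_t addr_len)
{
    pmix_value_t val;
    PMIX_VALUE_CONSTRUCT(&val);
    val.type = PMIX_BYTE_OBJECT;
    val.data.bo.bytes = addr;
    val.data.bo.size = addr_len;
    if (PMIx_Put(PMIX_GLOBAL, "mpich-bc", &val) != PMIX_SUCCESS)
        return -1;
    return (PMIx_Commit() == PMIX_SUCCESS) ? 0 : -1;
}

static int get_bc(const pmix_proc_t *peer, char **addr, size_t *addr_len)
{
    pmix_value_t *pvalue = NULL;
    if (PMIx_Get(peer, "mpich-bc", NULL, 0, &pvalue) != PMIX_SUCCESS)
        return -1;
    *addr = pvalue->data.bo.bytes;        /* raw bytes, no decode needed */
    *addr_len = pvalue->data.bo.size;
    /* a real implementation would copy out and PMIX_VALUE_RELEASE(pvalue) */
    return 0;
}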
@raffenet why do we have the encode/decode steps in PMI1/PMI2?
@raffenet why do we have the encode/decode steps in PMI1/PMI2?
Because the PMI1/PMI2 protocol only handles ASCII strings, I believe.
@raffenet why do we have the encode/decode steps in PMI1/PMI2?
Because the PMI1/PMI2 protocol only handles ASCII strings, I believe.
That's right. Only PMIx supports binary blob data.
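For context, the encode step that pairs with the hex()/decode() helpers quoted earlier looks roughly like this (a sketch, not the exact MPICH source); it shows why the ASCII-encoded value is twice the size of the raw business card:

/* Sketch of the hex encoding forced by PMI1/PMI2's ASCII-only values:
 * every input byte becomes two hex characters, doubling the value size,
 * which is the size mismatch discussed above. */
void encode_hex(const unsigned char *src, int n, char *dest)
{
    static const char digits[] = "0123456789ABCDEF";
    for (int i = 0; i < n; i++) {
        dest[2 * i]     = digits[src[i] >> 4];
        dest[2 * i + 1] = digits[src[i] & 0xF];
    }
    dest[2 * n] = '\0';
}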
I tried to follow the instructions on the wiki but didn't get it working (trying both the commit mentioned on the wiki as well as current main).
The error I get is:
jsrun --nrs 6 --tasks_per_rs 1 --cpu_per_rs 7 --gpu_per_rs 1 --rs_per_host 6 --smpiargs="-disable_gpu_hooks" ./myapp
[1642360280.626409] [h36n14:2999445:0] address.c:1059 UCX ERROR failed to parse address: number of addresses exceeds 128
[1642360280.626413] [h36n14:2999447:0] address.c:1059 UCX ERROR failed to parse address: number of addresses exceeds 128
[1642360280.626411] [h36n14:2999448:0] address.c:1059 UCX ERROR failed to parse address: number of addresses exceeds 128
[1642360280.626413] [h36n14:2999446:0] address.c:1059 UCX ERROR failed to parse address: number of addresses exceeds 128
[1642360280.626411] [h36n14:2999449:0] address.c:1059 UCX ERROR failed to parse address: number of addresses exceeds 128
[1642360280.626413] [h36n14:2999450:0] address.c:1059 UCX ERROR failed to parse address: number of addresses exceeds 128
Abort(138006287) on node 3 (rank 3 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(59).............: MPI_Init(argc=0x7fffc40ab630, argv=0x7fffc40ab638) failed
MPII_Init_thread(217).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(722).........:
MPIR_Comm_commit_internal(510):
MPID_Comm_commit_pre_hook(158):
MPIDI_UCX_init_world(288).....:
initial_address_exchange(145).: ucx function returned with failed status(ucx_init.c 145 initial_address_exchange Invalid parameter)
Abort(272224015) on node 5 (rank 5 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(59).............: MPI_Init(argc=0x7fffebfe66f0, argv=0x7fffebfe66f8) failed
MPII_Init_thread(217).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(722).........:
MPIR_Comm_commit_internal(510):
MPID_Comm_commit_pre_hook(158):
MPIDI_UCX_init_world(288).....:
initial_address_exchange(145).: ucx function returned with failed status(ucx_init.c 145 initial_address_exchange Invalid parameter)
Abort(3788559) on node 0 (rank 0 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(59).............: MPI_Init(argc=0x7ffff63599d0, argv=0x7ffff63599d8) failed
MPII_Init_thread(217).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(722).........:
MPIR_Comm_commit_internal(510):
MPID_Comm_commit_pre_hook(158):
MPIDI_UCX_init_world(288).....:
initial_address_exchange(145).: ucx function returned with failed status(ucx_init.c 145 initial_address_exchange Invalid parameter)
[h36n14:2999450:0:2999450] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[h36n14:2999445:0:2999445] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
Abort(406441743) on node 4 (rank 4 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(59).............: MPI_Init(argc=0x7ffff5ab77a0, argv=0x7ffff5ab77a8) failed
MPII_Init_thread(217).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(722).........:
MPIR_Comm_commit_internal(510):
MPID_Comm_commit_pre_hook(158):
MPIDI_UCX_init_world(288).....:
initial_address_exchange(145).: ucx function returned with failed status(ucx_init.c 145 initial_address_exchange Invalid parameter)
[h36n14:2999449:0:2999449] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
Abort(943312655) on node 2 (rank 2 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(59).............: MPI_Init(argc=0x7ffff3b8e630, argv=0x7ffff3b8e638) failed
MPII_Init_thread(217).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(722).........:
MPIR_Comm_commit_internal(510):
MPID_Comm_commit_pre_hook(158):
MPIDI_UCX_init_world(288).....:
initial_address_exchange(145).: ucx function returned with failed status(ucx_init.c 145 initial_address_exchange Invalid parameter)
[h36n14:2999447:0:2999447] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
As far as I can tell this is different from the errors reported so far. Shall I open a new issue or keep it here (as the issue title still fits)?
@pgrete Which mpich version were you testing?
I tried commit 219a9006 mentioned in the wiki, as well as main from two days ago.