
Issues on running MPICH on Summit

Open shintaro-iwasaki opened this issue 4 years ago • 20 comments

Problem

The current MPICH does not run in the current Summit environment with jsrun; it fails with an error in PMIx_Init().

How to reproduce the issue

I used the latest MPICH together with the default MPI (IBM's Spectrum MPI and its PMIx), and ran the following configure after loading the CUDA and GCC modules:

./configure --with-device=ch4:ucx --prefix=$HOME/software/ci-build --enable-ch4-am-only \
    --with-pm=none --with-pmix=$MPI_ROOT
# if you use CUDA
# ./configure --with-device=ch4:ucx --prefix=$HOME/software/ci-build --enable-ch4-am-only \
#     --enable-gpu-tests-only --with-cuda=$CUDAPATH --with-pm=none --with-pmix=$MPI_ROOT

It shows the following error when I run cpi.

bash-4.2$ echo "$MPI_ROOT"
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/xl-16.1.1-5/spectrum-mpi-10.3.1.2-.../
bash-4.2$ jsrun -n 1 -r 1 -a 1 -g 1 --smpiargs="-disable_gpu_hooks" ./cpi
Abort(201935621) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
PMPI_Comm_size(109): MPI_Comm_size(comm=0x18d12a0, size=0x2000000b049c) failed
PMPI_Comm_size(66).: Invalid communicator

When I use GDB on a compute node, the error appears to be in PMIx_Init(): supposedly, an illegal memory access happens during initialization.

Note the libpmix it is linked against:

libpmix.so.2 =>
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-9.1.0/spectrum-mpi-10.3.1.2-.../lib/libpmix.so.2

It seems that, with exactly the same configure settings, MPICH works over jsrun without any problem on a Summit-like IBM-based system (Ascent: https://docs.olcf.ornl.gov/systems/ascent_user_guide.html), so this might be a Summit-specific issue.

shintaro-iwasaki avatar Sep 29 '20 15:09 shintaro-iwasaki

Workaround (example: running two processes)

The workaround is to set hosts manually, which at least worked on Summit. The following worked with MPICH at commit 404cd8a920f4cda93d3e65dcd4e942948935a8c8.

  1. Compile MPICH with CUDA+UCX (installed to MPICH_CUDA_PATH)
# You also need to install newer libtool/autotools etc. to compile MPICH.
module load cuda/11.0.3 gcc/9.1.0
./configure --with-device=ch4:ucx --prefix=$MPICH_CUDA_PATH --enable-ch4-am-only --enable-gpu-tests-only --with-cuda="$(realpath $(dirname $(which nvcc))/..)" CC=gcc CXX=gcc
  2. Compile MPICH without CUDA (installed to MPICH_NOCUDA_PATH) to get an mpiexec that does not need CUDA.
module load gcc/9.1.0
./configure --with-device=ch4:ucx --prefix=$MPICH_NOCUDA_PATH --enable-ch4-am-only CC=gcc CXX=gcc
  3. Allocate two nodes (you will then be logged in to a batch node)
bsub -W 2:00 -nnodes 2 -P csc371 -Is $SHELL
  4. On the batch node, run the following to print the launch commands, after loading all the modules
# Get LD_LIBRARY_PATH on a compute node
# $ jsrun -n 1 -r 1 echo $LD_LIBRARY_PATH
# Get a list of accessible hosts
# $ jsrun -n 2 -r 1 hostname | paste -d, -s -
echo "# two nodes, one process per node"
echo "LD_LIBRARY_PATH=$(jsrun -n 1 -r 1 echo $LD_LIBRARY_PATH) ${MPICH_NOCUDA_PATH}/bin/mpiexec -host $(jsrun -n 2 -r 1 hostname | paste -d, -s -) -n 2 <APP>"
echo ""
echo "# one node, two processes"
echo "LD_LIBRARY_PATH=$(jsrun -n 1 -r 1 echo $LD_LIBRARY_PATH) ${MPICH_NOCUDA_PATH}/bin/mpiexec -env CUDA_VISIBLE_DEVICES 0 -n 1 <APP> : -env CUDA_VISIBLE_DEVICES 1 -n 1 <APP>"
  5. On the batch node, log in to one of the compute nodes.
ssh $(jsrun -n 1 -r 1 hostname)
  6. On the compute node, run the CUDA-compiled version with the mpiexec command printed above.

Note: on Summit, the home directory is read-only from compute nodes.

  • .ssh/known_hosts might cause an ssh-related issue. You can alias ssh or add known_hosts entries manually to fix it.
  • You might need to run examples/.libs/cpi instead of examples/cpi.

shintaro-iwasaki avatar Sep 29 '20 15:09 shintaro-iwasaki

bash-4.2$ jsrun -n 1 -r 1 -a 1 -g 1 --smpiargs="-disable_gpu_hooks" ./cpi
Abort(201935621) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
PMPI_Comm_size(109): MPI_Comm_size(comm=0x18d12a0, size=0x2000000b049c) failed
PMPI_Comm_size(66).: Invalid communicator

This was caused by Darshan; use module unload darshan-runtime to avoid it.

hzhou avatar Sep 29 '20 15:09 hzhou

Thanks. I will check it.

shintaro-iwasaki avatar Sep 29 '20 16:09 shintaro-iwasaki

It worked. Thanks, @hzhou! I updated the MPICH wiki: https://wiki.mpich.org/mpich/index.php/Summit

I will close this issue.

shintaro-iwasaki avatar Oct 06 '20 20:10 shintaro-iwasaki

Liked the Summit wiki!

minsii avatar Oct 09 '20 16:10 minsii

After module unload darshan-runtime, I see a different PMI error when running with mpich/main + jsrun.

MPICH/main configure:

module load gcc
module load cuda/10.1.243
module unload darshan-runtime

yaksadir=$HOME/git/yaksa/build-cuda10.1.243/install
ucxdir=/autofs/nccs-svm1_home1/minsi/git/ucx-1.10.0/build-cuda10.1.243/install

../configure --prefix=$installdir CC=gcc  CXX=g++ \
  --disable-romio --disable-mpe --disable-ft-tests --disable-spawn --disable-fortran                       \
  --disable-fast --enable-g=all \
  --with-yaksa=$yaksadir    \
  --with-device=ch4:ucx  --with-ucx=$ucxdir  \
  --disable-static --with-cuda=$CUDA_DIR  \
  --with-hwloc=embedded \
  --with-pm=none --with-pmix=$MPI_ROOT \
  CFLAGS=-std=gnu11

Note: $MPI_ROOT was set by spectrum-mpi/10.3.1.2-20200121

Compile test program:

$installdir/bin/mpicc -o cpi ./cpi.c

Execution command with an interactive allocation (two ranks running on a single node)

jsrun -n 2 -r 2 -a 1 -g 1 ./cpi

Error

Abort at src/util/mpir_pmi.c line 1105:

static int hex(unsigned char c)
{
    if (c >= '0' && c <= '9') {
        return c - '0';
    } else if (c >= 'a' && c <= 'f') {
        return 10 + c - 'a';
    } else if (c >= 'A' && c <= 'F') {
        return 10 + c - 'A';
    } else {
        MPIR_Assert(0); <<<< here
        return -1;
    }
}

Some debugging notes

  • Core dump backtrace:
#0  0x000020000080fbf0 in raise () from /lib64/libc.so.6
#1  0x0000200000811f6c in abort () from /lib64/libc.so.6
#2  0x000020000054d39c in hex (c=201 '\311') at ../src/util/mpir_pmi.c:1108
#3  0x000020000054d4c8 in decode (size=514, src=0x42900720 "\311\320\340\360", dest=0x2000ce7d1200 "") at ../src/util/mpir_pmi.c:1126
#4  0x000020000054b700 in get_ex (src=1, key=0x7fffd6b85218 "-allgather-shm-1-1", buf=0x2000ce7d1000, p_size=0x7fffd6b8524c, is_local=0)
    at ../src/util/mpir_pmi.c:474
#5  0x000020000054c568 in MPIR_pmi_allgather_shm (sendbuf=0x428ffef0, sendsize=893, shm_buf=0x2000ce7d0000, recvsize=4096, domain=MPIR_PMI_DOMAIN_ALL)
    at ../src/util/mpir_pmi.c:701
#6  0x00002000005c7850 in MPIDU_bc_table_create (rank=1, size=2, nodemap=0x4232f430, bc=0x428ffef0, bc_len=893, same_len=0, roots_only=0,
    bc_table=0x7fffd6b853b8, ret_bc_len=0x7fffd6b853c0) at ../src/mpid/common/bc/mpidu_bc.c:154
#7  0x00002000005aabec in initial_address_exchange (init_comm=0x0) at ../src/mpid/ch4/netmod/ucx/ucx_init.c:93
#8  0x00002000005abc50 in MPIDI_UCX_mpi_init_hook (rank=1, size=2, appnum=0, tag_bits=0x7fffd6b85504, init_comm=0x0)
    at ../src/mpid/ch4/netmod/ucx/ucx_init.c:277
#9  0x00002000005aba7c in MPIDI_UCX_init_world (init_comm=0x0) at ../src/mpid/ch4/netmod/ucx/ucx_init.c:259
#10 0x0000200000558370 in MPID_Init_world () at ../src/mpid/ch4/src/ch4_init.c:624
#11 0x0000200000557614 in MPID_Init (requested=0, provided=0x2000007b4810 <MPIR_ThreadInfo>) at ../src/mpid/ch4/src/ch4_init.c:474
  • Print exchanged key-value
put_ex: key=-allgather-shm-1-0, bufsize=893, n=1787, strlen=1786, encoded=00DEAC1D5A1824B7F24008E363C902BA22285DA7D377CC2B32004C3E5077CCAB33004F230088420E02800A0000C04108E363C902...
MPIR_pmi_kvs_get: key=-allgather-shm-1-0, strlen=1786, val_size=1024, pvalue->data.string=00DEAC1D5A1824B7F24008E363C902BA22285DA7D377CC2B32004C3E5077CCAB33004F230088420E02800A0000C04108E36...
get_ex: key=-allgather-shm-1-0, size=514, val=1737924512
  • Guessed cause: PMIx_Get receives the entire value (1786 bytes) at optimized_get->MPIR_pmi_kvs_get, but the caller copies only 1024 bytes, which is limited by pmi_max_val_size
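The guessed cause can be reproduced in miniature with a sketch (names and sizes here are illustrative, not the actual MPICH code): decode() consumes exactly 2*size hex characters, so when the get side copies only the first 1024 characters of a 1786-character string, decode eventually reads whatever bytes follow the copied region -- e.g. the 0xC9 ('\311') seen in the backtrace -- and hex() has no valid mapping for them.

```c
#include <assert.h>

/* Mirrors the hex() quoted above, but returns -1 where MPICH asserts. */
static int hex(unsigned char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    else if (c >= 'a' && c <= 'f')
        return 10 + c - 'a';
    else if (c >= 'A' && c <= 'F')
        return 10 + c - 'A';
    else
        return -1;   /* the MPIR_Assert(0) in mpir_pmi.c fires here */
}

/* Decode size bytes from 2*size hex characters; any non-hex character
 * (such as garbage past a truncated copy) makes decoding fail. */
static int decode(const char *src, unsigned char *dst, int size)
{
    for (int i = 0; i < size; i++) {
        int hi = hex((unsigned char) src[2 * i]);
        int lo = hex((unsigned char) src[2 * i + 1]);
        if (hi < 0 || lo < 0)
            return -1;
        dst[i] = (unsigned char) ((hi << 4) | lo);
    }
    return 0;
}
```

Decoding a fully copied value round-trips, while a value whose tail is uninitialized memory (simulated here with the 0xC9/0xD0 bytes from the core dump) fails exactly the way the assertion does.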

Naive fix

Increase pmi_max_val_size in MPIR_pmi_init:

 #elif defined USE_PMIX_API
     pmi_max_key_size = PMIX_MAX_KEYLEN;
-    pmi_max_val_size = 1024;    /* this is what PMI2_MAX_VALLEN currently set to */
+    pmi_max_val_size = 1024*16; 

minsii avatar Mar 12 '21 00:03 minsii

@raffenet Can you please suggest the right fix for the above PMIx bug? I did a naive fix (increasing the value of pmi_max_val_size) on Summit, and now mpich/main + jsrun finally works.

minsii avatar Mar 12 '21 00:03 minsii

TODO: Both hydra and jsrun work with mpich/main on Summit now. Going to write a note at https://wiki.mpich.org/mpich/index.php/Summit

[DONE]

minsii avatar Mar 12 '21 00:03 minsii

Changing pmi_max_val_size will still break if the business card exceeds the new limit, although that seems unlikely today.

In put_ex, we do the segmentation under #if defined(USE_PMI1_API) || defined(USE_PMI2_API), but not under USE_PMIX_API. If you remove the #if switch and always do the segmentation, will it work?
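The segmentation idea can be sketched in miniature (the toy in-memory KVS, key naming, and the tiny MAX_VALLEN below are all illustrative assumptions, not the actual put_ex code): a value longer than the limit is split across derived keys on put and reassembled on get, so no single PMI value ever exceeds the limit.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_VALLEN  8     /* deliberately tiny so segmentation kicks in */
#define MAX_ENTRIES 64

/* A toy in-memory KVS standing in for PMI_KVS_Put / PMI_KVS_Get. */
static struct { char key[64]; char val[MAX_VALLEN + 1]; } kvs[MAX_ENTRIES];
static int nkvs;

static void kvs_put(const char *key, const char *val)
{
    snprintf(kvs[nkvs].key, sizeof kvs[nkvs].key, "%s", key);
    snprintf(kvs[nkvs].val, sizeof kvs[nkvs].val, "%s", val);
    nkvs++;
}

static const char *kvs_get(const char *key)
{
    for (int i = 0; i < nkvs; i++)
        if (strcmp(kvs[i].key, key) == 0)
            return kvs[i].val;
    return NULL;
}

/* Split a long value into MAX_VALLEN-sized segments under derived keys. */
static void put_segmented(const char *key, const char *val)
{
    int len = (int) strlen(val);
    for (int seg = 0; seg * MAX_VALLEN < len || seg == 0; seg++) {
        char segkey[64], chunk[MAX_VALLEN + 1];
        snprintf(segkey, sizeof segkey, "%s-seg-%d", key, seg);
        snprintf(chunk, sizeof chunk, "%.*s", MAX_VALLEN, val + seg * MAX_VALLEN);
        kvs_put(segkey, chunk);
    }
}

/* Reassemble; a missing key or a short final segment ends the value. */
static void get_segmented(const char *key, char *buf, int bufsize)
{
    int off = 0;
    for (int seg = 0;; seg++) {
        char segkey[64];
        snprintf(segkey, sizeof segkey, "%s-seg-%d", key, seg);
        const char *chunk = kvs_get(segkey);
        if (chunk == NULL)
            break;
        off += snprintf(buf + off, (size_t) (bufsize - off), "%s", chunk);
        if ((int) strlen(chunk) < MAX_VALLEN)
            break;
    }
}
```

With an 18-character value and the 8-character limit, put_segmented stores three segments and get_segmented reconstructs the original, regardless of how large the business card grows.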

hzhou avatar Mar 12 '21 01:03 hzhou

Segmentation does not seem like the right solution to me. PMI1 and PMI2 took that approach because they have the PMI2_MAX_VALLEN limit and require the user to provide the receive buffer.

pmi_errno = PMI_KVS_Get(pmi_kvs_name, key, val, val_size);
pmi_errno = PMI2_KVS_Get(pmi_jobid, src, key, val, val_size, &out_len);

But such a limit no longer exists in PMIx (I haven't read the PMIx spec carefully enough, so please correct me if I'm wrong). And now the temporary receive buffer is allocated by PMIx internally.

pmix_value_t *pvalue;
PMIx_Get(&proc, key, NULL, 0, &pvalue);
// copy out from pvalue->data.string

An initial thought is that we might need to modify get_ex so that data can be copied from pvalue->data.string to the user receive buffer directly.

minsii avatar Mar 12 '21 02:03 minsii

An initial thought is that we might need to modify get_ex so that data can be copied from pvalue->data.string to the user receive buffer directly.

The user still needs to allocate the receive buffer. I think the reason to have MAX_VALLEN is not so much that PMI can't deliver a huge message; it is mostly an interface thing. Without a reasonable MAX_VALLEN, we'd always need an extra API for the user to work with -- first query the size, then allocate the buffer, then copy the value out.

In fact, the very bug here is the recv buffer overflow, right?

hzhou avatar Mar 12 '21 03:03 hzhou

An initial thought is that we might need to modify get_ex so that data can be copied from pvalue->data.string to the user receive buffer directly.

Oh, the tricky part is that we do not put/get the original message directly; we transmit the encoded message, which is bigger than the original and thus won't fit into the user-allocated buffer. I guess if we can assume the encoded message is double the size of the original, allocate that size for the receive buffer, and modify get_ex, it could probably work. But honestly I don't think that is elegant either. The segmentation code is already there; why not just use it and keep the code simple?

If you worry about performance, we can always set MAX_VALLEN to a bigger value, e.g. 16k. The segmentation code is a fail-safe that keeps our code robust against unforeseen situations.

hzhou avatar Mar 12 '21 03:03 hzhou

@hzhou I don't understand the PMI code well enough to make a design decision now. I will try to spend more time on it and fix it later. I guess the fix is not super urgent, as we can work around it by either increasing the buffer or switching to hydra on Summit.

minsii avatar Mar 12 '21 18:03 minsii

One thing we can investigate with PMIx is using pmix_byte_object_t rather than the string type for the business cards. We may be able to skip the encode/decode step entirely and just send the raw address+size in a single step.

raffenet avatar Mar 12 '21 21:03 raffenet

@raffenet why do we have the encode/decode steps in PMI1/PMI2?

minsii avatar Mar 13 '21 00:03 minsii

@raffenet why do we have the encode/decode steps in PMI1/PMI2?

Because the PMI1/PMI2 protocol only handles ASCII strings, I believe.

hzhou avatar Mar 13 '21 00:03 hzhou

@raffenet why do we have the encode/decode steps in PMI1/PMI2?

Because the PMI1/PMI2 protocol only handles ASCII strings, I believe.

That's right. Only PMIx supports binary blob data.
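The point about ASCII strings can be shown with a small sketch (bc_encode is a hypothetical name; the bytes are taken from the logged value above): a raw business card may begin with a NUL byte, so a string-based protocol would see it as empty, which is why MPICH hex-encodes values before handing them to PMI1/PMI2, at the cost of doubling their size.

```c
#include <assert.h>
#include <string.h>

/* Hex-encode n raw bytes into an ASCII string of 2*n characters plus a
 * terminating NUL, making arbitrary binary safe for a string protocol. */
static void bc_encode(const unsigned char *src, size_t n, char *dst)
{
    static const char digits[] = "0123456789ABCDEF";
    for (size_t i = 0; i < n; i++) {
        dst[2 * i]     = digits[src[i] >> 4];
        dst[2 * i + 1] = digits[src[i] & 0xF];
    }
    dst[2 * n] = '\0';
}
```

A pmix_byte_object_t, by contrast, carries an explicit pointer+size, so the raw bytes could be transmitted without this expansion.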

raffenet avatar Mar 15 '21 13:03 raffenet

I tried to follow the instructions on the wiki but didn't get it working (trying both the commit mentioned on the wiki and current main). The error I get is:

jsrun --nrs 6  --tasks_per_rs 1 --cpu_per_rs 7 --gpu_per_rs 1 --rs_per_host 6 --smpiargs="-disable_gpu_hooks" ./myapp
[1642360280.626409] [h36n14:2999445:0]         address.c:1059 UCX  ERROR failed to parse address: number of addresses exceeds 128
[1642360280.626413] [h36n14:2999447:0]         address.c:1059 UCX  ERROR failed to parse address: number of addresses exceeds 128
[1642360280.626411] [h36n14:2999448:0]         address.c:1059 UCX  ERROR failed to parse address: number of addresses exceeds 128
[1642360280.626413] [h36n14:2999446:0]         address.c:1059 UCX  ERROR failed to parse address: number of addresses exceeds 128
[1642360280.626411] [h36n14:2999449:0]         address.c:1059 UCX  ERROR failed to parse address: number of addresses exceeds 128
[1642360280.626413] [h36n14:2999450:0]         address.c:1059 UCX  ERROR failed to parse address: number of addresses exceeds 128
Abort(138006287) on node 3 (rank 3 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(59).............: MPI_Init(argc=0x7fffc40ab630, argv=0x7fffc40ab638) failed
MPII_Init_thread(217).........: 
MPIR_init_comm_world(34)......: 
MPIR_Comm_commit(722).........: 
MPIR_Comm_commit_internal(510): 
MPID_Comm_commit_pre_hook(158): 
MPIDI_UCX_init_world(288).....: 
initial_address_exchange(145).:  ucx function returned with failed status(ucx_init.c 145 initial_address_exchange Invalid parameter)
Abort(272224015) on node 5 (rank 5 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(59).............: MPI_Init(argc=0x7fffebfe66f0, argv=0x7fffebfe66f8) failed
MPII_Init_thread(217).........: 
MPIR_init_comm_world(34)......: 
MPIR_Comm_commit(722).........: 
MPIR_Comm_commit_internal(510): 
MPID_Comm_commit_pre_hook(158): 
MPIDI_UCX_init_world(288).....: 
initial_address_exchange(145).:  ucx function returned with failed status(ucx_init.c 145 initial_address_exchange Invalid parameter)
Abort(3788559) on node 0 (rank 0 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(59).............: MPI_Init(argc=0x7ffff63599d0, argv=0x7ffff63599d8) failed
MPII_Init_thread(217).........: 
MPIR_init_comm_world(34)......: 
MPIR_Comm_commit(722).........: 
MPIR_Comm_commit_internal(510): 
MPID_Comm_commit_pre_hook(158): 
MPIDI_UCX_init_world(288).....: 
initial_address_exchange(145).:  ucx function returned with failed status(ucx_init.c 145 initial_address_exchange Invalid parameter)
[h36n14:2999450:0:2999450] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[h36n14:2999445:0:2999445] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
Abort(406441743) on node 4 (rank 4 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(59).............: MPI_Init(argc=0x7ffff5ab77a0, argv=0x7ffff5ab77a8) failed
MPII_Init_thread(217).........: 
MPIR_init_comm_world(34)......: 
MPIR_Comm_commit(722).........: 
MPIR_Comm_commit_internal(510): 
MPID_Comm_commit_pre_hook(158): 
MPIDI_UCX_init_world(288).....: 
initial_address_exchange(145).:  ucx function returned with failed status(ucx_init.c 145 initial_address_exchange Invalid parameter)
[h36n14:2999449:0:2999449] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
Abort(943312655) on node 2 (rank 2 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(59).............: MPI_Init(argc=0x7ffff3b8e630, argv=0x7ffff3b8e638) failed
MPII_Init_thread(217).........: 
MPIR_init_comm_world(34)......: 
MPIR_Comm_commit(722).........: 
MPIR_Comm_commit_internal(510): 
MPID_Comm_commit_pre_hook(158): 
MPIDI_UCX_init_world(288).....: 
initial_address_exchange(145).:  ucx function returned with failed status(ucx_init.c 145 initial_address_exchange Invalid parameter)
[h36n14:2999447:0:2999447] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))

As far as I can tell this is different from the errors reported so far. Shall I open a new issue or keep it here (as the issue title still fits)?

pgrete avatar Jan 16 '22 20:01 pgrete

@pgrete Which mpich version were you testing?

hzhou avatar Jan 18 '22 16:01 hzhou

I tried commit 219a9006 mentioned in the wiki as well as main from two days ago.

pgrete avatar Jan 18 '22 17:01 pgrete