
'_pmi2_kvs_get failed' error when using mpich on NERSC's perlmutter

Open jjepson2 opened this issue 2 years ago • 36 comments

I am getting the following two error messages when trying to use mpich on NERSC's Perlmutter:

PMI2_KVS_Get:_pmi2_kvs_get failed _pmi2_kvs_get:key [-NONEXIST-KEY] was not found.

Do you know what might be causing this??

Thanks.

jjepson2 avatar Sep 21 '23 16:09 jjepson2

It looks like NERSC has this documentation:

" The following error messages always appear at the end of a run

│ Wed May 10 08:28:13 2023: [PE_6]:_pmi2_kvs_get:key [-NONEXIST-KEY] was not found. │ Wed May 10 08:28:13 2023: [PE_6]:PMI2_KVS_Get:_pmi2_kvs_get failed

Even successful runs will output the above message. There will be 2 such messages per task. "

However, I am getting this message in the middle of the run.

jjepson2 avatar Sep 21 '23 16:09 jjepson2

I think one of the auxiliary programs we are using was compiled against the cray-mpich version. I will see if fixing that fixes this.

jjepson2 avatar Sep 21 '23 16:09 jjepson2

Still getting this error at the start of my run, and then the job hangs for the rest of the run after printing the error message.

jjepson2 avatar Sep 22 '23 14:09 jjepson2

@thomasgillis I don't know if this is related to the error with cray-mpich. I am using 8192 processes across 128 nodes on Perlmutter.

jjepson2 avatar Sep 22 '23 14:09 jjepson2

You can ignore these errors. The -NONEXIST-KEY key is a hack for PMI_Barrier: PMI2 does not provide PMI_Barrier, and using PMI2_KVS_Fence as a barrier requires a PMI2_KVS_Get to trigger the wait synchronization. We intentionally fetch the non-existent key -NONEXIST-KEY, and it is expected to come back not found. The message is from the PMI server and it is harmless.
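
For illustration, here is a minimal sketch of that mechanism at the PMI2 client level. This is not MPICH's actual code; it assumes the usual pmi2.h client interface shipped with Slurm/Cray launchers, and error handling is omitted:

/* pmi2_barrier_sketch.c -- conceptual sketch only, not MPICH's implementation */
#include <pmi2.h>

int main(void)
{
    int spawned, size, rank, appnum;
    char jobid[64], value[64];
    int vallen = 0;

    PMI2_Init(&spawned, &size, &rank, &appnum);
    PMI2_Job_GetId(jobid, sizeof(jobid));

    /* "Barrier": Fence synchronizes the key-value space, and a Get of a key
     * that was never Put forces the caller to wait for the fence to finish.
     * The PMI server logs "key [-NONEXIST-KEY] was not found" -- harmless. */
    PMI2_KVS_Fence();
    PMI2_KVS_Get(jobid, -1 /* any rank */, "-NONEXIST-KEY",
                 value, sizeof(value), &vallen);

    PMI2_Finalize();
    return 0;
}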

hzhou avatar Sep 22 '23 14:09 hzhou

@jjepson2 I think the hang is more interesting to understand. How did you compile mpich? Here is what I have used on my side:

CC=gcc CXX=g++ ./configure  \
--prefix=$PWD/_inst \
--with-device=ch4:ofi \
--with-libfabric=/opt/cray/libfabric/1.15.2.0 \
--disable-fortran \
--disable-cxx \
--enable-threads=multiple \
--enable-thread-cs=per-vci \
--enable-g=most

Also, can you pinpoint where it hangs? For example, if you put MPI_Barrier() before and after each main communication phase, which call is the one that never returns?
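
For instance, a bracketing helper along these lines (a generic sketch; the phase labels are whatever makes sense in your code):

#include <mpi.h>
#include <stdio.h>

/* Call this before and after each communication phase: the last "passed:"
 * line that appears in the output tells you which phase never completed. */
static void checkpoint(const char *label, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    MPI_Barrier(comm);
    if (rank == 0) {
        printf("passed: %s\n", label);
        fflush(stdout);
    }
}

Then call, e.g., checkpoint("before solve", MPI_COMM_WORLD) and checkpoint("after solve", MPI_COMM_WORLD) around each phase you suspect.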

thomasgillis avatar Sep 22 '23 14:09 thomasgillis

@thomasgillis I am using the prebuilt (open-source) mpich on Perlmutter, located at /global/common/software/nersc/pe/gnu/mpich/4.1.1/lib on their machine.

jjepson2 avatar Sep 22 '23 14:09 jjepson2

I am trying to diagnose exactly where it hangs, so I will let you know here shortly.

jjepson2 avatar Sep 22 '23 14:09 jjepson2

@thomasgillis All I know currently is that it hangs at the same time it gives the stated error message.

jjepson2 avatar Sep 22 '23 14:09 jjepson2

@thomasgillis Scratch that. It hangs afterwards. Still diagnosing.

jjepson2 avatar Sep 22 '23 14:09 jjepson2

Now it failed in mpi_init for some reason, with a whole slew of errors. I don't know if it's just that mpi_init was taking too long or something, so I'm going to try again. Here's a sample of the errors just so you can see a few of them, because I haven't seen this before:

Fri Sep 22 07:53:36 2023: [PE_5952]:inet_connect:socket error state Connection timed out node = nid006863 controller = nid004225 controller ip = 10.249.1.134
Fri Sep 22 07:53:36 2023: [PE_5952]:conn_to_controller:inet_connect failed
Fri Sep 22 07:53:36 2023: [PE_5952]:control_nets_join:_pmi_inet_setup (full) returned -1
Abort(672257039): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(66)....: MPI_Init(argc=(nil), argv=(nil)) failed
MPII_Init_thread(168):
MPIR_pmi_init(160)...: PMI_Init returned 1
Fri Sep 22 07:53:36 2023: [PE_6080]:inet_recv:inet_recv: recv error on nid006880 from nid006863 (fd=4) Connection reset by peer

jjepson2 avatar Sep 22 '23 14:09 jjepson2

Do you have a reproducer, or is the code available?

thomasgillis avatar Sep 22 '23 15:09 thomasgillis

So I found out that you would have to fill out a user agreement and satisfy export-control requirements before we could grant access. I may do that if needed, but I'm just going to try again first, because sometimes the NERSC nodes are slow to start up, so I think the mpi_init error could just be because it was taking too long.

jjepson2 avatar Sep 22 '23 15:09 jjepson2

We should be able to do that. In the meantime, I would also strongly encourage you to download and install the latest version of mpich. We had issues with the Cray and Slurm launchers, and those should be fixed.

If you git clone the repo, here are the commands I used yesterday on Perlmutter to compile mpich and install it in ${HOME}/mpich/_inst.

# get an interactive node
salloc --nodes 1 --time 01:00:00 --constraint cpu --qos interactive 
# only the first time
git submodule update --init --recursive
./autogen.sh -yaksa-depth=0

# configure
CC=gcc CXX=g++ ./configure  \
--prefix=${HOME}/mpich/_inst \
--with-device=ch4:ofi \
--with-libfabric=/opt/cray/libfabric/1.15.2.0 \
--disable-fortran \
--disable-cxx \
--enable-threads=multiple \
--enable-thread-cs=per-vci \
--enable-g=most

Then you can simply run ${HOME}/mpich/_inst/bin/mpiexec -n ${SLURM_NTASKS} -ppn 128 --bind-to core ./your_program. That will get rid of the slurm/cray launcher layer and also help with debugging :-)

thomasgillis avatar Sep 22 '23 15:09 thomasgillis

Ok awesome. I will let you know what I find on this.

jjepson2 avatar Sep 22 '23 15:09 jjepson2

Does this error mean anything? I got it on my latest run-through of the code (right before the code crashed):

libfabric:203115:1695419415::cxi:core:cxip_send_rdzv_put_cb():3688 nid005507: TXC (0x11253:32:0):: Fatal, unexpected event received: REPLY

jjepson2 avatar Sep 22 '23 21:09 jjepson2

Yes, could you rerun with FI_LOG_LEVEL=Debug and provide the whole log?

thomasgillis avatar Sep 22 '23 22:09 thomasgillis

I ran with that option. The file is 2966351 lines long (partly because each of the 8192 processes writes the same error message to the file). Do you want me to provide a link to the file?

jjepson2 avatar Sep 25 '23 14:09 jjepson2

Here's a link to the file: https://drive.google.com/file/d/1uu9pGkog7rJrXcgn5XUxNGKjVCW2w0HH/view?usp=sharing

jjepson2 avatar Sep 25 '23 14:09 jjepson2

It looks like it is dying (and giving all those errors) on an mpi_wait statement.

jjepson2 avatar Sep 25 '23 14:09 jjepson2

I thinned the file to remove duplicate error messages. Assuming I did that right, this file should be easier to read:

https://drive.google.com/file/d/1Td1UfVlY5BKlVJRRe9kXLizkzNNMvTzp/view?usp=drive_link

jjepson2 avatar Sep 25 '23 14:09 jjepson2

It's hard to understand what is going on without looking at the code, but here is what I can guess from these lines:

1695433619::core:mr:ofi_monitor_subscribe():474<warn> Failed (ret = -14) to monitor addr=0x3a97000 len=4096
1695433619::core:mr:ofi_monitor_subscribe():474<warn> Failed (ret = -14) to monitor addr=0x3a98000 len=4096
1695433620::core:mr:ofi_monitor_subscribe():474<warn> Failed (ret = -14) to monitor addr=0x3a97000 len=4096
1695433620::core:mr:ofi_monitor_subscribe():474<warn> Failed (ret = -14) to monitor addr=0x3a98000 len=4096
1695433647::cxi:core:cxip_send_rdzv_put_cb():3688<warn> nid005116: TXC (0x10800:38:0):: Fatal, unexpected event received: REPLY
1695433647::cxi:core:cxip_send_rdzv_put_cb():3688<warn> nid005860: TXC (0x11b82:36:0):: Fatal, unexpected event received: REPLY
1695433647::cxi:cq:cxip_cq_req_free_no_lock():177<warn> nid004981: Failed to unmap request: 0x90c2850
1695433648::cxi:core:cxip_send_rdzv_put_cb():3688<warn> nid004492: TXC (0xf2a3:49:0):: Fatal, unexpected event received: REPLY

This means that your memory has an issue: error 14 is EFAULT, if I am correct (https://github.com/ofiwg/libfabric/blob/d8a978190dc14fd67f0fc076298c359482681f96/include/rdma/fi_errno.h#L60). When you call MPI_Wait we progress the requests, and the error happens in the rendezvous protocol when the provider tries to access the data. I would assume that you have freed the data or moved it. Are you sure your data is still around when you call MPI_Wait?
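
For concreteness, the failure mode being described looks like this generic pattern (a sketch, not superlu_dist's actual code):

#include <mpi.h>
#include <stdlib.h>

void broken_send(int dest, MPI_Comm comm)
{
    MPI_Request req;
    int count = (1 << 20) / sizeof(double);      /* large message -> rendezvous path */
    double *buf = malloc((size_t)count * sizeof(double));

    /* ... fill buf ... */
    MPI_Isend(buf, count, MPI_DOUBLE, dest, 0, comm, &req);
    free(buf);                          /* BUG: buffer released before the send completes */
    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* rendezvous read now touches freed memory */
    /* Correct order: MPI_Wait first, then free(buf). */
}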

thomasgillis avatar Sep 25 '23 16:09 thomasgillis

I'm looking at that. Unfortunately the mpi_wait statement is in a code that we link against for use in NIMROD (the matrix solver we use, superlu_dist). So I am trying to look at their code to see if something is happening to the data before the mpi_wait.

jjepson2 avatar Sep 25 '23 18:09 jjepson2

What happens if mpi_wait is called with a null or inactive request argument? Does mpich allow a call like that? (the open-mpi mpi_wait seems to allow that, so just wondering)

jjepson2 avatar Sep 25 '23 19:09 jjepson2

It should be fine. The error you get is generated when reading the actual buffer of data :-) The way rendezvous usually works is along the lines of: (1) register the memory, (2) send the memory address to the receiver, (3) the receiver reads the memory. In your case, it seems that (3) is failing because of an invalid memory access (error 14), so you should look for something like wrong size or count arguments, or memory that has been freed.
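
On the MPI_REQUEST_NULL question specifically: per the MPI standard, MPI_Wait on a null (or inactive) request returns immediately with an empty status, so that call itself is not the problem. A minimal check:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Request req = MPI_REQUEST_NULL;
    MPI_Status status;
    /* Legal per the MPI standard: completes immediately with an "empty"
     * status (source MPI_ANY_SOURCE, tag MPI_ANY_TAG, no error). */
    MPI_Wait(&req, &status);
    printf("MPI_Wait on MPI_REQUEST_NULL returned\n");

    MPI_Finalize();
    return 0;
}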

Does your simulation go through with OpenMPI?

thomasgillis avatar Sep 25 '23 20:09 thomasgillis

Ok, I will keep looking at that. I haven't tried OpenMPI yet.

jjepson2 avatar Sep 25 '23 20:09 jjepson2

Both OpenMPI and MPICH (and Cray MPICH) are going to use the provided libfabric library, so you should be able to compile and run with it if you think it would help.

thomasgillis avatar Sep 25 '23 20:09 thomasgillis

When you mention your third point above, are you saying the receiver is trying to read the memory during its (later) receive call? Because I thought mpi_wait was just to make sure that the buffer on the sender's end is free to use.

jjepson2 avatar Sep 25 '23 20:09 jjepson2

So typically (1) and (2) will happen when calling MPI_Isend (or MPI_Start for persistent send/recv), but (3) is going to happen during MPI_Wait. Once MPI_Wait returns, the buffer is indeed free to be used.
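
Schematically, for both the nonblocking and persistent cases (a generic sketch of the rendezvous-path buffer lifetime, not your code):

#include <mpi.h>

/* Nonblocking send: registration/handshake start at MPI_Isend, the receiver's
 * rendezvous read of the buffer completes during MPI_Wait. */
void nonblocking_send(double *buf, int n, int dest, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Isend(buf, n, MPI_DOUBLE, dest, 0, comm, &req);  /* (1) + (2) */
    /* buf must stay allocated and unmodified here */
    MPI_Wait(&req, MPI_STATUS_IGNORE);                   /* (3) completes here */
    /* from this point on, buf may be reused or freed */
}

/* Persistent send: same lifetime rule, with MPI_Start in place of MPI_Isend. */
void persistent_send(double *buf, int n, int dest, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Send_init(buf, n, MPI_DOUBLE, dest, 0, comm, &req);
    MPI_Start(&req);                                     /* (1) + (2) */
    MPI_Wait(&req, MPI_STATUS_IGNORE);                   /* (3) completes here */
    MPI_Request_free(&req);
}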

thomasgillis avatar Sep 25 '23 20:09 thomasgillis

Oh, I see. Yeah, that makes sense. So the memory it's reading is corrupted by then.

jjepson2 avatar Sep 25 '23 20:09 jjepson2