
MPI run-time issue with Charm++ example

jscook2345 opened this issue on Mar 31, 2023 · 5 comments

Hello,

I'm getting the following error when running one of the Charm++ examples, and I'm looking for guidance on how to debug it or ideas on what to try next.

Thanks,

Justin

Charm++> Running on MPI version: 3.1
Charm++> level of thread support used: MPI_THREAD_SINGLE (desired: MPI_THREAD_SINGLE)
Charm++> Running in non-SMP mode: 2 processes (PEs)
Converse/Charm++ Commit ID: v7.0.0
Charm++ built without optimization.
Do not use for performance benchmarking (build with --with-production to do so).
Charm++ built with internal error checking enabled.
Do not use for performance benchmarking (build without --enable-error-checking to do so).
Charm++> MPI timer is synchronized
Isomalloc> Synchronized global address space.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 2 hosts (2 sockets x 64 cores x 2 PUs = 256-way SMP)
Charm++> cpu topology info is gathered in 0.004 seconds.
Running Hello on 2 processors for 2000000 elements
MPICH ERROR [Rank 0] [job id 6753376.0] [Fri Mar 31 11:29:51 2023] [nid004265] - Abort(203531535) (rank 0 in comm 0): Fatal error in PMPI_Iprobe: Other MPI error, error stack:
PMPI_Iprobe(126).......: MPI_Iprobe(src=MPI_ANY_SOURCE, tag=1375, comm=0x84000000, flag=0x7fff50e11494, status=0x7fff50e11480) failed
MPID_Iprobe(257).......:
MPIDI_iprobe_safe(118).:
MPIDI_iprobe_unsafe(42):
(unknown)(): Other MPI error

aborting job:
Fatal error in PMPI_Iprobe: Other MPI error, error stack:
PMPI_Iprobe(126).......: MPI_Iprobe(src=MPI_ANY_SOURCE, tag=1375, comm=0x84000000, flag=0x7fff50e11494, status=0x7fff50e11480) failed
MPID_Iprobe(257).......:
MPIDI_iprobe_safe(118).:
MPIDI_iprobe_unsafe(42):
(unknown)(): Other MPI error
srun: error: nid004265: task 0: Exited with exit code 255
srun: launch/slurm: _step_signal: Terminating StepId=6753376.0
srun: error: nid004748: task 1: Terminated
srun: Force Terminated StepId=6753376.0

jscook2345 · Mar 31 '23 20:03

If it helps, this is on Perlmutter: https://www.nersc.gov/systems/perlmutter/

jscook2345 · Mar 31 '23 20:03

How did you build Charm++ (./build charm++ mpi-crayshasta ?), what modules are you using (PrgEnv, mpi, etc.), and what is your run command?

stwhite91 · Mar 31 '23 20:03

Build (tag v7.0.0):

./build charm++ mpi-crayshasta -g 
cd mpi-crayshasta/examples/charm++/hello/1darraymsg
make

Modules:

craype-x86-milan
libfabric/1.15.2.0
craype-network-ofi
xpmem/2.5.2-2.4_3.30__gd0f7936.shasta
PrgEnv-gnu/8.3.3
cray-dsmml/0.2.2
cray-libsci/23.02.1.1
cray-mpich/8.1.24
craype/2.7.19
gcc/11.2.0
perftools-base/23.02.0
cpe/23.02
xalt/2.10.2
cpu/1.0
cray-pmi/6.1.9
craype-hugepages8M

Run:

srun -C cpu -q debug -N 2 -n 2 --ntasks-per-node=1 -c 256 ./hello 2000000

jscook2345 · Mar 31 '23 20:03

That all looks correct. It looks to me like an issue in cray-mpich, or in libfabric below it, since the parameters passed into the MPI_Iprobe() call are all valid. You could try reproducing the error in a standalone MPI_Iprobe test program using the same environment.
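A minimal standalone reproducer might look something like the sketch below (the file name iprobe_test.c and the payload are arbitrary; it just mimics the MPI_Iprobe(MPI_ANY_SOURCE, ...) polling pattern visible in the error stack):

/* iprobe_test.c (hypothetical name): rank 1 sends one message; rank 0
 * polls MPI_Iprobe with MPI_ANY_SOURCE, mirroring the call pattern in
 * the error stack above, then receives the probed message. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        int payload = 42;
        MPI_Send(&payload, 1, MPI_INT, 0, 1375, MPI_COMM_WORLD);
    } else if (rank == 0) {
        int flag = 0;
        MPI_Status status;
        /* Spin on MPI_Iprobe until a message from any source is visible. */
        while (!flag)
            MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
        int payload;
        MPI_Recv(&payload, 1, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 0 received %d from rank %d\n", payload, status.MPI_SOURCE);
    }

    MPI_Finalize();
    return 0;
}

Compiled with the Cray compiler wrapper (cc iprobe_test.c -o iprobe_test) and launched with the same srun line as above, this should help isolate whether the failure comes from cray-mpich/libfabric or from something Charm++-specific.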

stwhite91 · Mar 31 '23 20:03

I'll give that a try. Thanks for the initial look.

If I wanted to hook up a parallel debugger like DDT or TotalView, do I need to do anything different because of Kokkos?

Thanks,

Justin

jscook2345 · Apr 04 '23 01:04