WarpX
simulation failed with MPICH error in PMPI_Send
One of my 512-node simulations on Perlmutter crashed with the following error:
MPICH ERROR [Rank 409] [job id 3072903.0] [Wed Aug 31 00:41:07 2022] [nid001780] - Abort(740940175) (rank 409 in comm 0): Fatal error in PMPI_Send: Other MPI error, error stack:
PMPI_Send(163).................: MPI_Send(buf=0x7fad8a147f90, count=1706944, MPI_CHAR, dest=408, tag=1615155, comm=0x84000003) failed
PMPI_Send(143).................:
MPIR_Wait_impl(41).............:
MPID_Progress_wait(184)........:
MPIDI_Progress_test(80)........:
MPIDI_OFI_handle_cq_error(1059): OFI poll failed (ofi_events.c:1061:MPIDI_OFI_handle_cq_error:Input/output error - CANCELED)
It did not produce a Backtrace file. Could this possibly be connected to #3349?
The simulation still had
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=single:1
instead of
#SBATCH --gpus-per-node=4
# expose one GPU per MPI rank
export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID
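For reference, the fuller submit-script pattern this corresponds to looks roughly like the following (just a sketch; the executable and input file names are placeholders, and Perlmutter GPU nodes have 4 GPUs each):
#SBATCH --gpus-per-node=4
# expose one GPU per MPI rank (placeholder executable/input names)
export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID
srun ./warpx.3d inputs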
I can provide the input script and the submit script, but this was part of a series of runs on either 128 or 512 nodes, of which the others ran successfully, so I don't know exactly how to reproduce this in a smaller setup.
Can you please retry with #3349 applied and close the issue if it is resolved?
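For reference, one minimal way to pull the patch in locally would be something like this (assuming origin points at the GitHub repo; the branch name is just a placeholder):
# fetch PR #3349 into a local branch and rebuild from it
git fetch origin pull/3349/head:pr-3349-test
git checkout pr-3349-test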
Looking at similar simulations (just slight variations in the physics parameters, running on 2048 GPUs), I am getting the following warning after applying #3349:
Multiple GPUs are visible to each MPI rank, but the number of GPUs per socket or node has not been provided. This may lead to incorrect or suboptimal rank-to-GPU mapping.!
Unfortunately, these simulations also crashed, with the following errors:
terminate called after throwing an instance of 'std::length_error'
what(): cannot create std::vector larger than max_size()
and again
MPICH ERROR [Rank 1191] [job id 3088925.0] [Fri Sep 2 13:51:54 2022] [nid003053] - Abort(541707407) (rank 1191 in comm 0): Fatal error in PMPI_Reduce_scatter: Other MPI error, error stack:
PMPI_Reduce_scatter(606)..........................: MPI_Reduce_scatter(sbuf=0x2cc9eb10, rbuf=0x7ffca1fd9138, rcnts=0x2cca2b20, datatype=MPI_LONG, op=MPI_SUM, comm=comm=0x84000003) failed
MPIDI_Reduce_scatter_intra_composition_alpha(1295):
MPIDI_NM_mpi_reduce_scatter(623)..................:
MPIR_Reduce_scatter_intra_recursive_halving(239)..:
MPIC_Sendrecv(338)................................:
MPIC_Wait(71).....................................:
MPIR_Wait_impl(41)................................:
MPID_Progress_wait(184)...........................:
MPIDI_Progress_test(80)...........................:
MPIDI_OFI_handle_cq_error(1059)...................: OFI poll failed (ofi_events.c:1061:MPIDI_OFI_handle_cq_error:Input/output error - CANCELED)
@kngott @WeiqunZhang it looks like setting export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID does not work on Perlmutter? Or srun with --gpus-per-node=4 overwrites our export.
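One quick way to check what each rank actually ends up seeing would be something like the sketch below (single node with 4 tasks assumed; the single quotes matter so the variables expand inside each task rather than in the batch script):
srun -N 1 -n 4 --gpus-per-node=4 bash -c 'echo "rank $SLURM_PROCID localid $SLURM_LOCALID CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'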
A fix to retry with is proposed in #3375 :)