
simulation failed with MPICH error in PMPI_Send

n01r opened this issue 2 years ago • 4 comments

One of my 512 node simulations on Perlmutter crashed with the following error:

MPICH ERROR [Rank 409] [job id 3072903.0] [Wed Aug 31 00:41:07 2022] [nid001780] - Abort(740940175) (rank 409 in comm 0): Fatal error in PMPI_Send: Other MPI error, error stack:
PMPI_Send(163).................: MPI_Send(buf=0x7fad8a147f90, count=1706944, MPI_CHAR, dest=408, tag=1615155, comm=0x84000003) failed
PMPI_Send(143).................: 
MPIR_Wait_impl(41).............: 
MPID_Progress_wait(184)........: 
MPIDI_Progress_test(80)........: 
MPIDI_OFI_handle_cq_error(1059): OFI poll failed (ofi_events.c:1061:MPIDI_OFI_handle_cq_error:Input/output error - CANCELED)

It did not produce a Backtrace file. Could this possibly be connected to #3349? The simulation still had

#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=single:1

instead of

#SBATCH --gpus-per-node=4

# expose one GPU per MPI rank
export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID

I can provide the input script and submit script, but this was part of a series of runs on either 128 or 512 nodes, others of which ran successfully, so I don't know exactly how to reproduce this in a smaller setup.
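
For context, here is a minimal sketch of a full submit-script fragment with the newer settings; the node count, tasks per node, constraint, and executable name are placeholders rather than values from my actual scripts:

#!/bin/bash
#SBATCH --nodes=512
#SBATCH --constraint=gpu
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4

# expose one GPU per MPI rank
# (SLURM_LOCALID is assigned per srun-launched task, so whether this
#  top-level export actually takes effect per rank is discussed below)
export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID

srun ./warpx inputs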

n01r — Aug 31 '22 21:08

Can you please retry with #3349 applied and close the issue if it is resolved?

ax3l — Sep 06 '22 22:09

Looking at similar simulations (just slight variations in physics parameters, running on 2048 GPUs), I am getting the following warning after applying #3349:

Multiple GPUs are visible to each MPI rank, but the number of GPUs per socket or node has not been provided. This may lead to incorrect or suboptimal rank-to-GPU mapping.!

Unfortunately, these simulations also crashed, with the following errors:

terminate called after throwing an instance of 'std::length_error'
  what():  cannot create std::vector larger than max_size()

and, again, an MPICH error:

MPICH ERROR [Rank 1191] [job id 3088925.0] [Fri Sep  2 13:51:54 2022] [nid003053] - Abort(541707407) (rank 1191 in comm 0): Fatal error in PMPI_Reduce_scatter: Other MPI error, error stack:
PMPI_Reduce_scatter(606)..........................: MPI_Reduce_scatter(sbuf=0x2cc9eb10, rbuf=0x7ffca1fd9138, rcnts=0x2cca2b20, datatype=MPI_LONG, op=MPI_SUM, comm=comm=0x84000003) failed
MPIDI_Reduce_scatter_intra_composition_alpha(1295):
MPIDI_NM_mpi_reduce_scatter(623)..................:
MPIR_Reduce_scatter_intra_recursive_halving(239)..:
MPIC_Sendrecv(338)................................:
MPIC_Wait(71).....................................:
MPIR_Wait_impl(41)................................:
MPID_Progress_wait(184)...........................:
MPIDI_Progress_test(80)...........................:
MPIDI_OFI_handle_cq_error(1059)...................: OFI poll failed (ofi_events.c:1061:MPIDI_OFI_handle_cq_error:Input/output error - CANCELED)

n01r — Sep 06 '22 22:09

@kngott @WeiqunZhang it looks like setting export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID does not work on Perlmutter?

Or srun with --gpus-per-node=4 overrides our export.
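
A quick way to check either hypothesis (just a diagnostic sketch inside the allocation, nothing WarpX-specific; the executable name is a placeholder):

# print what each rank actually sees; single quotes so the variables
# are expanded inside each task rather than by the batch shell
srun bash -c 'echo "host=$(hostname) local=$SLURM_LOCALID CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'

# and, if the top-level export turns out to be the problem, one option
# is to evaluate it per task inside the srun command:
srun bash -c 'export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID; exec ./warpx inputs'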

ax3l — Sep 07 '22 17:09

A fix to retry with is proposed in #3375 :)

ax3l — Sep 08 '22 23:09