Fatal error in `PMPI_Mrecv`

Open dongahn opened this issue 4 years ago • 1 comments

The Aha Moles team is getting the following MPI_Abort issue. This could be an application issue, but I am still creating this issue ticket as a placeholder to get more details.

Many runs of this same step complete successfully; however, after the loop has been running for a while, they get this error (each time I run the loop):

0.526s: flux-shell[0]: FATAL: MPI_Abort: Fatal error in PMPI_Mrecv:
0.526s: flux-shell[0]: stderr: flux-shell: FATAL: MPI_Abort: Fatal error in PMPI_Mrecv:
0.527s: job.exception type=exec severity=0 MPI_Abort: Fatal error in PMPI_Mrecv:
flux-job: task(s) exited with exit code 1
2021-11-16T21:14:17.546453Z broker.err[0]: rc2.0: /bin/bash -c /p/vast1/atom/gbgmd/gmd_loop_pipeline/job_output_corona/job_20211116-46930/docking_adapter/job_corona121_62709_72/docking/ligand_prep/ligand_prep.flux.sh Exited (rc=1) 1.1s

This is happening in the context of a flux script (on corona, using flux 0.29) that works fine when run manually.

dongahn, Nov 16 '21 22:11

Tagging @jmast.

dongahn, Nov 17 '21 01:11