flux-core
flux-core copied to clipboard
Fatal error in `PMPI_Mrecv`
The Aha Moles team is getting the following MPI_Abort issue. This could be an application issue, but I am still creating this issue ticket as a placeholder to get more details.
There are many of this same step completing successfully; however, after running the loop awhile, they get this error (each time I run the loop):
0.526s: flux-shell[0]: FATAL: MPI_Abort: Fatal error in PMPI_Mrecv:
0.526s: flux-shell[0]: stderr: flux-shell: FATAL: MPI_Abort: Fatal error in PMPI_Mrecv:
0.527s: job.exception type=exec severity=0 MPI_Abort: Fatal error in PMPI_Mrecv:
flux-job: task(s) exited with exit code 1
2021-11-16T21:14:17.546453Z broker.err[0]: rc2.0: /bin/bash -c /p/vast1/atom/gbgmd/gmd_loop_pipeline/job_output_corona/job_20211116-46930/docking_adapter/job_corona121_62709_72/docking/ligand_prep/ligand_prep.flux.sh Exited (rc=1) 1.1s
This is happening in context of a flux script (on corona using flux 0.29) which works fine when running manually.
Tagging @jmast.