Fix usage of SLIC abort functionality

Open white238 opened this issue 2 years ago • 1 comments

There is a large PR (https://github.com/LLNL/axom/pull/868) in Axom that fixes how SLIC handles error states. Currently there is no guarantee that error messages are output or that the program exits without hanging. Once that PR is merged, we need to create a new exit function that makes no collective calls (SLIC flushing, for example) and calls MPI_Abort(). It should be like axom::utilities::processAbort(), but with an added call to the new SLIC function slic::outputLocalMessages(), since we are guaranteed to have SLIC; Axom's version does not make that call.

Also verify this works how I think it does:

SLIC_ERROR -> outputs error message -> registered SLIC abort function -> outputs local messages, non collectively -> doesn't hang all nodes
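
A minimal sketch of the intended flow, written with stand-in types rather than the real SLIC or MPI APIs. `LocalLogger`, `makeAbort`, and the `terminate` callback are hypothetical names used only for illustration; in the real code, `terminate` would presumably be a call to MPI_Abort() and the local flush would be slic::outputLocalMessages().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a per-rank logger. This is NOT the SLIC API.
struct LocalLogger {
  std::vector<std::string> pending;  // messages buffered on this rank only
  std::function<void()> abortFn;     // registered abort function, if any

  // Models SLIC_ERROR: record the message, then call the abort function.
  void error(const std::string& msg) {
    pending.push_back("[ERROR] " + msg);
    if (abortFn) abortFn();
  }

  // Writes only this rank's buffered messages. No communication with
  // other ranks happens here, so it cannot block waiting on them.
  std::size_t outputLocalMessages(std::ostream& os) {
    const std::size_t count = pending.size();
    for (const auto& m : pending) os << m << '\n';
    os.flush();
    pending.clear();
    return count;
  }
};

// Builds an abort function that flushes locally and then terminates.
// `terminate` is an assumption standing in for MPI_Abort(comm, code);
// it is injected so the flow can be exercised without MPI.
inline std::function<void()> makeAbort(LocalLogger& log,
                                       std::ostream& os,
                                       std::function<void(int)> terminate) {
  return [&log, &os, terminate]() {
    log.outputLocalMessages(os);  // non-collective flush first
    terminate(EXIT_FAILURE);      // then bring the job down
  };
}
```

The ordering is the point of the sketch: the local flush happens before termination, and nothing in the abort path waits on another rank.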

white238, Aug 17 '22 18:08

Apparently it does not work that way:

"This routine should not be used from within a signal handler." from https://www.mpich.org/static/docs/v3.1/www3/MPI_Abort.html

Might be something here:

https://www.mpich.org/static/docs/latest/www3/MPI_Comm_set_errhandler.html

white238, Aug 17 '22 18:08

Fixed in #778 and #751

white238, Sep 12 '22 22:09