axom
axom copied to clipboard
Don't call flushStreams() from Logger::logMessage() for Errors and Warnings
The Logger::logMessage() method calls flushStreams()
for ERROR
and WARNING
messages. In the context of MPI applications, that use MPI-aware LogStream instances, e.g., SynchronizedStream
or LumberjackStream
, the call to flushStreams()
is a collective operation. This makes logging of all ERROR
and WARNING
messages a collective operation. However, this behavior is opaque to the application developer, making it easy to write erroneous code that hangs.
For example, the following will always hang:
SLIC_ERROR_IF( rank==0, "error from rank 0" );
Instead of hiding the call to flushStreams()
inside the Logger::logMessage()
, flushStreams()
must be placed within application-level macros that better indicate that the call is collective.
For example:
SLIC_ALL_ERROR()
SLIC_ALL_WARNING()
Need some sort of asynchronous coordination across all MPI ranks
Bump. I just ran into this.
SLIC_ASSERT
is also not safe in an MPI context unless all threads are calling it.
@kennyweiss any thought on how best to handle this sort of thing?
@kennyweiss any thought on how best to handle this sort of thing?
I just ran into this again last week.
The problem is actually more serious than we originally thought. I.e. the problem is not that is "[makes] it easy to write erroneous code that hangs", as in the description above. Rather, it is flawed by when calling SLIC_ASSERT
as designed.
If you call SLIC_ASSERT
in an mpi run with an MPI-based LogStream
, with a condition that is true on some ranks, but not others, it will cause a hang in the application since some ranks will call logMessage
(with an internal flush
) and others will not.
Specifically, calling SLIC_ASSERT
:
https://github.com/LLNL/axom/blob/6cbb6a1000cbd55711611fe4e6024f5c97ffaa55/src/axom/slic/interface/slic_macros.hpp#L169-L178
Will eventually invoke flushStreams
on a subset of the ranks
https://github.com/LLNL/axom/blob/0fbe605937207e4c60f427423fcdad59246fc828/src/axom/slic/core/Logger.cpp#L196-L225
Edit:
Proposed solution
I think the solution is to rework how logMessage
and our macros (like SLIC_ASSERT
) work.
When SLIC_ASSERT
or SLIC_ERRROR
finds a bad solution, we should raise a flag that the application needs to flush and exit. MPI-based LogStreams should then broadcast the error and flush and exit together.
This issue was erroneously closed with #811, but is still unresolved
#define SLIC_ASSERT(EXP) \
do \
{ \
// check EXP \
if(< this rank failed>) \
{ \
std::ostringstream __oss; \
__oss << "Failed Assert: " << #EXP << std::ends; \
axom::slic::logErrorMessage(__oss.str(), __FILE__, __LINE__); \
// Remove flush and exit from above log message
} \
// Gather if any ranks failed
// if ANY ranks then flush and exit all ranks
} while(axom::slic::detail::false_value)
Resolved in #868