axom icon indicating copy to clipboard operation
axom copied to clipboard

Don't call flushStreams() from Logger::logMessage() for Errors and Warnings

Open gzagaris opened this issue 5 years ago • 5 comments

The Logger::logMessage() method calls flushStreams() for ERROR and WARNING messages. In the context of MPI applications, that use MPI-aware LogStream instances, e.g., SynchronizedStream or LumberjackStream, the call to flushStreams() is a collective operation. This makes logging of all ERROR and WARNING messages a collective operation. However, this behavior is opaque to the application developer, making it easy to write erroneous code that hangs.

For example, the following will always hang:

SLIC_ERROR_IF( rank==0, "error from rank 0" );

Instead of hiding the call to flushStreams() inside the Logger::logMessage(), flushStreams() must be placed within application-level macros that better indicate that the call is collective.

For example:

SLIC_ALL_ERROR()
SLIC_ALL_WARNING()

Need some sort of asynchronous coordination across all MPI ranks

gzagaris avatar May 13 '19 15:05 gzagaris

Bump. I just ran into this.

SLIC_ASSERT is also not safe in an MPI context unless all threads are calling it.

kennyweiss avatar Jul 30 '21 02:07 kennyweiss

@kennyweiss any thought on how best to handle this sort of thing?

rhornung67 avatar Mar 04 '22 21:03 rhornung67

@kennyweiss any thought on how best to handle this sort of thing?

I just ran into this again last week. The problem is actually more serious than we originally thought. I.e. the problem is not that is "[makes] it easy to write erroneous code that hangs", as in the description above. Rather, it is flawed by when calling SLIC_ASSERT as designed.

If you call SLIC_ASSERT in an mpi run with an MPI-based LogStream, with a condition that is true on some ranks, but not others, it will cause a hang in the application since some ranks will call logMessage (with an internal flush) and others will not.

Specifically, calling SLIC_ASSERT: https://github.com/LLNL/axom/blob/6cbb6a1000cbd55711611fe4e6024f5c97ffaa55/src/axom/slic/interface/slic_macros.hpp#L169-L178

Will eventually invoke flushStreams on a subset of the ranks https://github.com/LLNL/axom/blob/0fbe605937207e4c60f427423fcdad59246fc828/src/axom/slic/core/Logger.cpp#L196-L225

Edit:

Proposed solution

I think the solution is to rework how logMessage and our macros (like SLIC_ASSERT) work. When SLIC_ASSERT or SLIC_ERRROR finds a bad solution, we should raise a flag that the application needs to flush and exit. MPI-based LogStreams should then broadcast the error and flush and exit together.

kennyweiss avatar Jun 07 '22 17:06 kennyweiss

This issue was erroneously closed with #811, but is still unresolved

kennyweiss avatar Jun 09 '22 18:06 kennyweiss

 #define SLIC_ASSERT(EXP)                                            \ 
   do                                                                \ 
   {                                                                 \ 
    // check EXP \
     if(< this rank failed>)                                                      \ 
     {                                                               \ 
       std::ostringstream __oss;                                     \ 
       __oss << "Failed Assert: " << #EXP << std::ends;              \ 
       axom::slic::logErrorMessage(__oss.str(), __FILE__, __LINE__); \ 
      // Remove flush and exit from above log message
     }                                                               \ 
    // Gather if any ranks failed
    // if ANY ranks then flush and exit all ranks
   } while(axom::slic::detail::false_value) 

white238 avatar Jul 11 '22 21:07 white238

Resolved in #868

kennyweiss avatar Sep 03 '22 01:09 kennyweiss