sst-core icon indicating copy to clipboard operation
sst-core copied to clipboard

--enable-perf-tracking --enable-profile crashes

Open pdbj opened this issue 2 years ago • 11 comments

[Edit to fix phold make command]

New Issue for sst-core

1 - Detailed description of problem or enhancement With --enable-perf-tracking --enable-profile during configure several of my models crash with seg faults. I have not been able to reproduce with the car wash example in simpleSimulation. I can reproduce with a simple PHOLD model (see below).

2 - Describe how to reproduce the issue Use my PHOLD model

  1. First, configure, build and install sst-core: 1.a Patch upstream 83ddb8ca with these commits

(Or 1.a' Use my wip branch)

1.b Configure, build and install sst-elements (so we can get the library directory by sst-config SST_ELEMENT_LIBRARY SST_ELEMENT_LIBRARY_LIBDIR)

  1. Get and build my PHOLD:
$ git clone https://github.com/pdbj/sst-phold
$ cd sst-phold
$ make debug install

# Check that it installed correctly:
$ sst-info phold
PROCESSED 1 .so (SST ELEMENT) FILES FOUND IN DIRECTORY(s) /Users/barnes26/Code/SST/build/profile/lib/sstcore:/usr/local/lib/sst-elements-library
Filtering output on Element = "phold"
================================================================================
ELEMENT 0 = phold ()
Num Components = 2
      Component 0: Phold
...
  1. Run with high verbosity:
$ sst tests/phold.py -- -vvv

phold.py :  Creating PHOLD Benchmark
phold.py :  Importing SST module
phold.py :  Creating 2 LPs
..
phold.py :  Creating complete graph with latency 1 s (1 total)
.
phold.py :  Enabling statistics at level 1
phold.py :  Done

0:[0:0]:Phold-phold_0 [Phold() (Phold.cc:109)] -> [2] Full c'tor() @0x12479c0, id: 0, name: phold_0
0:[0:0]:Phold-phold_0 [ShowConfiguration() (Phold.cc:271)] -> [2] 
0:[0:0]:Phold-phold_0 [ShowConfiguration() (Phold.cc:278)] -> [3]   TIMEFACTOR: 0.000001, timeConverter factor: 1000000, period: 1 us (0.000001 s?)
0:[0:0]:Phold-phold_0 [ShowConfiguration() (Phold.cc:290)] -> [3]   min: 1 s, duty: 100 m, df: 0.100000
0:[0:0]:Phold-phold_0 [ShowConfiguration() (Phold.cc:296)] -> [3]   m_ev: 1, ev_win: 0.100000, min_ev_win: 10.000000, min_ev: 100
0:[0:0]:Phold-phold_0 [ShowConfiguration()] -> PHOLD Configuration:
    Remote LP fraction:                   0.9
    Minimum inter-event delay:            1 s
    Additional exponential average delay: 9 s
    Stop time:                            10 s
    Number of LPs:                        2
    Number of initial events per LP:      1
    Average events per window:            0.1
      (Too low!  Suggest setting '--events=100')
    Expected total number of events:      2
    Output delay histogram:               no
    Sampling:                             rng
    Optimization level:                   debug
    Verbosity level:                      3

0:[0:0]:Phold-phold_0 [ShowConfiguration()] -> SST Configuration:
    Rank, thread:                         0, 0
    Total ranks, threads:                 1, 1
    Run mode:                             BOTH

0:[0:0]:Phold-phold_0 [ShowSizes() (Phold.cc:360)] -> [2] 
0:[0:0]:Phold-phold_0 [ShowSizes()] -> Sizes of objects:
    Phold:                                            432 (class instance)

    Plus heap allocated:
    SST::RNG::MersenneRNG:                             24 (m_rng)
    SST::RNG::MarsagliaRNG:                            16 (m_remRng)
    SST::RNG::SSTUniformDistribution:                  32 (m_nodeRng)
    SST::RNG::SSTExponentialDistribution:              32 (m_delayRNg)
    SST::Statistics::AccumulatorStatistic<uint64_t>:  384 (m_sendCount)
    SST::Statistics::AccumulatorStatistic<uint64_t>:  384 (m_recvCount)
    SST::Statistics::HistogramStatistic<uint64_t>:    464 (m_delays)
      (Bins are stored in a map, so additional 3 * 8 bytes per bin.)
    Subtotal heap allocated: :                        1336
    SST::Link:                                         64 (N * (N - 1) links total)

    Other components:
    SST::UnitAlgebra:                                  96 (statics TIMEBASE, m_average)
    SST::TimeConverter:                                 8 (static m_timeConverter)
    SST::Output:                                      144 (m_output, included in Phold)
    SST::Core::ThreadSafe::Barrier:                    64 (many instances in Simulator_impl)
    std::atomic<bool>:                                  1 (used by Barrier)
    std::atomic<std::size_t>:                           8 (used by Barrier)
    std::string:                                       32 (VERBOSE_PREFIX, included in Phold)

0:[0:0]:Phold-phold_0 [Phold() (Phold.cc:135)] -> [3] Initializing RNGs
0:[0:0]:Phold-phold_0 [Phold() (Phold.cc:152)] -> [3] Configuring links:
0:[0:0]:Phold-phold_0 [Phold() (Phold.cc:184)] -> [3] Initializing statistics
0:[0:0]:Phold-phold_0 [Phold() (Phold.cc:188)] -> [3]   Setting stopat to 10 s
0:[0:0]:Phold-phold_1 [Phold() (Phold.cc:109)] -> [2] Full c'tor() @0x12e6c50, id: 1, name: phold_1
0:[0:0]:Phold-phold_1 [Phold() (Phold.cc:135)] -> [3] Initializing RNGs
0:[0:0]:Phold-phold_1 [Phold() (Phold.cc:152)] -> [3] Configuring links:
0:[0:0]:Phold-phold_1 [Phold() (Phold.cc:184)] -> [3] Initializing statistics
0:[0:0]:Phold-phold_1 [Phold() (Phold.cc:188)] -> [3]   Setting stopat to 10 s
0:[0:0]:Phold-phold_1 [init() (Phold.cc:596)] -> [2] depth: 1, phase: 0, begin: 0, end: 1
0:[0:0]:Phold-phold_1 [init() (Phold.cc:601)] -> [3]   checking for early events
0:[0:0]:Phold-phold_1 [checkForEvents() (Phold.cc:554)] -> [3]   checking link 0
0:[0:0]:Phold-phold_1 [getEvent() (Phold.cc:541)] -> [3]     getting event from link 0
0:[0:0]:Phold-phold_1 [getEvent() (Phold.cc:543)] -> [3]     got (nil)
0:[0:0]:Phold-phold_1[quartz770:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)

The segfault occurs in SST::Link::recvUntimedData():

Backtrace:
sst(SST::Link::recvUntimedData()+0x48)[0x8d5cd4]
/g/g17/barnes/Code/SST/profile/default/lib/sst-elements-library/libphold.so(Phold::InitEvent* Phold::Phold::getEvent<Phold::InitEvent>(unsigned long)+0x62)[0x2aaab7e4154e]
/g/g17/barnes/Code/SST/profile/default/lib/sst-elements-library/libphold.so(void Phold::Phold::checkForEvents<Phold::InitEvent>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0xc4)[0x2aaab7e41680]
/g/g17/barnes/Code/SST/profile/default/lib/sst-elements-library/libphold.so(Phold::Phold::init(unsigned int)+0x22b)[0x2aaab7e3913f]
sst(SST::Simulation_impl::initialize()+0x110)[0x8e7dce]
sst[0x825cd8]
sst(main+0x1ab7)[0x827964]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaad18c555]
sst[0x8245b9]

But I don't think that's the source of the problem: if I comment out all delete event in my code, it progresses further, but still eventually seg faults.

3 - Diagnosis

I think the problem arises when the normal event lifetime pattern encounters event profiling.

Here is the "normal" pattern:

// From model code: create and send an event:
auto event = new Event(...);
link->send(delay, event);

// In simulator.cc:Simulation_impl::run():685
    while ( LIKELY(!endSim) ) {
        current_activity = timeVortex->pop();
        currentSimCycle  = current_activity->getDeliveryTime();
        currentPriority  = current_activity->getPriority();
        current_activity->execute();

// In event.cc:Event::execute():59:
        (*functor)
    (this);
// where functor is the event handler

// Finally, back in model code, the event handler 
void ...::handleEvent(SST::Event *ev)
{
  auto event = dynamic_cast<...*>(ev);
  // Extract data from the event

  // Done with the event, delete it, to match the new in the sending code
  delete event;
}

Before profiling was introduced (or without it enabled) deleting the Event in the handler was harmless, if a little odd. It amounted to delete this;. See this isocpp FAQ

But with profiling enabled, Event::execute() continues after calling the handler:

void
Event::execute(void)
{

#if SST_EVENT_PROFILING
    SST_EVENT_PROFILE_START
#endif

        (*functor)
    (this);

#if SST_EVENT_PROFILING
    Simulation_impl* sim = Simulation_impl::getSimulation();

    SST_EVENT_PROFILE_STOP

    // Track sending and receiving counters
    auto eventCount = sim->eventRecvCounters.find(getLastComponentName());
...

The killer is getLastComponentName(), a member function of Event, which is bad enough, but it goes on to return a member variable, last_comp. Boom!

4 - Suggested fix

Suggest pulling all required member variables in to local variables before calling the functor. It appears this could be limited to the return values of getFirstComponentName() and getLastComponentName(), but really all the lookups and registrations could be done in the preamble SST_EVENT_PROFILE_START

I haven't looked at the other places affected by profiling to see if there are similar issues.

5 - What Operating system(s) and versions Two RHL variants, Mac

6 - What version of external libraries (Boost, MPI) Not relevant, running sequentially.

7 - Provide sha1 of all relevant sst repositories (sst-core, sst-elements, etc) sst-core: 83ddb8ca (with my patches) based on upstream df2df5f9

pdbj avatar Mar 12 '22 00:03 pdbj

I've found a fix, see my wip branch at 9ba01bf.

This is sufficient, but it might not be minimal; at least it bounds the solution space.

pdbj avatar Mar 23 '22 00:03 pdbj

@pdbj, I believe this patch is somewhat stale as well. It fails on src/sst/core/event.cc

jleidel avatar May 09 '22 21:05 jleidel

@pdbj, the patch code in event.cc is still too far behind devel to merge. The upstream devel branch has diverged significantly. You might want to check the changes from https://github.com/sstsimulator/sst-core/commit/9b90e81834a123f13c097940aafece479e1c1c43

jleidel avatar May 15 '22 15:05 jleidel

The commit https://github.com/sstsimulator/sst-core/commit/9b90e81834a123f13c097940aafece479e1c1c43 appears to be to master, not devel. In any case that commit makes trivial changes (Simulation::getSimulation --> Simulation_impl::getSimulation). My edits to event.cc also use the latter, and don't touch the two places https://github.com/sstsimulator/sst-core/commit/9b90e81834a123f13c097940aafece479e1c1c43 does, so there shouldn't be a merge conflict.

pdbj avatar May 16 '22 16:05 pdbj

@pdbj, its definitely a merge conflict. If you look at the current event.cc code on the devel branch, the references to SST_EVENT_PROFILE_REG do not exist. As a result, the standard git patch application functions fail to successfully patch/merge the code.

jleidel avatar May 16 '22 16:05 jleidel

Apologies, you actually need two commits:

Pre-register event profiling

then

Fix crash due to reference deleted event state

If I squash these now I'll have to update the rebase links in all the other issues, so I'd rather not...

pdbj avatar May 16 '22 16:05 pdbj

@pdbj, that makes more sense. I'll try to pull both patches and incrementally merge them into the current devel tree.

jleidel avatar May 16 '22 17:05 jleidel

@pdbj, one other thing to note when submitting patches/PRs. Make sure and run the scripts/clang-format-test.sh script with clang-format-12 prior to committing changes. The PRs will fail the frontend tests if the code is not formatted using the embedded clang format template.

jleidel avatar May 18 '22 15:05 jleidel

merged into devel

jleidel avatar May 19 '22 15:05 jleidel