sst-core
--enable-perf-tracking --enable-profile crashes
[Edit to fix phold make command]
New Issue for sst-core
1 - Detailed description of problem or enhancement
With --enable-perf-tracking --enable-profile passed to configure, several of my models crash with segmentation faults. I have not been able to reproduce this with the car wash example in simpleSimulation, but I can reproduce it with a simple PHOLD model (see below).
2 - Describe how to reproduce the issue
Use my PHOLD model
- First, configure, build and install sst-core:
1.a Patch upstream 83ddb8ca with these commits
(Or 1.a' Use my wip branch)
1.b Configure, build and install sst-elements (so we can get the library directory by sst-config SST_ELEMENT_LIBRARY SST_ELEMENT_LIBRARY_LIBDIR)
- Get and build my PHOLD:
$ git clone https://github.com/pdbj/sst-phold
$ cd sst-phold
$ make debug install
# Check that it installed correctly:
$ sst-info phold
PROCESSED 1 .so (SST ELEMENT) FILES FOUND IN DIRECTORY(s) /Users/barnes26/Code/SST/build/profile/lib/sstcore:/usr/local/lib/sst-elements-library
Filtering output on Element = "phold"
================================================================================
ELEMENT 0 = phold ()
Num Components = 2
Component 0: Phold
...
- Run with high verbosity:
$ sst tests/phold.py -- -vvv
phold.py : Creating PHOLD Benchmark
phold.py : Importing SST module
phold.py : Creating 2 LPs
..
phold.py : Creating complete graph with latency 1 s (1 total)
.
phold.py : Enabling statistics at level 1
phold.py : Done
0:[0:0]:Phold-phold_0 [Phold() (Phold.cc:109)] -> [2] Full c'tor() @0x12479c0, id: 0, name: phold_0
0:[0:0]:Phold-phold_0 [ShowConfiguration() (Phold.cc:271)] -> [2]
0:[0:0]:Phold-phold_0 [ShowConfiguration() (Phold.cc:278)] -> [3] TIMEFACTOR: 0.000001, timeConverter factor: 1000000, period: 1 us (0.000001 s?)
0:[0:0]:Phold-phold_0 [ShowConfiguration() (Phold.cc:290)] -> [3] min: 1 s, duty: 100 m, df: 0.100000
0:[0:0]:Phold-phold_0 [ShowConfiguration() (Phold.cc:296)] -> [3] m_ev: 1, ev_win: 0.100000, min_ev_win: 10.000000, min_ev: 100
0:[0:0]:Phold-phold_0 [ShowConfiguration()] -> PHOLD Configuration:
Remote LP fraction: 0.9
Minimum inter-event delay: 1 s
Additional exponential average delay: 9 s
Stop time: 10 s
Number of LPs: 2
Number of initial events per LP: 1
Average events per window: 0.1
(Too low! Suggest setting '--events=100')
Expected total number of events: 2
Output delay histogram: no
Sampling: rng
Optimization level: debug
Verbosity level: 3
0:[0:0]:Phold-phold_0 [ShowConfiguration()] -> SST Configuration:
Rank, thread: 0, 0
Total ranks, threads: 1, 1
Run mode: BOTH
0:[0:0]:Phold-phold_0 [ShowSizes() (Phold.cc:360)] -> [2]
0:[0:0]:Phold-phold_0 [ShowSizes()] -> Sizes of objects:
Phold: 432 (class instance)
Plus heap allocated:
SST::RNG::MersenneRNG: 24 (m_rng)
SST::RNG::MarsagliaRNG: 16 (m_remRng)
SST::RNG::SSTUniformDistribution: 32 (m_nodeRng)
SST::RNG::SSTExponentialDistribution: 32 (m_delayRNg)
SST::Statistics::AccumulatorStatistic<uint64_t>: 384 (m_sendCount)
SST::Statistics::AccumulatorStatistic<uint64_t>: 384 (m_recvCount)
SST::Statistics::HistogramStatistic<uint64_t>: 464 (m_delays)
(Bins are stored in a map, so additional 3 * 8 bytes per bin.)
Subtotal heap allocated: : 1336
SST::Link: 64 (N * (N - 1) links total)
Other components:
SST::UnitAlgebra: 96 (statics TIMEBASE, m_average)
SST::TimeConverter: 8 (static m_timeConverter)
SST::Output: 144 (m_output, included in Phold)
SST::Core::ThreadSafe::Barrier: 64 (many instances in Simulator_impl)
std::atomic<bool>: 1 (used by Barrier)
std::atomic<std::size_t>: 8 (used by Barrier)
std::string: 32 (VERBOSE_PREFIX, included in Phold)
0:[0:0]:Phold-phold_0 [Phold() (Phold.cc:135)] -> [3] Initializing RNGs
0:[0:0]:Phold-phold_0 [Phold() (Phold.cc:152)] -> [3] Configuring links:
0:[0:0]:Phold-phold_0 [Phold() (Phold.cc:184)] -> [3] Initializing statistics
0:[0:0]:Phold-phold_0 [Phold() (Phold.cc:188)] -> [3] Setting stopat to 10 s
0:[0:0]:Phold-phold_1 [Phold() (Phold.cc:109)] -> [2] Full c'tor() @0x12e6c50, id: 1, name: phold_1
0:[0:0]:Phold-phold_1 [Phold() (Phold.cc:135)] -> [3] Initializing RNGs
0:[0:0]:Phold-phold_1 [Phold() (Phold.cc:152)] -> [3] Configuring links:
0:[0:0]:Phold-phold_1 [Phold() (Phold.cc:184)] -> [3] Initializing statistics
0:[0:0]:Phold-phold_1 [Phold() (Phold.cc:188)] -> [3] Setting stopat to 10 s
0:[0:0]:Phold-phold_1 [init() (Phold.cc:596)] -> [2] depth: 1, phase: 0, begin: 0, end: 1
0:[0:0]:Phold-phold_1 [init() (Phold.cc:601)] -> [3] checking for early events
0:[0:0]:Phold-phold_1 [checkForEvents() (Phold.cc:554)] -> [3] checking link 0
0:[0:0]:Phold-phold_1 [getEvent() (Phold.cc:541)] -> [3] getting event from link 0
0:[0:0]:Phold-phold_1 [getEvent() (Phold.cc:543)] -> [3] got (nil)
0:[0:0]:Phold-phold_1[quartz770:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
The segfault occurs in SST::Link::recvUntimedData():
Backtrace:
sst(SST::Link::recvUntimedData()+0x48)[0x8d5cd4]
/g/g17/barnes/Code/SST/profile/default/lib/sst-elements-library/libphold.so(Phold::InitEvent* Phold::Phold::getEvent<Phold::InitEvent>(unsigned long)+0x62)[0x2aaab7e4154e]
/g/g17/barnes/Code/SST/profile/default/lib/sst-elements-library/libphold.so(void Phold::Phold::checkForEvents<Phold::InitEvent>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0xc4)[0x2aaab7e41680]
/g/g17/barnes/Code/SST/profile/default/lib/sst-elements-library/libphold.so(Phold::Phold::init(unsigned int)+0x22b)[0x2aaab7e3913f]
sst(SST::Simulation_impl::initialize()+0x110)[0x8e7dce]
sst[0x825cd8]
sst(main+0x1ab7)[0x827964]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaad18c555]
sst[0x8245b9]
But I don't think that's the source of the problem: if I comment out every delete event; statement in my code, the run progresses further, but it still eventually seg faults.
3 - Diagnosis
I think the problem arises when the normal event lifetime pattern encounters event profiling.
Here is the "normal" pattern:
// From model code: create and send an event:
auto event = new Event(...);
link->send(delay, event);
// In simulator.cc:Simulation_impl::run():685
while ( LIKELY(!endSim) ) {
current_activity = timeVortex->pop();
currentSimCycle = current_activity->getDeliveryTime();
currentPriority = current_activity->getPriority();
current_activity->execute();
// In event.cc:Event::execute():59:
(*functor)(this);
// where functor is the event handler
// Finally, back in model code, the event handler
void ...::handleEvent(SST::Event *ev)
{
auto event = dynamic_cast<...*>(ev);
// Extract data from the event
// Done with the event, delete it, to match the new in the sending code
delete event;
}
Before profiling was introduced (or without it enabled), deleting the Event in the handler was harmless, if a little odd. It amounted to delete this; because Event::execute() did nothing else after the handler returned. See this isocpp FAQ.
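To make the lifetime argument concrete, here is a minimal, self-contained sketch of that pattern. MiniEvent, run_normal_pattern and the destroyed counter are hypothetical stand-ins invented for illustration; they are not the sst-core classes, and the real Event carries far more state.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for SST::Event; not the real class.
class MiniEvent {
public:
    using Handler = std::function<void(MiniEvent*)>;

    MiniEvent(Handler* handler, int payload) : functor_(handler), payload_(payload) {}
    ~MiniEvent() { ++destroyed; }

    int payload() const { return payload_; }

    // Without profiling, the handler call is the last thing execute() does,
    // so the handler may delete the event (in effect, "delete this").
    void execute() {
        (*functor_)(this);
        // Nothing after this point may touch *this.
    }

    static int destroyed;  // counts destructor calls, for demonstration only

private:
    Handler* functor_;
    int payload_;
};

int MiniEvent::destroyed = 0;

// Model-side code: create an event, deliver it, and let the handler free it.
// Returns the payload the handler extracted before deleting the event.
int run_normal_pattern(int payload) {
    int received = -1;
    MiniEvent::Handler handler = [&received](MiniEvent* ev) {
        received = ev->payload();  // extract data from the event
        delete ev;                 // matches the new in the sending code
    };
    MiniEvent* event = new MiniEvent(&handler, payload);  // "send"
    event->execute();                                     // "deliver"
    assert(received == payload);  // the handler read the data before the delete
    return received;
}
```

Calling run_normal_pattern(42) from a main() exercises one send and one delivery. The pattern is only safe because execute() never reads a member after the handler returns.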
But with profiling enabled, Event::execute() continues after calling the handler:
void
Event::execute(void)
{
#if SST_EVENT_PROFILING
SST_EVENT_PROFILE_START
#endif
(*functor)(this);
#if SST_EVENT_PROFILING
Simulation_impl* sim = Simulation_impl::getSimulation();
SST_EVENT_PROFILE_STOP
// Track sending and receiving counters
auto eventCount = sim->eventRecvCounters.find(getLastComponentName());
...
The killer is getLastComponentName(). By this point the handler has already deleted the event, so calling a member function of Event is bad enough, but it goes on to return a member variable, last_comp, which now lives in freed memory. Boom!
4 - Suggested fix
I suggest copying all required member variables into local variables before calling the functor. It appears this could be limited to the return values of getFirstComponentName() and getLastComponentName(), but really all the lookups and registrations could be done in the preamble, SST_EVENT_PROFILE_START.
I haven't looked at the other places affected by profiling to see if there are similar issues.
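The suggested fix could be sketched roughly as follows. This is an illustration under stated assumptions, not the sst-core patch: ProfiledEvent, recv_counters and deliver_and_count are hypothetical names, and the counter update uses a plain std::map rather than the real profiling macros or the simulator's eventRecvCounters.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for the simulator's per-component receive counters.
std::map<std::string, std::uint64_t> recv_counters;

// Hypothetical stand-in for SST::Event with profiling enabled.
class ProfiledEvent {
public:
    using Handler = std::function<void(ProfiledEvent*)>;

    ProfiledEvent(Handler* handler, std::string last_comp)
        : functor_(handler), last_comp_(std::move(last_comp)) {}

    const std::string& getLastComponentName() const { return last_comp_; }

    void execute() {
        // Preamble: copy (by value) everything the epilogue needs
        // while *this is still alive.
        const std::string last = getLastComponentName();

        (*functor_)(this);  // the handler may delete this event

        // Epilogue: use only locals and non-member state from here on.
        ++recv_counters[last];
    }

private:
    Handler* functor_;
    std::string last_comp_;
};

// Delivers one event whose handler deletes it, then returns the
// receive count recorded for the given component name.
std::uint64_t deliver_and_count(const std::string& component) {
    ProfiledEvent::Handler handler = [](ProfiledEvent* ev) { delete ev; };
    auto* event = new ProfiledEvent(&handler, component);
    event->execute();
    return recv_counters[component];
}
```

The local must be a copy, not a reference: a const std::string& bound to last_comp_ would dangle as soon as the handler deletes the event, which reproduces the original crash.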
5 - What Operating system(s) and versions
Two RHL variants, Mac
6 - What version of external libraries (Boost, MPI)
Not relevant, running sequentially.
7 - Provide sha1 of all relevant sst repositories (sst-core, sst-elements, etc)
sst-core: 83ddb8ca (with my patches) based on upstream df2df5f9
I've found a fix, see my wip branch at 9ba01bf.
This is sufficient, but it might not be minimal; at least it bounds the solution space.
@pdbj, I believe this patch is somewhat stale as well. It fails on src/sst/core/event.cc
@pdbj, the patch code in event.cc is still too far behind devel to merge. The upstream devel branch has diverged significantly. You might want to check the changes from https://github.com/sstsimulator/sst-core/commit/9b90e81834a123f13c097940aafece479e1c1c43
The commit https://github.com/sstsimulator/sst-core/commit/9b90e81834a123f13c097940aafece479e1c1c43 appears to be to master, not devel. In any case that commit makes trivial changes (Simulation::getSimulation --> Simulation_impl::getSimulation). My edits to event.cc also use the latter, and don't touch the two places https://github.com/sstsimulator/sst-core/commit/9b90e81834a123f13c097940aafece479e1c1c43 does, so there shouldn't be a merge conflict.
@pdbj, it's definitely a merge conflict. If you look at the current event.cc code on the devel branch, the references to SST_EVENT_PROFILE_REG do not exist. As a result, the standard git patch application functions fail to patch/merge the code.
Apologies, you actually need two commits:
then
Fix crash due to reference deleted event state
If I squash these now I'll have to update the rebase links in all the other issues, so I'd rather not...
@pdbj, that makes more sense. I'll try to pull both patches and incrementally merge them into the current devel tree.
@pdbj, one other thing to note when submitting patches/PRs: make sure to run the scripts/clang-format-test.sh script with clang-format-12 before committing changes. The PRs will fail the frontend tests if the code is not formatted using the embedded clang-format template.
merged into devel