sst-core
sst-core copied to clipboard
--enable-perf-tracking --enable-profile crashes
[Edit to fix phold make command]
New Issue for sst-core
1 - Detailed description of problem or enhancement
With --enable-perf-tracking --enable-profile
during configure several of my models crash with seg faults. I have not been able to reproduce with the car wash example in simpleSimulation. I can reproduce with a simple PHOLD model (see below).
2 - Describe how to reproduce the issue Use my PHOLD model
- First, configure, build and install sst-core: 1.a Patch upstream 83ddb8ca with these commits
(Or 1.a' Use my wip branch)
1.b Configure, build and install sst-elements (so we can get the library directory by sst-config SST_ELEMENT_LIBRARY SST_ELEMENT_LIBRARY_LIBDIR
)
- Get and build my PHOLD:
$ git clone https://github.com/pdbj/sst-phold
$ cd sst-phold
$ make debug install
# Check that it installed correctly:
$ sst-info phold
PROCESSED 1 .so (SST ELEMENT) FILES FOUND IN DIRECTORY(s) /Users/barnes26/Code/SST/build/profile/lib/sstcore:/usr/local/lib/sst-elements-library
Filtering output on Element = "phold"
================================================================================
ELEMENT 0 = phold ()
Num Components = 2
Component 0: Phold
...
- Run with high verbosity:
$ sst tests/phold.py -- -vvv
phold.py : Creating PHOLD Benchmark
phold.py : Importing SST module
phold.py : Creating 2 LPs
..
phold.py : Creating complete graph with latency 1 s (1 total)
.
phold.py : Enabling statistics at level 1
phold.py : Done
0:[0:0]:Phold-phold_0 [Phold() (Phold.cc:109)] -> [2] Full c'tor() @0x12479c0, id: 0, name: phold_0
0:[0:0]:Phold-phold_0 [ShowConfiguration() (Phold.cc:271)] -> [2]
0:[0:0]:Phold-phold_0 [ShowConfiguration() (Phold.cc:278)] -> [3] TIMEFACTOR: 0.000001, timeConverter factor: 1000000, period: 1 us (0.000001 s?)
0:[0:0]:Phold-phold_0 [ShowConfiguration() (Phold.cc:290)] -> [3] min: 1 s, duty: 100 m, df: 0.100000
0:[0:0]:Phold-phold_0 [ShowConfiguration() (Phold.cc:296)] -> [3] m_ev: 1, ev_win: 0.100000, min_ev_win: 10.000000, min_ev: 100
0:[0:0]:Phold-phold_0 [ShowConfiguration()] -> PHOLD Configuration:
Remote LP fraction: 0.9
Minimum inter-event delay: 1 s
Additional exponential average delay: 9 s
Stop time: 10 s
Number of LPs: 2
Number of initial events per LP: 1
Average events per window: 0.1
(Too low! Suggest setting '--events=100')
Expected total number of events: 2
Output delay histogram: no
Sampling: rng
Optimization level: debug
Verbosity level: 3
0:[0:0]:Phold-phold_0 [ShowConfiguration()] -> SST Configuration:
Rank, thread: 0, 0
Total ranks, threads: 1, 1
Run mode: BOTH
0:[0:0]:Phold-phold_0 [ShowSizes() (Phold.cc:360)] -> [2]
0:[0:0]:Phold-phold_0 [ShowSizes()] -> Sizes of objects:
Phold: 432 (class instance)
Plus heap allocated:
SST::RNG::MersenneRNG: 24 (m_rng)
SST::RNG::MarsagliaRNG: 16 (m_remRng)
SST::RNG::SSTUniformDistribution: 32 (m_nodeRng)
SST::RNG::SSTExponentialDistribution: 32 (m_delayRNg)
SST::Statistics::AccumulatorStatistic<uint64_t>: 384 (m_sendCount)
SST::Statistics::AccumulatorStatistic<uint64_t>: 384 (m_recvCount)
SST::Statistics::HistogramStatistic<uint64_t>: 464 (m_delays)
(Bins are stored in a map, so additional 3 * 8 bytes per bin.)
Subtotal heap allocated: : 1336
SST::Link: 64 (N * (N - 1) links total)
Other components:
SST::UnitAlgebra: 96 (statics TIMEBASE, m_average)
SST::TimeConverter: 8 (static m_timeConverter)
SST::Output: 144 (m_output, included in Phold)
SST::Core::ThreadSafe::Barrier: 64 (many instances in Simulator_impl)
std::atomic<bool>: 1 (used by Barrier)
std::atomic<std::size_t>: 8 (used by Barrier)
std::string: 32 (VERBOSE_PREFIX, included in Phold)
0:[0:0]:Phold-phold_0 [Phold() (Phold.cc:135)] -> [3] Initializing RNGs
0:[0:0]:Phold-phold_0 [Phold() (Phold.cc:152)] -> [3] Configuring links:
0:[0:0]:Phold-phold_0 [Phold() (Phold.cc:184)] -> [3] Initializing statistics
0:[0:0]:Phold-phold_0 [Phold() (Phold.cc:188)] -> [3] Setting stopat to 10 s
0:[0:0]:Phold-phold_1 [Phold() (Phold.cc:109)] -> [2] Full c'tor() @0x12e6c50, id: 1, name: phold_1
0:[0:0]:Phold-phold_1 [Phold() (Phold.cc:135)] -> [3] Initializing RNGs
0:[0:0]:Phold-phold_1 [Phold() (Phold.cc:152)] -> [3] Configuring links:
0:[0:0]:Phold-phold_1 [Phold() (Phold.cc:184)] -> [3] Initializing statistics
0:[0:0]:Phold-phold_1 [Phold() (Phold.cc:188)] -> [3] Setting stopat to 10 s
0:[0:0]:Phold-phold_1 [init() (Phold.cc:596)] -> [2] depth: 1, phase: 0, begin: 0, end: 1
0:[0:0]:Phold-phold_1 [init() (Phold.cc:601)] -> [3] checking for early events
0:[0:0]:Phold-phold_1 [checkForEvents() (Phold.cc:554)] -> [3] checking link 0
0:[0:0]:Phold-phold_1 [getEvent() (Phold.cc:541)] -> [3] getting event from link 0
0:[0:0]:Phold-phold_1 [getEvent() (Phold.cc:543)] -> [3] got (nil)
0:[0:0]:Phold-phold_1[quartz770:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
The segfault occurs in SST::Link::recvUntimedData()
:
Backtrace:
sst(SST::Link::recvUntimedData()+0x48)[0x8d5cd4]
/g/g17/barnes/Code/SST/profile/default/lib/sst-elements-library/libphold.so(Phold::InitEvent* Phold::Phold::getEvent<Phold::InitEvent>(unsigned long)+0x62)[0x2aaab7e4154e]
/g/g17/barnes/Code/SST/profile/default/lib/sst-elements-library/libphold.so(void Phold::Phold::checkForEvents<Phold::InitEvent>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0xc4)[0x2aaab7e41680]
/g/g17/barnes/Code/SST/profile/default/lib/sst-elements-library/libphold.so(Phold::Phold::init(unsigned int)+0x22b)[0x2aaab7e3913f]
sst(SST::Simulation_impl::initialize()+0x110)[0x8e7dce]
sst[0x825cd8]
sst(main+0x1ab7)[0x827964]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaad18c555]
sst[0x8245b9]
But I don't think that's the source of the problem: if I comment out all delete event
in my code, it progresses further, but still eventually seg faults.
3 - Diagnosis
I think the problem arises when the normal event lifetime pattern encounters event profiling.
Here is the "normal" pattern:
// From model code: create and send an event:
auto event = new Event(...);
link->send(delay, event);
// In simulator.cc:Simulation_impl::run():685
while ( LIKELY(!endSim) ) {
current_activity = timeVortex->pop();
currentSimCycle = current_activity->getDeliveryTime();
currentPriority = current_activity->getPriority();
current_activity->execute();
// In event.cc:Event::execute():59:
(*functor)
(this);
// where functor is the event handler
// Finally, back in model code, the event handler
void ...::handleEvent(SST::Event *ev)
{
auto event = dynamic_cast<...*>(ev);
// Extract data from the event
// Done with the event, delete it, to match the new in the sending code
delete event;
}
Before profiling was introduced (or without it enabled) deleting the Event in the handler was harmless, if a little odd. It amounted to delete this;
. See this isocpp FAQ
But with profiling enabled, Event::execute()
continues after calling the handler:
void
Event::execute(void)
{
#if SST_EVENT_PROFILING
SST_EVENT_PROFILE_START
#endif
(*functor)
(this);
#if SST_EVENT_PROFILING
Simulation_impl* sim = Simulation_impl::getSimulation();
SST_EVENT_PROFILE_STOP
// Track sending and receiving counters
auto eventCount = sim->eventRecvCounters.find(getLastComponentName());
...
The killer is getLastComponentName()
, a member function of Event
, which is bad enough, but it goes on to return a member variable, last_comp
. Boom!
4 - Suggested fix
Suggest pulling all required member variables in to local variables before calling the functor. It appears this could be limited to the return values of getFirstComponentName()
and getLastComponentName()
, but really all the lookups and registrations could be done in the preamble SST_EVENT_PROFILE_START
I haven't looked at the other places affected by profiling to see if there are similar issues.
5 - What Operating system(s) and versions Two RHL variants, Mac
6 - What version of external libraries (Boost, MPI) Not relevant, running sequentially.
7 - Provide sha1 of all relevant sst repositories (sst-core, sst-elements, etc) sst-core: 83ddb8ca (with my patches) based on upstream df2df5f9
I've found a fix, see my wip branch at 9ba01bf.
This is sufficient, but it might not be minimal; at least it bounds the solution space.
@pdbj, I believe this patch is somewhat stale as well. It fails on src/sst/core/event.cc
@pdbj, the patch code in event.cc
is still too far behind devel
to merge. The upstream devel
branch has diverged significantly. You might want to check the changes from https://github.com/sstsimulator/sst-core/commit/9b90e81834a123f13c097940aafece479e1c1c43
The commit https://github.com/sstsimulator/sst-core/commit/9b90e81834a123f13c097940aafece479e1c1c43 appears to be to master, not devel. In any case that commit makes trivial changes (Simulation::getSimulation
--> Simulation_impl::getSimulation
). My edits to event.cc
also use the latter, and don't touch the two places https://github.com/sstsimulator/sst-core/commit/9b90e81834a123f13c097940aafece479e1c1c43 does, so there shouldn't be a merge conflict.
@pdbj, its definitely a merge conflict. If you look at the current event.cc code on the devel
branch, the references to SST_EVENT_PROFILE_REG
do not exist. As a result, the standard git patch application functions fail to successfully patch/merge the code.
Apologies, you actually need two commits:
then
Fix crash due to reference deleted event state
If I squash these now I'll have to update the rebase links in all the other issues, so I'd rather not...
@pdbj, that makes more sense. I'll try to pull both patches and incrementally merge them into the current devel tree.
@pdbj, one other thing to note when submitting patches/PRs. Make sure and run the scripts/clang-format-test.sh
script with clang-format-12 prior to committing changes. The PRs will fail the frontend tests if the code is not formatted using the embedded clang format template.
merged into devel