Statistics::StatisticsProcessingEngine is not thread safe
1 - Detailed description of problem or enhancement
- When a statistic is registered with a `stopat` parameter, the StatisticProcessingEngine adds it to a OneShot handler list so that all stats with that stop time are disabled at the same time.
- This OneShot is added to the event queue of the calling thread, i.e., the first thread to register a stat with that particular `stopat` value.
- The StatisticProcessingEngine is a singleton, so all subsequent stat registrations with the same `stopat` time end up on that single OneShot, even when the parent components registering the stats are handled by different threads.
- When the thread that created the OneShot reaches it, all associated stats are disabled, including those on components owned by other threads.
- If those other threads are behind in simulation time, their stats are disabled early.
2 - Describe how to reproduce the issue
- Create a statistic in each of several components, say an Accumulator.
- Set the same stop time for each stat from the component constructor, e.g. 10s.
- Run with `--num_threads=2` under a debugger.
- Set three breakpoints:
  - `statengine.cc:413` in `StatisticProcessingEngine::setStatisticStopTime()`
  - `statengine.cc:421` in `StatisticProcessingEngine::setStatisticStopTime()`
  - `statengine.cc:574` in `StatisticProcessingEngine::handleStatisticEngineStopTimeEvent()`

The first breakpoint is hit exactly once: only a single OneShot is registered with the simulator, even though stats are registered from both threads. The second breakpoint is hit for every registered stat in both threads, showing that all stats are added to the same OneShot. When the OneShot fires, the third breakpoint is hit for every registered stat in both threads, even though only one thread has reached the stop time.
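The debugger session for the breakpoints above looks like this (line numbers are valid at the sha given in section 5 and will drift in later revisions):

```
$ gdb --args sst --num_threads=2 tests/phold.py
(gdb) break statengine.cc:413
(gdb) break statengine.cc:421
(gdb) break statengine.cc:574
(gdb) run
```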
Because the outcome depends on wall-clock execution-time differences between threads during the sync window in which the OneShot fires, the bug manifests stochastically.
For example, using my Phold model:

```sh
$ for ((i=0; i<10; ++i)) ; do sst --num_threads=2 tests/phold.py -- -e 14 | grep error ; done
10059869000:[0:0]:Phold-phold_0 [complete()] -> Grand total sends: 33, receives: 32, error: 1
10059869000:[0:0]:Phold-phold_0 [complete()] -> Grand total sends: 33, receives: 33, error: 0
10059869000:[0:0]:Phold-phold_0 [complete()] -> Grand total sends: 33, receives: 32, error: 1
10059869000:[0:0]:Phold-phold_0 [complete()] -> Grand total sends: 33, receives: 33, error: 0
10059869000:[0:0]:Phold-phold_0 [complete()] -> Grand total sends: 33, receives: 33, error: 0
10059869000:[0:0]:Phold-phold_0 [complete()] -> Grand total sends: 33, receives: 33, error: 0
10059869000:[0:0]:Phold-phold_0 [complete()] -> Grand total sends: 33, receives: 32, error: 1
10059869000:[0:0]:Phold-phold_0 [complete()] -> Grand total sends: 33, receives: 32, error: 1
10059869000:[0:0]:Phold-phold_0 [complete()] -> Grand total sends: 33, receives: 33, error: 0
10059869000:[0:0]:Phold-phold_0 [complete()] -> Grand total sends: 33, receives: 32, error: 1
```
In this model the bug shows up as total sends not matching total receives, because some of the Accumulators were disabled prematurely. With 14 initial events in each of the two Phold components, the error appears in about half of the runs; with fewer events it is rarer, and with more it is more common.
3 - What Operating system(s) and versions
NA
4 - What version of external libraries (Boost, MPI)
NA
5 - Provide sha1 of all relevant sst repositories (sst-core, sst-elements, etc)
Present in SST master @df2df5f9
There doesn't appear to be anything more recent which would change this behavior.
6 - Fill out Labels, Milestones, and Assignee fields as best possible