Reported completed simulation time is 0s when cancelling simulation being run in serial

Open deanchester opened this issue 3 years ago • 1 comments

When you cancel SST (running without MPI or threads) during a run, it reports that the simulation is complete and that the simulated time is 0 s, even though the simulated time is not 0, like so:

 ^CEMERGENCY SHUTDOWN (0,0)!
 # Simulated time:                  1.06299 s
 EMERGENCY SHUTDOWN Complete (0,0)!
 Simulation is complete, simulated time: 0 s

I am using SST-Core commit 1c68395b30a836da477e98d43a42a6f92c5682e8 on OS X (High Sierra) compiled with Clang and OpenMPI 4.

I tried three different run configurations (MPI, serial, threaded). Here is the output when exiting mid-simulation:

MPI:

deangchester@Deans-MBP-17:models-sst-11/exascale_applications ‹master*›$ mpirun -np 4 sst tealeaf_problem3_HE_64.py
36
Allreduce: ranks 288, loop 1, 1 double(s), latency 16926.795 us
Allreduce: ranks 288, loop 1, 1 double(s), latency 31232.491 us
2DHalo: total time 6504.263 us, loop 1, latency 3252.131 us.
Allreduce: ranks 288, loop 1, 1 double(s), latency 26028.885 us
Allreduce: ranks 288, loop 1, 1 double(s), latency 31232.491 us
2DHalo: total time 6504.262 us, loop 1, latency 3252.131 us.
Allreduce: ranks 288, loop 1, 1 double(s), latency 26028.885 us
^C[Deans-MBP-17:17799] *** An error occurred in MPI_Allreduce
[Deans-MBP-17:17799] *** reported by process [4218290177,1]
[Deans-MBP-17:17799] *** on communicator MPI_COMM_WORLD
[Deans-MBP-17:17799] *** MPI_ERR_TRUNCATE: message truncated
[Deans-MBP-17:17799] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[Deans-MBP-17:17799] ***    and potentially your MPI job)
Allreduce: ranks 288, loop 1, 1 double(s), latency 31232.492 us
2DHalo: total time 6504.262 us, loop 1, latency 3252.131 us.
Allreduce: ranks 288, loop 1, 1 double(s), latency 26028.885 us
Allreduce: ranks 288, loop 1, 1 double(s), latency 31232.491 us
2DHalo: total time 6504.263 us, loop 1, latency 3252.131 us.
EMERGENCY SHUTDOWN (1,0)!
# Simulated time:                  252.551 ms

Serial:

deangchester@Deans-MBP-17:models-sst-11/exascale_applications ‹master*›$ sst tealeaf_problem3_HE_64.py
36
Allreduce: ranks 288, loop 1, 1 double(s), latency 16926.795 us
Allreduce: ranks 288, loop 1, 1 double(s), latency 31232.491 us
2DHalo: total time 6504.263 us, loop 1, latency 3252.131 us.
Allreduce: ranks 288, loop 1, 1 double(s), latency 26028.885 us
Allreduce: ranks 288, loop 1, 1 double(s), latency 31232.491 us
2DHalo: total time 6504.262 us, loop 1, latency 3252.131 us.
Allreduce: ranks 288, loop 1, 1 double(s), latency 26028.885 us
Allreduce: ranks 288, loop 1, 1 double(s), latency 31232.492 us
2DHalo: total time 6504.262 us, loop 1, latency 3252.131 us.
Allreduce: ranks 288, loop 1, 1 double(s), latency 26028.885 us
Allreduce: ranks 288, loop 1, 1 double(s), latency 31232.491 us
2DHalo: total time 6504.263 us, loop 1, latency 3252.131 us.
Allreduce: ranks 288, loop 1, 1 double(s), latency 26028.885 us
Allreduce: ranks 288, loop 1, 1 double(s), latency 31232.491 us
2DHalo: total time 6504.263 us, loop 1, latency 3252.131 us.
Allreduce: ranks 288, loop 1, 1 double(s), latency 26028.885 us
Allreduce: ranks 288, loop 1, 1 double(s), latency 31232.491 us
2DHalo: total time 6504.263 us, loop 1, latency 3252.131 us.
^CEMERGENCY SHUTDOWN (0,0)!
# Simulated time:                  1.06299 s
EMERGENCY SHUTDOWN Complete (0,0)!
Simulation is complete, simulated time: 0 s

Threaded:

deangchester@Deans-MBP-17:models-sst-11/exascale_applications ‹master*›$ sst -n 4 tealeaf_problem3_HE_64.py
36
Allreduce: ranks 288, loop 1, 1 double(s), latency 16926.795 us
Allreduce: ranks 288, loop 1, 1 double(s), latency 31232.491 us
2DHalo: total time 6504.263 us, loop 1, latency 3252.131 us.
Allreduce: ranks 288, loop 1, 1 double(s), latency 26028.885 us
Allreduce: ranks 288, loop 1, 1 double(s), latency 31232.491 us
2DHalo: total time 6504.262 us, loop 1, latency 3252.131 us.
Allreduce: ranks 288, loop 1, 1 double(s), latency 26028.885 us
^CEMERGENCY SHUTDOWN (0,0)!
# Simulated time:                  148.404 ms
^C

In the threaded run, SST did not exit, and I had to issue another interrupt to SST core to get it to exit. I waited approximately 3 minutes for it to exit before issuing the second interrupt.

This isn't a major problem; I just thought I'd raise it in case any other users come across a similar issue.

deanchester Mar 26 '21 12:03

What's happening here is that the serial run is able to run through all the "post-run" code, whereas the parallel runs are not. The simulated time up to the point of the interrupt is printed immediately, then the simulation tries to finish. Only the serial job can do that, so only it prints the end-of-simulation time (which is not reported correctly because of the interrupt). You'll notice that a couple of lines above "Simulation is complete, simulated time: 0 s", you have "# Simulated time: 1.06299 s". That's the time at the interrupt.

feldergast Apr 19 '21 15:04
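
The behaviour described in the reply above can be illustrated with a toy model. This is only a sketch under assumptions, not SST-Core's actual code: the class `ToySimulation`, its fields `current_time` and `end_time`, and both report methods are hypothetical names invented for this example. It assumes the final summary reads a separate "end time" value that is only set when the run loop finishes normally, which is one plausible way for the two printed times to disagree.

```python
class ToySimulation:
    """Hypothetical stand-in for a simulator; not SST's real classes."""

    def __init__(self):
        self.current_time = 0.0  # advances as the run loop makes progress
        self.end_time = 0.0      # assumed to be set only on a normal finish

    def run(self, steps, dt, interrupt_at=None):
        """Advance time in fixed steps; optionally stop early like Ctrl-C."""
        for step in range(steps):
            if interrupt_at is not None and step == interrupt_at:
                return  # early exit: end_time is never updated
            self.current_time += dt
        self.end_time = self.current_time

    def emergency_shutdown_report(self):
        # Printed at the moment of the interrupt, from the live clock.
        return f"# Simulated time: {self.current_time:g} s"

    def final_report(self):
        # Printed by the post-run code, from the (possibly unset) end time.
        return f"Simulation is complete, simulated time: {self.end_time:g} s"


sim = ToySimulation()
sim.run(steps=10, dt=0.25, interrupt_at=4)
print(sim.emergency_shutdown_report())
print(sim.final_report())
```

In this sketch the interrupted run reports the live clock in the first line and the untouched default in the second, mirroring the pair of lines in the serial output. In the parallel runs, by the explanation above, the post-run step is never reached, so the second line is never printed.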