# Wrong instruction ordering when simulating MiniMD with Ariel
## Explanation
When simulating MiniMD with Ariel, the simulation runs for several minutes and then reaches a point where the arielcore receives an instruction it does not expect. The error occurs when an unexpected instruction follows an `ARIEL_START_INSTRUCTION`.

After the core sees `ARIEL_START_INSTRUCTION`, it expects either `ARIEL_PERFORM_READ`, `ARIEL_PERFORM_WRITE`, or `ARIEL_END_INSTRUCTION`. Sometimes it will instead receive an `ARIEL_NOOP`, or an additional `ARIEL_START_INSTRUCTION`, crashing the simulation here. After examining `ariel/frontend/pin3/fesimple.cc`, it seems neither of those instruction orderings should be possible.
## Example
The crash is non-deterministic, but here is an example ordering that brings the simulation down. The unique IP bits are included.
| Instruction | IP |
| --- | --- |
| `ARIEL_START_INSTRUCTION` | `0x4b09` |
| `ARIEL_NOOP` | `0x2d82` (crash here) |
| `ARIEL_PERFORM_READ` | `0x4b09` |

Notably, a `READ` that seems to be associated with the unmatched `START` arrives after a `NOOP`.
After some tinkering, I also encountered a scenario in which the program crashes here because a `READ` arrives before its corresponding `START`. This seems to indicate that the error is not limited to `NOOP` instructions.

To me, this points to some sort of race condition, meaning this may be an issue with `ArielTunnel` or its parent classes.
## More Info
This is a single-threaded execution: SST is running on a single thread, and the traced program uses a single thread. The server is an Intel Ice Lake machine running RHEL 8.7. I'm running the MiniMD ref version. The commits I'm using are:
- SST-Elements https://github.com/plavin/sst-elements/commit/b6e9d9c761677b9caa24dc4a2c15c2a8fbe00d21 (branched from https://github.com/sstsimulator/sst-elements/commit/f5fcb56a3abb88dfaf661e0a31df21de7b172a55)
- SST-Core https://github.com/sstsimulator/sst-core/commit/e70e231f097c24c5f9d6b4ec5d1b3cd217a5f6a4
If someone could try replicating this, I would be very appreciative. The SST-Elements commit I linked to will output the instructions before and after the one that crashes the simulation.
The above was configured with Pin 3.23. I have confirmed that this persists with the most recent commits to the master branches:
- https://github.com/sstsimulator/sst-core/commit/7870ad6251c6ec68a3795ae1f5caa336f1090b27
- https://github.com/sstsimulator/sst-elements/commit/0c07b17363b702ec51486af9055762ff14a6fc5b
- Pin 3.26 (The newest supported version)
It seems that with an unmodified Ariel, the error always occurs because of an instruction appearing between a `START` and an `END` that shouldn't be there, such as another `START` or a `NOOP`. The other scenario I mentioned in the original post (a `READ` occurring before a `START`) seems to just be an artifact of my changes.
Perhaps this means the error is purely in the frontend code and not in the tunnel.
Dug into this a bit more. Replicated the issue with the following settings:
- ran on our testing server
- recent commits of elements and core
- `ariel_snb.py`
- MiniMD ref, openmpi, SIMD=no, DEBUG=yes
I traced calls to the tunnel and found the following data being written to it. In brackets you'll see the sequence of commands written; in parentheses, the issue. Legal instruction orderings, according to `arielcore.cc`, are `[32 (2, 4)*, 64]`, that is, `[START, (READ, WRITE)*, END]`. But we are seeing 128 (`NOOP`) follow a 32 before a 64, and a 32 follow a 32 before a 64. (While the core will allow for reads and writes as part of the same instruction, fesimple will only produce reads or writes.)
```
[32 128 4 128 64]   (128 between 32 and 64)
[32 32 2 4 64 64]   (two instructions interleaved)
[32 128 4 64]       (128 between 32 and 64)
```
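To make the legal grammar concrete, here is a small offline checker one could run over a logged opcode stream (an illustrative sketch; `checkOrdering` is a hypothetical helper, not part of SST, and the opcode values are the ones shown above):

```c++
// Sketch of an offline checker for a logged opcode stream; values match
// the traces above (START=32, READ=2, WRITE=4, END=64, NOOP=128).
#include <cstdio>
#include <vector>

bool checkOrdering(const std::vector<int>& ops) {
    bool open = false;  // true between a START (32) and its END (64)
    for (size_t i = 0; i < ops.size(); i++) {
        int op = ops[i];
        bool ok = true;
        switch (op) {
        case 32:        ok = !open; open = true;  break;  // START
        case 2: case 4: ok = open;                break;  // READ / WRITE
        case 64:        ok = open;  open = false; break;  // END
        case 128:       ok = !open;               break;  // NOOP
        default:        ok = false;               break;
        }
        if (!ok) {
            std::fprintf(stderr, "illegal opcode %d at position %zu\n", op, i);
            return false;
        }
    }
    return true;
}

int main() {
    // The first failing trace above: 128 (NOOP) between 32 and 64
    checkOrdering({32, 128, 4, 128, 64});
}
```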
So the issue is with how `tunnel->writeMessage` is called by the frontend. These instruction orderings should not be possible, based on how the frontend is written. Perhaps the bug is in Pin? Is it possible that the inserted calls to `writeMessage` somehow get rearranged?
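For context on what "inserted calls" means here: a Pin tool registers analysis routines per instruction, roughly like the simplified sketch below (standard Pin API; the `Record*` routines stand in for fesimple's functions that call `tunnel->writeMessage` — this is not the actual fesimple code):

```c++
// Simplified sketch of how a Pin tool inserts the analysis calls in question.
#include "pin.H"

VOID RecordStart(ADDRINT ip)  { /* writeMessage(ARIEL_START_INSTRUCTION, ip) */ }
VOID RecordRead(ADDRINT addr) { /* writeMessage(ARIEL_PERFORM_READ, addr)    */ }
VOID RecordEnd(ADDRINT ip)    { /* writeMessage(ARIEL_END_INSTRUCTION, ip)   */ }

VOID Instruction(INS ins, VOID* v) {
    if (!INS_IsMemoryRead(ins)) return;
    // Several analysis calls attached at the same point; Pin offers
    // IARG_CALL_ORDER to pin down their relative execution order.
    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordStart, IARG_INST_PTR, IARG_END);
    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordRead, IARG_MEMORYREAD_EA, IARG_END);
    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordEnd, IARG_INST_PTR, IARG_END);
}

int main(int argc, char* argv[]) {
    if (PIN_Init(argc, argv)) return 1;
    INS_AddInstrumentFunction(Instruction, nullptr);
    PIN_StartProgram();  // never returns
    return 0;
}
```

Even if each triple executes in order within one thread, nothing in a scheme like this serializes records from different threads writing to a shared tunnel, which would match the interleaved `[32 32 2 4 64 64]` trace above.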
Since I can monitor what is being written to the tunnel, I can identify the moment an error occurs. Would it be possible to get a stack trace from MiniMD at that point so I can see which function is causing the issue?
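One way to get that stack trace, without attaching gdb at exactly the right moment, might be from inside the tool itself: Pin can unwind the application's stack from an analysis routine. A sketch (`DumpAppBacktrace` is a hypothetical helper; it assumes the detection point receives `IARG_CONST_CONTEXT` and that `PIN_InitSymbols()` was called during startup so names resolve):

```c++
// Hypothetical helper: dump the traced application's call stack at the
// point where a bad ordering is detected.
#include "pin.H"
#include <cstdio>
#include <string>

static VOID DumpAppBacktrace(const CONTEXT* ctxt) {
    void* frames[32];
    INT32 depth = PIN_Backtrace(ctxt, frames, 32);
    for (INT32 i = 0; i < depth; i++) {
        PIN_LockClient();  // RTN lookups need the client lock in analysis code
        std::string name = RTN_FindNameByAddress((ADDRINT)frames[i]);
        PIN_UnlockClient();
        std::fprintf(stderr, "#%d %p %s\n", i, frames[i],
                     name.empty() ? "??" : name.c_str());
    }
}
```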
This seems to be related to including `mpi.h` in the program, even if it isn't launched with mpirun. I ran into this issue again with the following program:
```c++
#include <cstdio>
#include <cstdlib>
#include <mpi.h>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    int ret = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (ret != MPI_SUCCESS) {
        printf("Error: MPI_Comm_rank returned error: %d\n", ret);
        exit(1);
    }
    printf("Hello from rank %d!", rank);
    for (int i = 1; i < argc; i++) {
        printf(" -- %s", argv[i]);
    }
    printf("\n");
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) {
        printf("Rank 0: Barrier complete.\n");
    }
    MPI_Finalize();
}
```
If you remove the MPI components, the error goes away.
Here is the SDL file for completeness:
```python
import sst

#########################################################################
## Define SST core options
#########################################################################

# If this demo gets to 100ms, something has gone very wrong!
sst.setProgramOption("stop-at", "100ms")

#########################################################################
## Declare components
#########################################################################

core = sst.Component("core", "ariel.ariel")
cache = sst.Component("cache", "memHierarchy.Cache")
memctrl = sst.Component("memory", "memHierarchy.MemController")

#########################################################################
## Set component parameters and fill subcomponent slots
#########################################################################

# Core: 2.4GHz, tracing the MPI test program ./hello
core.addParams({
    "clock" : "2.4GHz",
    "verbose" : 1,
    #"executable" : "./hello-nompi"
    "executable" : "./hello"
})

# Cache: L1, 2.4GHz, 2KB, 4-way set associative, 64B lines, LRU replacement, MESI coherence
cache.addParams({
    "L1" : 1,
    "cache_frequency" : "2.4GHz",
    "access_latency_cycles" : 2,
    "cache_size" : "2KiB",
    "associativity" : 4,
    "replacement_policy" : "lru",
    "coherence_policy" : "MESI",
    "cache_line_size" : 64,
})

# Memory: 50ns access, 1GB
memctrl.addParams({
    "clock" : "1GHz",
    "backing" : "none", # We're not using real memory values, just addresses
    "addr_range_end" : 1024*1024*1024-1,
})
memory = memctrl.setSubComponent("backend", "memHierarchy.simpleMem")
memory.addParams({
    "mem_size" : "1GiB",
    "access_time" : "50ns",
})

#########################################################################
## Declare links
#########################################################################

core_cache = sst.Link("core_to_cache")
cache_mem = sst.Link("cache_to_memory")

#########################################################################
## Connect components with the links
#########################################################################

core_cache.connect( (core, "cache_link_0", "100ps"), (cache, "high_network_0", "100ps") )
cache_mem.connect( (cache, "low_network_0", "100ps"), (memctrl, "direct_link", "100ps") )

################################ The End ################################
```
I ran `strace -f` on `./hello` and `./hello-nompi` and saw that the `clone` system call is made in the former but not in the latter.
```
$ cat hello.strace.out | grep clone
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f86d3dbfed0) = 2176972
[pid 2176972] clone(child_stack=0x7f6538d74fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[2176973], tls=0x7f6538d75700, child_tidptr=0x7f6538d759d0) = 2176973
[pid 2176972] clone(child_stack=0x7f6538165fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[2176974], tls=0x7f6538166700, child_tidptr=0x7f65381669d0) = 2176974
[pid 2176972] clone(child_stack=0x7f6537147fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[2176975], tls=0x7f6537148700, child_tidptr=0x7f65371489d0) = 2176975
[pid 2176971] clone(child_stack=0x7f86d0b3cfb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[2176976], tls=0x7f86d0b3d700, child_tidptr=0x7f86d0b3d9d0) = 2176976
[pid 2176971] clone(child_stack=0x7f86c95e1fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[2176977], tls=0x7f86c95e2700, child_tidptr=0x7f86c95e29d0) = 2176977
```
I wonder if it would be useful to attach the pin tool after MPI_Init has already been called. We can look for the return of MPI_Init, then attach.
Proposed procedure (a rough sketch follows the list):
- Add a function, `Ariel_MPI_Init`, to the Ariel API that will be called immediately after MPI_Init returns (in the future, we can consider a more seamless way to do this, such as catching PMPI_Init or using LD_PRELOAD).
- `Ariel_MPI_Init` can accept the rank of each calling process and the rank that will be traced (in the future, we want to be able to specify the traced rank in the SDL file).
- For the rank that will be traced, fork a pin process and tell it to trace the PID of that rank. (Caveat: for now, we are assuming everything runs on a single node.)
- Open question: how does the pin process know which tunnel to attach to?
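Here is what the forking step could look like (entirely hypothetical: `Ariel_MPI_Init` does not exist in Ariel yet, the tool path is a placeholder, and the tunnel-discovery question above is left open; `pin -pid <pid>` is Pin's standard attach mechanism):

```c++
// Hypothetical sketch of the proposed Ariel_MPI_Init hook; none of these
// names exist in Ariel today.
#include <cstdio>
#include <sys/types.h>
#include <unistd.h>

extern "C" void Ariel_MPI_Init(int my_rank, int traced_rank) {
    if (my_rank != traced_rank)
        return;  // only the traced rank gets a pin process attached
    pid_t target = getpid();
    if (fork() == 0) {
        // Child: attach pin to the traced rank, which has already finished
        // MPI_Init. The tool path and any tunnel key are placeholders.
        char pid_arg[16];
        std::snprintf(pid_arg, sizeof(pid_arg), "%d", (int)target);
        execlp("pin", "pin", "-pid", pid_arg, "-t", "fesimple.so", (char*)nullptr);
        _exit(127);  // exec failed
    }
}
```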
This issue will be fixed once https://github.com/sstsimulator/sst-elements/commit/40c140f38fdb0c60280ddb649f0c37ffa2dad4f3 is merged.
The solution is to add a callback for forks using `PIN_AddForkFunction` and to call `PIN_Disable()` from the child process.
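For reference, the registration looks roughly like this (`PIN_AddForkFunction` and `FPOINT_AFTER_IN_CHILD` are standard Pin APIs; the callback body is paraphrased from the description above rather than copied from the commit):

```c++
// Sketch of the fix: after a fork, the child (e.g., an MPI helper process)
// must stop emitting records into the shared tunnel.
#include "pin.H"

static VOID AfterForkInChild(THREADID tid, const CONTEXT* ctxt, VOID* arg) {
    PIN_Disable();  // the disable call described above; the child traces nothing further
}

int main(int argc, char* argv[]) {
    if (PIN_Init(argc, argv)) return 1;
    PIN_AddForkFunction(FPOINT_AFTER_IN_CHILD, AfterForkInChild, nullptr);
    // ... existing fesimple setup: instrumentation, tunnel attach, etc. ...
    PIN_StartProgram();
    return 0;
}
```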