Wrong instruction ordering when simulating MiniMD with Ariel

Open plavin opened this issue 1 year ago • 8 comments

Explanation

When simulating MiniMD with Ariel, the simulation runs for several minutes and then reaches a point where the arielcore receives an instruction it does not expect. The error is triggered when an unexpected command arrives after an ARIEL_START_INSTRUCTION.

After the core sees ARIEL_START_INSTRUCTION, it expects either ARIEL_PERFORM_READ, ARIEL_PERFORM_WRITE or ARIEL_END_INSTRUCTION. Sometimes, it will instead receive an ARIEL_NOOP, or an additional ARIEL_START_INSTRUCTION, crashing the simulation here. After examining ariel/frontend/pin3/fesimple.cc, it seems it should not be possible for either of those instruction orderings to occur.

Example

Which command triggers the crash is non-deterministic, but here is an example ordering that crashes the simulation. The unique IP bits are included.

Instruction                IP
ARIEL_START_INSTRUCTION    0x4b09
ARIEL_NOOP                 0x2d82   # Crash here
ARIEL_PERFORM_READ         0x4b09

Notably, a READ that seems to be associated with the unmatched START arrives after a NOOP.

After some tinkering, I also encountered a scenario in which the program crashes here because a READ arrives before its corresponding START. This seems to indicate that the error is not limited to NOOP instructions.

This to me indicates some sort of race condition, meaning this may be an issue with ArielTunnel or its parent classes.

More Info

This is a single-threaded execution: SST is running on a single thread, and the traced program is using a single thread. The server is an Intel Ice Lake server running RHEL 8.7. I'm running the MiniMD ref version. The commits I'm using are:

  • SST-Elements https://github.com/plavin/sst-elements/commit/b6e9d9c761677b9caa24dc4a2c15c2a8fbe00d21 (branched from https://github.com/sstsimulator/sst-elements/commit/f5fcb56a3abb88dfaf661e0a31df21de7b172a55)
  • SST-Core https://github.com/sstsimulator/sst-core/commit/e70e231f097c24c5f9d6b4ec5d1b3cd217a5f6a4

If someone could try replicating this, I would be very appreciative. The SST-Elements commit I linked to will output the instructions before and after the one that crashes the simulation.

plavin avatar Apr 21 '23 22:04 plavin

The above was configured with Pin 3.23. I have confirmed that this persists with the most recent commits to the master branches:

  • https://github.com/sstsimulator/sst-core/commit/7870ad6251c6ec68a3795ae1f5caa336f1090b27
  • https://github.com/sstsimulator/sst-elements/commit/0c07b17363b702ec51486af9055762ff14a6fc5b
  • Pin 3.26 (The newest supported version)

plavin avatar Apr 23 '23 01:04 plavin

It seems that if Ariel is unmodified, the error always occurs because of an instruction occurring in between a START and an END that shouldn't be there, such as another START or a NOOP. The other scenario I mentioned in the original post (about a READ occurring before a START) seems to just be an artifact of my changes.

Perhaps this means the error is just in the frontend code and not in the tunnel.

plavin avatar Apr 23 '23 02:04 plavin

Dug into this a bit more. Replicated the issue with the following settings:

  1. ran on our testing server
  2. recent commits of elements and core
  3. ariel_snb.py
  4. MiniMD ref, openmpi, SIMD=no, DEBUG=yes

I traced calls to the tunnel. I found the following data was written to the tunnel. In the brackets you'll see the sequence of commands written, and in parentheses you see the issue. Legal instruction orderings, according to arielcore.cc, are [32 (2, 4)*, 64]. That is, [START, (READ, WRITE)*, END]. But we are seeing 128 (NOOP) following 32 before a 64, and seeing 32 follow a 32 before a 64. (While the core will allow for reads and writes as part of the same instruction, fesimple will only produce reads or writes.)

[ 32 128   4 128  64]   (128 between 32 and 64)
[32 32  2  4 64 64]     (two instructions interleaved)
[ 32 128   4  64]       (128 between 32 and 64)

So the issue is with how tunnel->writeMessage is called by the frontend. These instruction orderings should not be possible, based on how the frontend is written. Perhaps the bug is in Pin?
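The ordering rule above can be written down as a small state machine, which is handy for scanning a dump of tunnel writes. This is only a sketch: the numeric codes are the ones reported in this comment, the function name first_violation is made up for illustration, and commands that arrive outside a START/END pair are deliberately left unchecked.

```python
# Command codes as reported in this comment.
START, READ, WRITE, END, NOOP = 32, 2, 4, 64, 128


def first_violation(commands):
    """Return the index of the first command that breaks
    [START, (READ | WRITE)*, END], or None if no violation is found.

    Assumption: only commands between a START and its END are checked;
    anything outside an instruction is ignored by this sketch.
    """
    in_instruction = False
    for i, cmd in enumerate(commands):
        if in_instruction:
            if cmd == END:
                in_instruction = False
            elif cmd not in (READ, WRITE):
                return i  # e.g. a NOOP or a second START
        elif cmd == START:
            in_instruction = True
    return None


# The three sequences observed in the tunnel trace.
observed = [
    [32, 128, 4, 128, 64],
    [32, 32, 2, 4, 64, 64],
    [32, 128, 4, 64],
]
for seq in observed:
    print(seq, "-> first violation at index", first_violation(seq))
```

All three observed sequences are flagged at index 1, i.e. at the command immediately following the START.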

plavin avatar Aug 26 '23 17:08 plavin

Is it possible that the inserted calls to writeMessage get rearranged somehow?

If I were to monitor what was being written to the tunnel, I could identify when an error occurs. Would it be possible to get a stack trace from miniMD at this point so I can see which function is causing the issue?

plavin avatar Aug 26 '23 17:08 plavin

This seems to be related to including mpi.h in the program, even if it isn't launched with mpirun. I ran into this issue again with the following program:

#include <cstdio>
#include <cstdlib>
#include <mpi.h>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    int ret  = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (ret != MPI_SUCCESS) {
        printf("Error: MPI_Comm_rank returned error: %d\n", ret);
        exit(1);
    }
    printf("Hello from rank %d!", rank);

    for (int i = 1; i < argc; i++) {
        printf(" -- %s", argv[i]);
    }
    printf("\n");

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) {
        printf("Rank 0: Barrier complete.\n");
    }

    MPI_Finalize();
}

If you remove the MPI components, the error goes away.

Here is the sdl file for completeness:

import sst

#########################################################################
## Define SST core options
#########################################################################
# If this demo gets to 100ms, something has gone very wrong!
sst.setProgramOption("stop-at", "100ms")


#########################################################################
## Declare components
#########################################################################
core = sst.Component("core", "ariel.ariel")
cache = sst.Component("cache", "memHierarchy.Cache")
memctrl = sst.Component("memory", "memHierarchy.MemController")

#########################################################################
## Set component parameters and fill subcomponent slots
#########################################################################
# Core: Ariel at 2.4GHz, tracing the ./hello executable
core.addParams({
    "clock" : "2.4GHz",
    "verbose" : 1,
    #"executable" : "./hello-nompi"
    "executable" : "./hello"
})


# Cache: L1, 2.4GHz, 2KB, 4-way set associative, 64B lines, LRU replacement, MESI coherence
cache.addParams({
    "L1" : 1,
    "cache_frequency" : "2.4GHz",
    "access_latency_cycles" : 2,
    "cache_size" : "2KiB",
    "associativity" : 4,
    "replacement_policy" : "lru",
    "coherence_policy" : "MESI",
    "cache_line_size" : 64,
})

# Memory: 50ns access, 1GB
memctrl.addParams({
    "clock" : "1GHz",
    "backing" : "none", # We're not using real memory values, just addresses
    "addr_range_end" : 1024*1024*1024-1,
})
memory = memctrl.setSubComponent("backend", "memHierarchy.simpleMem")
memory.addParams({
    "mem_size" : "1GiB",
    "access_time" : "50ns",
})

#########################################################################
## Declare links
#########################################################################
core_cache = sst.Link("core_to_cache")
cache_mem = sst.Link("cache_to_memory")


#########################################################################
## Connect components with the links
#########################################################################
core_cache.connect( (core, "cache_link_0", "100ps"), (cache, "high_network_0", "100ps") )
cache_mem.connect( (cache, "low_network_0", "100ps"), (memctrl, "direct_link", "100ps") )

################################ The End ################################

plavin avatar Apr 02 '24 22:04 plavin

I ran strace -f on ./hello and ./hello-nompi and saw that the clone system call is called in the former and not in the latter.

plavin avatar Apr 02 '24 22:04 plavin

$ cat hello.strace.out | grep clone
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f86d3dbfed0) = 2176972
[pid 2176972] clone(child_stack=0x7f6538d74fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[2176973], tls=0x7f6538d75700, child_tidptr=0x7f6538d759d0) = 2176973
[pid 2176972] clone(child_stack=0x7f6538165fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[2176974], tls=0x7f6538166700, child_tidptr=0x7f65381669d0) = 2176974
[pid 2176972] clone(child_stack=0x7f6537147fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[2176975], tls=0x7f6537148700, child_tidptr=0x7f65371489d0) = 2176975
[pid 2176971] clone(child_stack=0x7f86d0b3cfb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[2176976], tls=0x7f86d0b3d700, child_tidptr=0x7f86d0b3d9d0) = 2176976
[pid 2176971] clone(child_stack=0x7f86c95e1fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[2176977], tls=0x7f86c95e2700, child_tidptr=0x7f86c95e29d0) = 2176977
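The flags in the output above separate two kinds of clone calls: the first one has no CLONE_VM or CLONE_THREAD and ends in SIGCHLD, which is how a fork() shows up in strace, while the others carry CLONE_THREAD and create threads. A rough sketch for classifying such lines follows; the helper name classify_clone and the regular expression are illustrative, and the two sample lines are copied from the output above.

```python
import re

# Capture the flags=... field of a clone() line in strace output.
CLONE_RE = re.compile(r"clone\(.*?flags=([A-Z_|]+)")


def classify_clone(line):
    """Return 'thread', 'process', or None if the line is not a clone call.

    Assumption: a clone with CLONE_THREAD creates a thread; any other
    clone is treated as a new process (fork-like).
    """
    match = CLONE_RE.search(line)
    if match is None:
        return None
    flags = set(match.group(1).split("|"))
    return "thread" if "CLONE_THREAD" in flags else "process"


fork_line = (
    "clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, "
    "child_tidptr=0x7f86d3dbfed0) = 2176972"
)
thread_line = (
    "[pid 2176972] clone(child_stack=0x7f6538d74fb0, flags=CLONE_VM|CLONE_FS|"
    "CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|"
    "CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[2176973], "
    "tls=0x7f6538d75700, child_tidptr=0x7f6538d759d0) = 2176973"
)

print(classify_clone(fork_line))
print(classify_clone(thread_line))
```

So even though neither SST nor the traced program was configured to use more than one thread, ./hello ends up with both a forked child process and several extra threads once MPI is linked in.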

I wonder if it would be useful to attach the pin tool after MPI_Init has already been called. We can look for the return of MPI_Init, then attach.

Proposed procedure:

  1. Add a function, Ariel_MPI_Init to the Ariel API that will be called immediately after MPI_Init returns (in the future, we can consider a more seamless way to do this, such as by catching PMPI_Init or with LD_PRELOAD)
  2. Ariel_MPI_Init can accept the rank of each calling process, and the rank that will be traced (in the future, we want to be able to specify the rank that will be traced in the sdl file)
  3. For the rank that will be traced, fork a pin process and tell it to trace the PID of the rank that is to be traced. (Caveat: for now, we are assuming everything runs on a single node)
  4. How does the pin process know what tunnel to attach to?

plavin avatar Apr 02 '24 23:04 plavin

This issue will be fixed once https://github.com/sstsimulator/sst-elements/commit/40c140f38fdb0c60280ddb649f0c37ffa2dad4f3 is merged.

The solution is to add a callback for forks using PIN_AddForkFunction and to call PIN_Disable() from the child process.
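One plausible reading of why this fix works: if a forked child stays instrumented, two processes write into the same tunnel, and their individually legal streams get mixed. The sketch below is an illustration rather than a proof. It assumes a strict round-robin merge, which real scheduling would not guarantee, and the function name interleave is invented for the example. The command codes are the ones from the tunnel trace earlier in this thread.

```python
from itertools import zip_longest

# Command codes from the tunnel trace earlier in this thread.
START, READ, WRITE, END, NOOP = 32, 2, 4, 64, 128


def interleave(parent, child):
    """Merge two message streams by alternating one message from each.

    Assumption: strict round-robin ordering, chosen only to keep the
    example deterministic.
    """
    merged = []
    for p, c in zip_longest(parent, child):
        if p is not None:
            merged.append(p)
        if c is not None:
            merged.append(c)
    return merged


# Each stream on its own is legal; the merged streams are not.
print(interleave([START, WRITE, END], [NOOP, NOOP]))
print(interleave([START, READ, END], [START, WRITE, END]))
print(interleave([START, WRITE, END], [NOOP]))
```

Under that assumption, the three merges give exactly the three sequences seen in the tunnel trace: [32, 128, 4, 128, 64], [32, 32, 2, 4, 64, 64] and [32, 128, 4, 64].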

plavin avatar Apr 03 '24 18:04 plavin