Atomic memory accesses in multi-core simulations
Hello,
I'm having some issues with atomic memory accesses in multi-core simulations. I'm currently working with ARM's LL/SC instructions, and the simulation we're running deadlocks in an infinite access loop where SC instructions from multiple threads continually fail.
I've attempted to discern how such accesses are handled in the memHierarchy sst-element library with a MESI coherence protocol, and I'm confused by the current implementation. Firstly, from what I can tell, if a cache line is shared, a write access will always miss the L1 cache to enforce MESI coherency, and there is then an opportunity for an SC to fail on its path back towards the core (namely in `MESIL1::handleGetXResp`). However, the `isStoreConditional()` call identifies an SC as an event with the `MemEventBase::F_LLSC` flag and `cmd_` equal to `Command::GetX`. We've used `StandardMem::StoreConditional` to create our SC events, which seems to set `cmd_` to `Command::Write`; thus it is never identified as an SC through `isStoreConditional()`. It is possible that it's changed elsewhere, but I cannot locate this.
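As a point of reference, here is a minimal sketch of the mismatch, assuming the check works roughly as described; the real code lives in memEvent.h and differs between SST versions, and the struct, method names, and flag value below are placeholders rather than the library's actual code. An SC created through `StandardMem::StoreConditional` carries `Command::Write`, so a check that only accepts `Command::GetX` never recognises it.

```cpp
// Illustrative sketch only -- not verbatim memHierarchy code. Check
// memEvent.h in your SST version for the actual implementation; the
// F_LLSC bit position here is a placeholder.
#include <cassert>
#include <cstdint>

enum class Command { GetX, Write };
constexpr uint32_t F_LLSC = 1u << 2;  // placeholder bit position (assumption)

struct MemEventSketch {
    Command  cmd_;
    uint32_t flags_;
    bool queryFlag(uint32_t f) const { return flags_ & f; }

    // Check as observed above: only GetX + F_LLSC counts as an SC, so an
    // SC that arrives as Command::Write is never recognised.
    bool isStoreConditionalGetXOnly() const {
        return cmd_ == Command::GetX && queryFlag(F_LLSC);
    }

    // Check that also accepts Write (see the reply below regarding the
    // devel branch): GetX or Write, plus F_LLSC.
    bool isStoreConditionalGetXOrWrite() const {
        return (cmd_ == Command::GetX || cmd_ == Command::Write)
               && queryFlag(F_LLSC);
    }
};

int main() {
    MemEventSketch sc{Command::Write, F_LLSC};   // SC issued via StandardMem::StoreConditional
    assert(!sc.isStoreConditionalGetXOnly());    // the mismatch described above
    assert(sc.isStoreConditionalGetXOrWrite());  // recognised once Write is accepted
    return 0;
}
```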
Secondly, regarding the initial problem stated in this issue: if concurrent LL/SC pairs are fired off to the memory hierarchy, there seems to be a pattern in which all SC atomic accesses can continually fail. For us, this occurs because of the way the MESI coherence protocol is enforced. I'll do my best to outline the behaviour below.
Take Threads A and B (running on two separate simulated cores), both of which have executed an LL instruction at the same address. Thread A's SC instruction misses L1D, as the accessed line is in MESI state S, and reaches the L2 first; this causes the line in Thread B's L1D to be invalidated and marked non-atomic (`LLSCAtomic_ == false`), so Thread B's SC subsequently fails. Thread A's SC succeeds, and Thread B's SC retries its access after the invalidation event has been resolved (when Thread B's SC reaches the L2, it hits an MSHR conflict condition and stalls). During this time, Thread C (running on another simulated core) executes its LL instruction and begins its SC instruction, again accessing the same address as Thread A's and B's SC instructions. Thread C's LL brings the line into Thread C's L1D, and when Thread B's SC retries its access in the L2, it starts another invalidation of the line that Thread C's LL has just brought into L1D. This invalidation, and the line being marked non-atomic, causes Thread C's SC to fail. Thread C's SC then follows a similar flow to Thread B's SC, invalidating from the L2 cache the atomic lines set by other threads' LL instructions. This pattern continues indefinitely and deadlocks the simulation.
I'd appreciate any comments on this, and corrections to any misunderstandings regarding the flow of LL/SC accesses through the memHierarchy sst-element library. For clarity, our simulation uses an in-house core simulator, with each core connecting to a `memHierarchy.Cache` (with a `memHierarchy.coherence.mesi_l1` subcomponent) through a `memHierarchy.standardInterface`. All L1D caches link to an L2 of type `memHierarchy.Cache`, with a coherence protocol of `memHierarchy.coherence.mesi_inclusive`, via a `memHierarchy.Bus` component. The issue triggers when running with 9 or more threads in our multi-threaded simulation of real-world micro-benchmarks.
Which version of SST are you using? In the devel branch, for example, `isStoreConditional` returns true if the request is a `GetX` or `Write` and the `F_LLSC` flag is set. That said, I have been debugging an issue in LL/SC where it livelocks, so I agree there is a bug somewhere. Here's the current logic (head of the devel branch); it is modeled after the RISC-V spec for LL/SC:
- MemH's StandardInterface converts the LL to a `Command::GetSX` (a request to read a block in exclusive state).
- Once the L1 has acquired the block in exclusive state, it sets the atomic flag on it and locks the line temporarily for some number of cycles (controlled by the `llsc_block_cycles` parameter). This prevents loss of the block so long as the SC arrives within that window (see the sketch after this list).
- If the SC arrives within the window, it succeeds.
- An SC has the `F_LLSC` flag set, and the command in MemH is either `Write` or `GetX`.
- Once the window expires, the block can be lost and livelock can occur. Things that typically cause loss of atomicity from core A:
  - A regular write to the block by core A
  - A write/GetX/LL/GetSX by core B
  - A read/GetS by core B
  - Eviction of the block from core A's L1
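To make the window behaviour concrete, here is a small conceptual sketch of the flow in the list above. It is not actual memHierarchy code; the class and member names are placeholders, and only `llsc_block_cycles` corresponds to a real parameter name.

```cpp
// Conceptual sketch of the LL/SC window described above -- not actual
// memHierarchy code; names other than llsc_block_cycles are placeholders.
#include <cstdint>
#include <iostream>

struct L1LineSketch {
    bool     atomic      = false;  // set when an LL (GetSX) acquires the line exclusively
    uint64_t lockedUntil = 0;      // cycle until which the line is held (llsc_block_cycles window)

    // LL: the line is acquired in exclusive state; mark it atomic and open the window.
    void handleLoadLink(uint64_t now, uint64_t llscBlockCycles) {
        atomic      = true;
        lockedUntil = now + llscBlockCycles;
    }

    // Events that clear atomicity once the window has expired: a local regular
    // write, a remote GetX/GetSX/GetS, or an eviction. Within the window the
    // line cannot be lost (memH effectively holds off the conflicting request).
    void handleAtomicityLoss(uint64_t now) {
        if (now >= lockedUntil)
            atomic = false;
    }

    // SC succeeds only if the line is still marked atomic; the reservation
    // is consumed either way.
    bool handleStoreConditional() {
        bool success = atomic;
        atomic = false;
        return success;
    }
};

int main() {
    L1LineSketch line;
    line.handleLoadLink(/*now=*/100, /*llscBlockCycles=*/10);
    line.handleAtomicityLoss(/*now=*/105);   // ignored: inside the window
    std::cout << "SC inside window:  " << line.handleStoreConditional() << "\n"; // 1

    line.handleLoadLink(/*now=*/200, /*llscBlockCycles=*/10);
    line.handleAtomicityLoss(/*now=*/215);   // window expired: reservation lost
    std::cout << "SC after the loss: " << line.handleStoreConditional() << "\n"; // 0
}
```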
Different implementations of LL/SC may have different restrictions on what can occur between an LL and SC without jeopardizing forward progress. So it's possible there's a bug in memH preventing forward progress and/or that the memH LL/SC implementation doesn't agree with the expected ARM implementation. FWIW, if multiple variants of LL/SC can co-exist in memH, I think it's worth supporting them.
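For illustration of why this matters on the core side (this is not SST-specific), the canonical LL/SC retry loop only terminates when the SC eventually succeeds, so loss of atomicity between every LL and its SC translates directly into livelock:

```cpp
// Core-side shape of an LL/SC retry loop, expressed with
// compare_exchange_weak (which typically compiles to an LDXR/STXR loop on
// ARMv8 without LSE atomics). If the memory system never lets any thread's
// SC succeed, every thread spins in this loop forever.
#include <atomic>
#include <iostream>

int fetch_add_llsc_style(std::atomic<int>& counter) {
    int observed = counter.load(std::memory_order_relaxed);        // LL side
    while (!counter.compare_exchange_weak(observed, observed + 1,  // SC side
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed)) {
        // An SC failure lands here; 'observed' is refreshed and we retry.
    }
    return observed;
}

int main() {
    std::atomic<int> counter{0};
    std::cout << fetch_add_llsc_style(counter) << " -> " << counter.load() << "\n";
}
```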
Thanks for the quick reply. Looks like our SST version is a bit out of date. The `isStoreConditional` change you mention matches a local change we made, so it's good to hear it's been implemented as such.
The modelled LL/SC seems like it should match ARM (for our use cases, anyway); there are some extra layers to do with acquire and release semantics, but for us that is largely non-SST-side logic, so there's no reliance on it there.
Are the `isStoreConditional` changes and the LL/SC logic described above included in v13.0.0_Final, or would they be in the latest commit on `devel` only?
Also, to double-check: the above LL/SC logic is implemented for the `StandardMem::LoadLink` and `StandardMem::StoreConditional` request types, not `StandardMem::ReadLock` and `StandardMem::WriteUnlock`, correct?
The changes are in v13.0.0_Final (the memEvent.h implementation of `isStoreConditional`).
Correct, the logic I described applies to `StandardMem::LoadLink` and `StandardMem::StoreConditional` only. `ReadLock` and `WriteUnlock` define a non-conditional atomic and are implemented differently.