Error: Received FetchInvX in unhandled state 'SM'

Open PhilippKasgen opened this issue 2 years ago • 4 comments

New Issue for sst-elements/memHierarchy

I'm using the MESI cache coherence protocol with one directory controller, two levels of cache, and three cores. The simulation unexpectedly stops after some time with this error message:

FATAL: l2cache_0, Error: Received FetchInvX in unhandled state 'SM'. Event: <265,0> FetchInvX Src: dc0 Dst: l2cache_0 Rq: None Flags: [] Addr: 0x3c0/0x3fc (G) Data: F VA: 0x0 IP: 0x0 Size: 64 Prf: F. Time = 163ns

Is this behavior intended? If not, I can try to strip down the traces for a minimal example.

PhilippKasgen avatar Apr 11 '22 15:04 PhilippKasgen

@PhilippKasgen That's definitely not intended! If you can get a minimal example that would be great, but it's likely some uncommon race in the coherence protocol so may be hard to find a smaller reproducer. Can you provide the cache/memory system configuration from your input file? In particular, are the L2s private or shared? And are they inclusive (that's the default if you didn't specify anything)?

gvoskuilen avatar Apr 12 '22 22:04 gvoskuilen

Thank you for the quick response! The L2s are private and inclusive.

What I know so far is that two cores are writing to the same cache line. The directory controller (DC) sends the FetchInvX to l2cache_0 when l2cache_2's write is accepted, but at that point l2cache_0 is already in SM state. Since l2cache_0 has to handle the FetchInvX without discarding its own write data, I think it should change to IM state and send its write request to the DC soon after. I don't know whether anything else needs to be managed inside the L2 along this transition.
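The transition suggested here can be written down as a toy lookup table. This is only a sketch of the proposal, not memHierarchy's implementation: the state and event names come from this thread, while the action names (`respond_to_directory`, `keep_pending_write`, `reissue_write_request`) and the function `handle_event` are hypothetical.

```python
# Toy model of the proposed (SM, FetchInvX) handling. NOT memHierarchy code.
# Action names are hypothetical labels for the steps described in the post.
from typing import Dict, List, Tuple

# (current_state, incoming_event) -> (next_state, actions)
PROPOSED_TRANSITIONS: Dict[Tuple[str, str], Tuple[str, List[str]]] = {
    # Proposal: answer the directory, keep the pending write data,
    # fall back to IM, and send the write request again.
    ("SM", "FetchInvX"): (
        "IM",
        ["respond_to_directory", "keep_pending_write", "reissue_write_request"],
    ),
}


def handle_event(state: str, event: str) -> Tuple[str, List[str]]:
    """Return (next_state, actions) or fail the way the simulator does."""
    try:
        return PROPOSED_TRANSITIONS[(state, event)]
    except KeyError:
        # Mirrors the wording of the fatal error reported in this issue.
        raise RuntimeError(
            f"Received {event} in unhandled state '{state}'"
        ) from None
```

With an empty table, `handle_event("SM", "FetchInvX")` raises the same kind of "unhandled state" error as in the report; with the entry above it returns `IM` plus the follow-up actions.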

PhilippKasgen avatar Apr 13 '22 15:04 PhilippKasgen

Can you double-check that the L2 caches and the directory all have their coherence parameter set to MESI? A mismatch could cause this. If the config is OK, it sounds like the real question is why l2cache_0 is in SM state: if the directory is sending a FetchInvX, it thinks l2cache_0 already has write permission on the block. I will trace through the transitions and see if I can identify why this scenario happens. Which CPU model are you using? I'm wondering if there are Flush instructions involved or if it's just read/write.

Another way to approach debugging this would be to generate a coherence trace. It may be time-consuming to go through (the trace can be quite large), so I'll see how far I can get with just looking through the protocol. If you want to try it, configure sst-core with "--enable-debug" and recompile sst-elements. After recompiling, add the parameters '"debug": 1' and '"debug_level" : 5' (or 6 for slightly more info) to the L2s and directory. The trace prints to stdout.
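As a rough sketch of that last step, the two debug parameters can be merged into the existing parameter dictionaries before they are passed to the L2s and the directory. The helper name `with_debug` and the example dictionaries are hypothetical; only `debug` and `debug_level` are taken from the description above.

```python
# Sketch: add the coherence-trace debug parameters to a parameter dict.
# `with_debug` and the example dicts are illustrative, not part of SST.
from typing import Any, Dict


def with_debug(params: Dict[str, Any], level: int = 5) -> Dict[str, Any]:
    """Return a copy of `params` with debug output enabled at `level`."""
    merged = dict(params)           # leave the caller's dict untouched
    merged["debug"] = 1             # turn debug output on
    merged["debug_level"] = level   # 5, or 6 for slightly more info
    return merged


# Placeholder parameter sets; a real config would hold the full settings.
l2_params = {"coherence_protocol": "MESI"}
dc_params = {"coherence_protocol": "MESI"}

l2_debug_params = with_debug(l2_params)
dc_debug_params = with_debug(dc_params, level=6)
```

In an actual input file the resulting dictionaries would be handed to each L2 and directory component, for example through the component's `addParams` call, assuming the config builds its parameters as dictionaries.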

gvoskuilen avatar Apr 13 '22 17:04 gvoskuilen

You can take the same setup as in #1934 and use out.txt as described there. Add 10 to all mshr_latency_cycles to trigger this unhandled state.
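For illustration, the "add 10" step could be applied to a set of parameter dictionaries like this. The function `bump_mshr_latency` and the example dictionaries are hypothetical; the sketch assumes each cache sets `mshr_latency_cycles` explicitly and leaves dictionaries without that key unchanged, since the default value is not stated in this thread.

```python
# Sketch: raise mshr_latency_cycles by a fixed amount in each param dict.
# Helper name and example values are illustrative only.
from typing import Any, Dict, List


def bump_mshr_latency(params: Dict[str, Any], delta: int = 10) -> Dict[str, Any]:
    """Return a copy of `params` with mshr_latency_cycles increased by `delta`."""
    bumped = dict(params)
    if "mshr_latency_cycles" in bumped:
        # Values may be given as ints or numeric strings in a config.
        bumped["mshr_latency_cycles"] = int(bumped["mshr_latency_cycles"]) + delta
    return bumped


def bump_all(param_sets: List[Dict[str, Any]], delta: int = 10) -> List[Dict[str, Any]]:
    """Apply the same increase to every parameter dict in the list."""
    return [bump_mshr_latency(p, delta) for p in param_sets]
```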

PhilippKasgen avatar Sep 19 '22 15:09 PhilippKasgen