
Interrupts can lead to an MEPC that is inconsistent with the register state / the actually retired operations, or that points to entirely illegal instructions

Open ramonwirsch opened this issue 2 years ago • 6 comments

When an interrupt such as a timer interrupt occurs, the captured MEPC does not always reflect the correct address at which to resume execution after the interrupt handler finishes. This can randomly lead to significant state corruption and crashes.

I have observed the following three inconsistencies:

  • CSR modifications that have already started will always finish, yet MEPC can still point at the CSR instruction (treating it as not executed, so that execution resumes there after the interrupt)
  • MEPC can sometimes capture a normal (register-writing) instruction that still retires (when that instruction retires in the same cycle as the capture). As with CSR modifications, if the instruction uses its own output as an input (e.g. addi a0, a0, 1), the erroneous re-execution breaks state and MEPC is inconsistent with the register state
  • the addresses of prefetched but never-executed (and possibly illegal) instructions can be captured
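
To make the second case concrete, here is a toy Python model (the register values and the instruction sequence are made up for illustration) of what the erroneous re-execution does to architectural state:

```python
# Toy model of case 2: the interrupt captures the PC of an instruction
# that nevertheless retires, so after mret the handler returns to that
# same instruction and it executes a second time.
regs = {"a0": 5}

def addi(rd, rs1, imm):
    """Model of `addi rd, rs1, imm`."""
    regs[rd] = regs[rs1] + imm

addi("a0", "a0", 1)   # retires in the same cycle the interrupt is taken
# ... interrupt handler runs, mret jumps back to the captured MEPC ...
addi("a0", "a0", 1)   # erroneous re-execution of the same instruction
# Architecturally a0 should now be 6; the re-execution left it at 7.
```

Because the instruction reads its own previous result, the double execution is visible in the register file, which is exactly why MEPC capture and retirement must not race.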

Example for the 3rd case:

```asm
jr [label]              # jump is taken, flushing the pipeline
[illegal instruction]   # fetched, but never executed
```

The illegal instruction will be fetched, and possibly decoded, before CVA5 executes the jump and flushes the pipeline. gc_unit does not prevent its address from being captured into MEPC when an interrupt arrives at just the wrong time.

I have fixes for all three cases applied in my fork of CVA5, but my fork includes other changes, such as an FPU, a reworked Verilator implementation, and, for example, a timer interrupt input modified to match the RISC-V privileged spec (#20). I have also written a test program that recreates all three cases for my fork and my specific CVA5 configuration, using a timer implemented according to the privileged spec. The benchmark arms the timer and expects it to request an interrupt at a specific cycle (and therefore at a specific instruction in the benchmark), which is still influenced by the branch predictor (I am using my production CVA5 config to run it).

I am also happy to share said benchmark and timer, but I have only tested them in my environment, which uses my own Picolibc build, runtime/board-support library, CMake configuration, and a new binary input format and conversion tool for the Verilator simulator. The test code would therefore probably need to be refactored into a test with a specific, more reproducible processor config that uses only the vanilla code base.

To outline the scope of fixes I have applied:

  • decouple MEPC capture from interrupt_taken signal
  • delay MEPC capture while an instruction is currently retiring or a CSR modification is ongoing
  • changed oldest_pc from the metadata_id block to next_retiring_insn_pc, which applies branch_flush events and updates itself so that it never contains illegal or unreachable addresses
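
The gating condition of the first two fixes can be sketched in a few lines of Python (the signal names are illustrative, not the actual CVA5 signals):

```python
def capture_mepc(next_retiring_insn_pc, instruction_retiring, csr_write_in_flight):
    """Sketch of the MEPC-capture gating described above: capture only when
    no instruction is retiring this cycle and no CSR modification is in
    flight, and take the value from a PC that is guaranteed to be executed
    next, never from a raw prefetch address."""
    if instruction_retiring or csr_write_in_flight:
        return None  # delay the capture (and the interrupt) by one cycle
    return next_retiring_insn_pc
```

Delaying by a cycle whenever something architecturally visible is mid-flight is what decouples MEPC capture from the interrupt_taken signal.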

This was the simplest solution to fix the corruption and is probably not the most efficient way to implement interrupts. Looking at the timing, it seems that every instruction that was already issued before the interrupt could simply retire normally instead of being rolled back (unlike when handling jumps/branches). Since we have to wait for all issued ops to retire anyway, it is wasteful to throw that state away (and it can also take more cycles); some of my modifications would then no longer be needed. But I found it too difficult to understand all the ways in which gc_unit needs to stay synchronized with other parts of the core to attempt that modification myself, and expected the refactoring to be more work.

I am afraid I currently do not have the time to rebase my changes and test them outside my code base, so I cannot provide a simple, conflict-free, tested patch set.

ramonwirsch avatar Nov 02 '23 15:11 ramonwirsch

Just curious, does this issue occur using the accelerators-2023 branch? I know a bunch of changes were made relating to various things (like decoding and CSRs), and wonder if the issue persists.

CKeilbar avatar Nov 16 '23 22:11 CKeilbar

I have not ported my changes over to that branch just to test whether it fixes these issues, especially without knowing why it has not been merged yet.

But from a cursory look over the code, it does not appear that the behavior of gc_unit, csr_unit, load_store_unit, or oldest_pc in id_block has changed with respect to the issues I found. oldest_pc can still contain invalid addresses, and gc_unit still attempts to "suppress" / cancel already-issued operations without regard to whether those operations have side effects that have already taken place and cannot be reverted without breaking the processor state. #22 also affects this.

ramonwirsch avatar Nov 17 '23 09:11 ramonwirsch

After delving deeper into #22, I have now concluded that the fault does not lie with the handling of peripheral loads. Those do not need to be committed separately, as long as CVA5 can guarantee to apply branches, jumps, and exceptions within one cycle of issuing the causing instruction (as it currently does; I have not checked whether TLBs would change that). That stores are committed not immediately but only at retirement merely makes it look like the processor expects already-issued stores to potentially be rolled back; if that were really done, peripheral loads would have to be handled the same way. Ultimately, the gc_unit / interrupt behavior is what is broken: it must never attempt to roll back / suppress an already-issued instruction that is not throwing an exception, because it can never know what has already been committed and what has not.

My above intuition, that simply letting post-issue instructions retire would be faster, is not only faster but required for correctness. And the code that exists to handle the same situation for exceptions must never become active, for the same reason. There seems to be exactly one valid use for the suppress_writeback signal: causing the renamer to roll back the register mapping of the failing instruction's result. All other uses are either irrelevant or cannot work. The post-issue-discard phase could be shortened to a single retiring instruction (the one causing the exception, if it has a result), and it seems to be needed at all only when post-issue-drain did not already let everything retire.

ramonwirsch avatar Nov 17 '23 20:11 ramonwirsch

First of all, thank you very much for your detailed description and insight!

I think this state logic has suffered a bit due to changing pipeline behaviours over time. Right now, as you've found, some of the architecturally visible changes (e.g. CSR commits) happen too soon in the exception/interrupt process. I'll look at restructuring this logic to better separate out the various pipeline control events.

At one point there were thoughts about offering two styles of interrupt handling:

  • take the interrupt immediately and discard in-flight instructions
  • allow issued instructions to complete before taking the interrupt

You are correct that, given the current renaming scheme, we have to wait for issued instructions to complete regardless of whether they are being discarded. Thus the first approach isn't useful; however, there is a complication with the possibility of an exception occurring on one of the post-issue instructions.

At present, all exceptions are caught within the first cycle post-issue (and thus able to prevent additional instructions from being issued) with the exception of future support for TLB exceptions, which can occur any number of cycles later.

Updated plan for interrupt/exception behaviour:

Interrupt:

  • flush pre-issue, update pc
  • begin fetching, but hold at decode
  • wait for post-issue instructions to complete (optionally: could wait just for instructions that could cause an exception)
  • if a post-issue exception occurs
    • switch to exception path
  • else
    • update CSRs
    • allow decode to resume

Exception:

  • flush pre-issue, update pc
  • begin fetching, but hold at decode
  • hold until post-issue drained
  • update CSRs
  • allow decode to resume
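
The decision part of the interrupt sequence above can be sketched as a small Python model (a hypothetical simplification, not the actual gc_unit state machine; names are made up):

```python
from enum import Enum, auto

class Phase(Enum):
    DRAIN = auto()      # still waiting for post-issue instructions to complete
    EXCEPTION = auto()  # a post-issue instruction faulted: switch to the exception path
    RESUME = auto()     # update CSRs, allow decode to resume

def interrupt_step(post_issue_count, post_issue_exception):
    """One per-cycle decision of the planned interrupt sequence. Pre-issue
    has already been flushed and fetch restarted but held at decode before
    this is evaluated."""
    if post_issue_exception:
        return Phase.EXCEPTION
    if post_issue_count > 0:
        return Phase.DRAIN
    return Phase.RESUME
```

The exception sequence is the same model with the EXCEPTION branch removed, since by the time it runs the faulting instruction has already been identified.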

e-matthews avatar Nov 19 '23 17:11 e-matthews

As a followup on the issue of I/O and exceptions, long term this will need more complex support. I'm currently thinking that we should have a list of I/O regions, and any load to an I/O region would be held until it is the oldest instruction before the request is sent to memory. In terms of implementation, it can either be tagged when placed in the load queue or detected when it reaches the head of the load queue. From there, we must hold it until it is the oldest post-issue instruction. The holding can be done in the load queue (simplest), in the sub-units (passes on the complexity), or with a separate queue for re-ordering (best throughput/performance).
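
A minimal Python sketch of that "hold I/O loads until oldest" policy (the region addresses and names are invented for illustration; this is not the CVA5 implementation):

```python
# Hypothetical I/O region list; a real config would come from the
# memory-layout description discussed below.
IO_REGIONS = [(0x6000_0000, 0x6FFF_FFFF)]

def in_io_region(addr):
    return any(lo <= addr <= hi for lo, hi in IO_REGIONS)

def load_may_issue(addr, insn_id, oldest_post_issue_id):
    """Ordinary cacheable loads may issue speculatively; a load to an I/O
    region is held until it is the oldest post-issue instruction, so it can
    no longer be squashed by an interrupt or an earlier exception."""
    if not in_io_region(addr):
        return True
    return insn_id == oldest_post_issue_id
```

Tagging at enqueue corresponds to evaluating `in_io_region` once when the load enters the queue; detecting at the head corresponds to evaluating it on each issue attempt.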

The broader question is how to best express the memory layout for the processor. We need the memory range for each unit, I/O zones within each larger memory range, and potentially AMO zones as well (as even once that is fixed, it is likely to be for cacheable memory only).

I will think about a clean way of implementing this, as there are advantages to packing all of a memory sub-unit's config options and ranges together: they could then be pulled out of the fetch and load-store-unit modules, simplifying them and making it easier for others to add new memory sub-units themselves.

e-matthews avatar Nov 19 '23 19:11 e-matthews

Regarding the need to potentially abort / delay interrupting because already-issued instructions require exceptional behavior: this kind of race condition already exists in the processor if an interrupt is received while trying to disable global interrupts via a CSR write. I just hit this situation and had to write a small test to reproduce it. While I observed it with my changes, I think it should also apply to vanilla CVA5 and deadlock the CPU.

Regarding the I/O commit: before I realized that the explicit post-issue commit was unnecessary for the current CVA5, I already had a basic implementation running that allowed explicit commit of memory loads.

For AMO support I had already added support for fusing load-queue and store-queue results, as AMO ops need fields that are currently split across both queues (the data field, and the id and other bits); this also ensures memory ordering (though stricter than RISC-V demands).

So adding support for I/O loads to also be placed into the store queue and wait on a signal from id_block was rather easy. I've solved this by moving the address-range and subunit selection in front of the queues, so I can recognize whether a load is an I/O load or a normal load (the selected subunit signals whether its loads are potentially destructive). Any load exiting the LSQ is signalled as committed to id_block, which required additional tracking and an additional counter for post-commit operations (as a subset of post-issue) so that gc_unit can explicitly wait for those to finish.

But due to the poor performance of I/O loads, I have now disabled that behavior for the peripheral bus again. It made me wonder whether it would be a good strategy for id_block to count operations that can have a post-issue control-flow impact, so that loads and stores sufficiently far away from instructions that might throw an exception or branch can be committed early enough not to cause additional delays on average.

> AMO zones as well (as even once that is fixed, it is likely to be for cacheable memory only)

Do you have references for how production-level processors handle this? Is it really handled such that code running from local memory (like bootloaders and bootstrapping code) must not use atomic primitives, even in multi-threaded situations, and must resort to simply locking out interrupts around every implementation of C-level atomics? I only looked at GCC's implementation, which does not support this at all unless the AMO extension is present, and extrapolated from that that AMO should be supported on local memory as well for code to be sensibly portable. Excluding local memory would also enforce a strict separation between code running from local memory and code running from cacheable memory, where you could not even have callbacks into local memory, as the C-level atomic implementations would not be compatible with one another, even if you compiled them for different instruction sets and supplied your own atomic implementations. The exception would be using local memory as pure instruction memory/ROM, which requires that cached memory be available immediately, without any complex setup, in multi-core situations.

ramonwirsch avatar Nov 21 '23 10:11 ramonwirsch