
VLSU Support

Open aarongchan opened this issue 1 year ago • 3 comments

PR Goal: Implement VLSU instruction support. Current implementation uses the LSU design and adds vector iterations based on VLEN and data width of VLSU.

aarongchan avatar Jul 18 '24 05:07 aarongchan

@kathlenemagnus When you get the chance, could you look at VLSU.cpp (specifically completeInst_) for context for the rest of this comment? I've updated my code to work around how the LSU handles stores, but the issue I ran into and had to work around can be summarized as:

  • Stores get retired before "completing" in the LSU. This is an issue because:
    • We need to send multiple memory requests through. If one vector store needs 16 passes, i.e. 16 memory operations, we need to send all of them through. However, the wrappers for LoadStoreInfo and MemoryAccessInfo, which contain the cache, TLB, and MMU status, include the InstPtr. This becomes an issue with multiple requests, because the InstPtr already has a retired status, but subsequent requests will try to set the status again, resulting in an error.
    • The workaround I used was to create a separate status for each LoadStoreInfo wrapper, so each memory request has its own status, separate from the instruction's. A couple of questions:
  • Currently in my design, the memory requests are sent through the VLSU serially, so we have to wait for the first pass to finish before the 2nd pass can begin for the same vector instruction. This is due to the current design of the LSU, which assumes one memory request per instruction (true for the LSU, but not the VLSU). The benefit is that multiple VLSU instructions fully utilize the pipeline: while instruction A's memory request is waiting on the cache, instruction B's memory request is at a different stage of the VLSU pipeline. My question is: does this type of design make sense? Or should I rearchitect it along the lines of the uop generator in the VLSU that Knute had suggested?
  • The VLSU uop generator idea would only process one VLSU instruction at a time, but generate, let's say, 16 memory requests that then queue up in the VLSU. The flow would look like:
vlsu_uop_generator -> generates 16 memory requests for instruction A
vlsu_ldst_queue_ -> takes the memory requests, queues them up, begins processing them through the pipeline
vlsu -> once all memory requests are through, we officially retire instruction A
(repeat for all vlsu instructions)
vlsu_uop_generator -> generates 16 memory requests for instruction B
...
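As a rough illustration of that flow, a uop generator could expand one vector instruction into VLEN/data-width memory requests, each carrying its own completion status separate from the instruction's. This is a minimal sketch with made-up names (VectorInst, MemRequest, generateUops), not the actual riscv-perf-model classes:

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Hypothetical stand-in for the vector instruction. With VLEN = 512 and a
// 32-bit data width, one instruction expands into 16 passes.
struct VectorInst {
    uint32_t vlen_bits = 512;
    uint32_t data_width_bits = 32;
    uint32_t outstanding = 0;   // requests not yet completed
    bool retired = false;
};

// Hypothetical stand-in for a LoadStoreInfo-style wrapper: each request has
// its own completed flag rather than reusing the parent InstPtr's status.
struct MemRequest {
    VectorInst* parent;
    uint32_t pass_index;
    bool completed = false;
};

// Expand a vector instruction into one memory request per pass.
std::deque<MemRequest> generateUops(VectorInst& inst) {
    const uint32_t passes = inst.vlen_bits / inst.data_width_bits;
    inst.outstanding = passes;
    std::deque<MemRequest> q;
    for (uint32_t i = 0; i < passes; ++i) {
        q.push_back(MemRequest{&inst, i});
    }
    return q;
}

// Complete one request; retire the parent only once all passes are done.
void completeRequest(MemRequest& req) {
    req.completed = true;
    if (--req.parent->outstanding == 0) {
        req.parent->retired = true;
    }
}
```

Here retirement is driven by a simple outstanding-request counter; the real model would tie this into completeInst_ and the pipeline's event scheduling rather than a plain function call.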

I wanted to check all this with you and Knute before proceeding, because it will change a bit of how the VLSU works compared to the LSU, especially around setting the instruction status versus creating a LoadStoreInfo status, due to the nature of VLSU instructions.

aarongchan avatar Jul 23 '24 15:07 aarongchan

> The workaround I used was to create a separate status for each LoadStoreInfo wrapper, so each memory request has its own status, separate from the instruction's.

I think this is the right solution. Each memory access should be completed separately and then a vector instruction can be marked completed if all of its memory accesses have completed.

> Currently in my design, the memory requests are sent through the VLSU serially, so we have to wait for the first pass to finish before the 2nd pass can begin for the same vector instruction. This is due to the current design of the LSU, which assumes one memory request per instruction (true for the LSU, but not the VLSU). The benefit is that multiple VLSU instructions fully utilize the pipeline: while instruction A's memory request is waiting on the cache, instruction B's memory request is at a different stage of the VLSU pipeline. My question is: does this type of design make sense? Or should I rearchitect it along the lines of the uop generator in the VLSU that Knute had suggested?

I would rearchitect as Knute suggested. What we want is for a vector instruction to be able to generate multiple memory accesses. They can be executed serially, but they should be able to be pipelined: one vector instruction should be able to send a uop down the LSU pipeline every cycle. Similar to how we used to track whether a parent instruction was "done" in the ROB, a vector load or store should expect multiple memory accesses to complete before it can be marked as done. One thing to be careful of, though, is that there should be only one writeback to the vector destination, so you will need some sort of structure for collecting all of the data returned to the LSU.
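The "collect everything, then write back once" idea could be sketched as below. This is an illustrative structure (the name VectorLoadCollector and its interface are invented for this sketch): each completed memory access deposits its element data, and only the final deposit signals that the single writeback may occur.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative collector for a vector load split into N memory accesses.
// Accesses may complete out of order; writeback happens exactly once,
// when the last outstanding access has deposited its data.
class VectorLoadCollector {
public:
    explicit VectorLoadCollector(size_t num_accesses)
        : data_(num_accesses, 0), done_(num_accesses, false) {}

    // Record the data returned for one access. Returns true if this was
    // the final access, i.e. the single writeback may now be triggered.
    bool deposit(size_t idx, uint64_t value) {
        if (!done_[idx]) {          // ignore duplicate completions
            done_[idx] = true;
            data_[idx] = value;
            ++completed_;
        }
        return completed_ == data_.size();
    }

    const std::vector<uint64_t>& data() const { return data_; }

private:
    std::vector<uint64_t> data_;    // element data, indexed by access
    std::vector<bool> done_;        // per-access completion flags
    size_t completed_ = 0;
};
```

In the real model this would live alongside the vector instruction's entry in the LDST queue, with the final `deposit` scheduling the writeback event.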

> The VLSU uop generator idea would only process one VLSU instruction at a time, but generate, let's say, 16 memory requests that then queue up in the VLSU. The flow would look like:

> vlsu_uop_generator -> generates 16 memory requests for instruction A
> vlsu_ldst_queue_ -> takes the memory requests, queues them up, begins processing them through the pipeline
> vlsu -> once all memory requests are through, we officially retire instruction A
> (repeat for all vlsu instructions)
> vlsu_uop_generator -> generates 16 memory requests for instruction B

> I wanted to check all this with you and Knute before proceeding, because it will change a bit of how the VLSU works compared to the LSU, especially around setting the instruction status versus creating a LoadStoreInfo status, due to the nature of VLSU instructions.

Let's talk about this more on Monday. I see two paths forward here:

  1. We could do as you suggest and "sequence" the vector load store instruction when it gets to the LSU and generate a memory request that is stored in the LDST queue. These memory requests would behave like scalar instructions in the queue. The question here is when to trigger WB. It's inefficient to writeback to the vector destination 16 times and it is not always possible to do partial writes to the register file.
  2. Another option would be to keep the vector instruction in the LDST queue and send it down the LSU pipe multiple times. Each time it gets sent down the pipe it would generate a different memory access. Again, we need a way to track all of the data that has come back and make sure it's all ready before sending the vector instruction down the pipe a final time to do writeback.
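Option 2 (keeping the vector instruction resident in the LDST queue and replaying it down the pipe) could be sketched as follows. All names here are hypothetical; the assumption, for brevity, is that each trip down the pipe is either one data access or, once all data has returned, the single final writeback pass:

```cpp
#include <cstdint>
#include <string>

// Illustrative model of option 2: the instruction stays in the LDST queue
// and is issued down the pipe once per memory access, then one final time
// for writeback after all accesses have returned.
struct VectorLdst {
    uint32_t total_passes;
    uint32_t issued = 0;
    uint32_t returned = 0;
    bool wrote_back = false;
};

// One trip down the pipe. Returns a label describing what this pass did.
std::string issuePass(VectorLdst& inst) {
    if (inst.issued < inst.total_passes) {
        uint32_t pass = inst.issued++;
        inst.returned++;    // model the access returning immediately, for brevity
        return "access:" + std::to_string(pass);
    }
    if (inst.returned == inst.total_passes && !inst.wrote_back) {
        inst.wrote_back = true;   // exactly one writeback to the vector dest
        return "writeback";
    }
    return "none";
}
```

A real pipeline would decouple issue from return (accesses complete cycles later, possibly out of order), but the invariant is the same: the writeback pass is gated on all data accesses having returned.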

kathlenemagnus avatar Jul 28 '24 19:07 kathlenemagnus

@klingaard @kathlenemagnus this should be ready for review/merge.

aarongchan avatar Aug 05 '24 03:08 aarongchan

Closing due to @kathlenemagnus's PR

aarongchan avatar Nov 08 '24 23:11 aarongchan