riscv-perf-model Multiple LSU pipeline support

Hi Arup and Knute, after I talked with Arup, I decided to implement store buffer first to enable data forwarding and then implement multi pipelines. Here is the draft pull. For my implementation, 30% tests passed, 69 tests failed out of 99, when I look those results I feel really lost on how to begin debug. Could you give me some guidance? Thank you!

Nov 01 '24 17:11 MayneMei

@MayneMei Please reach out to Knute and see if he is available for a meeting to address your questions about debugging.

Nov 10 '24 10:11 arupc

@MayneMei I fixed the macOS build.

BTW I am going to release my VLSU branch, which may conflict with some of your LSU changes. Let me know if you have any trouble updating your branch.

Nov 25 '24 21:11 kathlenemagnus

When I added print message inside the LSU.cpp, I can see the store buffer size was modified. But in the test it's always 0. However, the cycle time met the expectation of data forwarding, so I assume it's correct

Dec 06 '24 08:12 MayneMei

@MayneMei my recent PR introduces some merge conflicts for you. If you have any trouble with the merge, please let me know and I can help resolve it.

Dec 17 '24 20:12 kathlenemagnus

@kathlenemagnus I resolved conflict. However during the regression test, there is an environment issue related to MacOS, and also test "BranchPred_test_Run" introduced a segfault. Other tests are all passed. Do you mind take a look at it? Thank you!

Jan 24 '25 18:01 MayneMei

@MayneMei it looks like you rebased your branch incorrectly. The PR contains all of the fixes that are already in master -- perhaps you rebased all of the updates since you pulled last?

Mar 23 '25 13:03 klingaard

@klingaard I reseted the commit and rebased it.

Mar 23 '25 19:03 MayneMei

Thanks for the suggestion! I should have a following PR in next week.

Knute Lingaard @.***>于2025年5月26日周一12:37写道：

@.**** approved this pull request.

Thanks for making the proper fix. We can move forward with this, but I do have a suggestion for improving the performance/readability of it. Feel free to merge and make a subsequent PR to change the alg if you want.

In core/lsu/LSU.cpp https://github.com/riscv-software-src/riscv-perf-model/pull/216#discussion_r2107781558 :
   std::vector<bool> coverage_mask(load_size, false);
   uint32_t bytes_covered_count = 0;
   // Iterate through the store_buffer_ from youngest to oldest.
   // This ensures that if multiple stores write to the same byte,
   // the data from the youngest store is effectively used.
   for (auto it = store_buffer_.rbegin(); it != store_buffer_.rend(); ++it)
   {
       const auto& store_info_ptr = *it; // LoadStoreInstInfoPtr
       const InstPtr& store_inst_ptr = store_info_ptr->getInstPtr();
       const uint64_t store_addr = store_inst_ptr->getTargetVAddr();
       const uint32_t store_size = store_inst_ptr->getMemAccessSize();
       if (store_size == 0) {
           continue; // Skip stores that don't actually write data.
Which ones are those?

In core/lsu/LSU.cpp https://github.com/riscv-software-src/riscv-perf-model/pull/216#discussion_r2107793832 :
   store_buffer_.push_back(store_info_ptr);
   ILOG("Store added to store buffer: " << inst_ptr);
}

bool LSU::tryStoreToLoadForwarding(const InstPtr& load_inst_ptr) const

{
   const uint64_t load_addr = load_inst_ptr->getTargetVAddr();
   const uint32_t load_size = load_inst_ptr->getMemAccessSize();
   // A load must have a non-zero size to access memory.
   if (load_size == 0) {
       return false;
   }
   std::vector<bool> coverage_mask(load_size, false);
This defaults to a specialization of std::vector which is really a dynamic std::bitset. Suggest using a boost::dynamic_bitset for the implementation. See my next points...

Since we don't have a concept of a store buffer class, which could maintain the valid bits per cache line, you can kinda simulate that here.

First, you can make a rule that store to load forwarding cannot cross a cache line. Makes things a little easier.

If the load addr + size is good create a bitmask of the bytes that are needed by the load based on the size of the cache line

Iterate the store buffer like you're doing

Create a bitmask for the store (like you did for the load) and clear the bits in the load bit mask that match:

load_bytes_needed &= ~store_bytes; if (load_bytes_needed.none()) { return true; // we found all the stores that can forward to this load within the cache line }

— Reply to this email directly, view it on GitHub https://github.com/riscv-software-src/riscv-perf-model/pull/216#pullrequestreview-2869063971, or unsubscribe https://github.com/notifications/unsubscribe-auth/AQSAIYBUXLDB7WPTNGHL3M33ANUPLAVCNFSM6AAAAABRAUMZJKVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDQNRZGA3DGOJXGE . You are receiving this because you were mentioned.Message ID: @.*** com>

May 26 '25 19:05 MayneMei

@MayneMei do you have any updates to push?

Jul 07 '25 16:07 knute-mips

We'll go ahead and merge this. There are a few folks comping at the bit to enhance this block and are waiting for this PR.

Aug 01 '25 13:08 klingaard

riscv-perf-model riscv-perf-model copied to clipboard

Multiple LSU pipeline support

Thanks for making the proper fix. We can move forward with this, but I do have a suggestion for improving the performance/readability of it. Feel free to merge and make a subsequent PR to change the alg if you want.

Which ones are those?

riscv-perf-model
riscv-perf-model copied to clipboard