gem5 icon indicating copy to clipboard operation
gem5 copied to clipboard

cpu-minor: Minor CPU load performance fix

Open ayc62 opened this issue 1 year ago • 3 comments

To preface, this is our first pull request! Please let us know if there is anything we are doing incorrectly.

I am currently testing the Minor CPU to understand if it is fit for our modeling purposes and am looking at a load benchmark with ~400 independent back-to-back LW instructions. When the Minor CPU is configured to be scalar, I expect that the average CPI should be 1; however, I've instead observed a CPI of 2. To be clear, I do not actually know if this is a performance bug, just that there's is a difference in the performance observed and expected. This CPI of 2 persists even when I set executeAllowEarlyMemoryIssue = True, and tag_latency = cache_latency = response_latency = 0 for the data cache.

Here is code snippet that you can run to observe the issue:

#define M5OP_INST(delay_reg, period_reg) \
    do { \
        asm volatile( \
            ".insn r 0x7b, 0x0, %[dump_reset], x0, x0, x0" \
            : \
            : [dump_reset] "i" (M5OP_DUMP_RESET_STATS), \
              "r" (delay_reg), \
              "r" (period_reg) \
            : \
        ); \
    } while(0)
 
#define LW_N_TIMES(N) \
    __asm__ volatile( \
        ".rept " #N "\n" \
          "lw t0, 0(%[rz]) \n" \
          "lw t1, 4(%[rz]) \n" \
          "lw t2, 8(%[rz]) \n" \
          "lw t3, 12(%[rz]) \n" \
          "lw t4, 16(%[rz]) \n" \
          "lw t5, 20(%[rz]) \n" \
        ".endr" \
        : : \
        [rz]"r"(z) \
    )
 
void lw(uint32_t* z) {
  register unsigned delay_reg __asm__("a0") = 0;
  register unsigned period_reg __asm__("a1") = 0;
 
  M5OP_INST(delay_reg, period_reg);
 
  LW_N_TIMES(64);
 
  delay_reg  = 0;
  period_reg = 0;
 
  M5OP_INST(delay_reg, period_reg);
 
}
 
int main( void )
{
 
  uint32_t z[256];
  lw(z);
  lw(z);
 
  return 0;
}

The first lw function call warms up the instruction cache, and the CPI of 2 can be observed with the second lw function call. My group has developed a line trace tool to display instructions at each stage of the CPU. From left to right, the following CPU stages are displayed: Fetch1, Fetch2, Decode, Issue, and Commit. Data cache requests and responses are shown to the right. A stall in a particular stage is signified with a '#'. The code example above results in the following trace behavior:

|    Fetch1     |  Fetch2  |    Decode      |   Issue  |  Commit  ||   DCache  >  DCache
|Cycle Num:     |    PC    |PC:opcode       |PC:opcode |PC:opcode ||  Requests > Responses
 
3126: #         |0x10146   |144:c_addi      |142:c_addi|140:c_addi||           >         
3127: #         |0x1014a   |146:lw          |144:c_addi|142:c_addi||           >         
3128: #         |0x1014e   |14a:lw          |146:lw    |144:c_addi||           >         
3129: #         |0x10152   |14e:lw          |14a:lw    |          ||0x00002b30 > 0x00002b30
3130: #         |#         |#               |#         |146:lw    ||           >         
3131: #         |0x10156   |152:lw          |14e:lw    |          ||0x00002b34 > 0x00002b34
3132: #         |#         |#               |#         |14a:lw    ||           >         
3133: #         |0x1015a   |156:lw          |152:lw    |          ||0x00002b38 > 0x00002b38
3134: #         |#         |#               |#         |14e:lw    ||           >         
3135: #         |0x1015e   |15a:lw          |156:lw    |          ||0x00002b3c > 0x00002b3c
3136: #         |#         |#               |#         |152:lw    ||           >         
3137: #         |0x10162   |15e:lw          |15a:lw    |          ||0x00002b40 > 0x00002b40
3138: #         |#         |#               |#         |156:lw    ||           >         
3139: #         |0x10166   |162:lw          |15e:lw    |          ||0x00002b44 > 0x00002b44
3140: #         |#         |#               |#         |15a:lw    ||           >         
3141: #         |0x1016a   |166:lw          |162:lw    |          ||0x00002b30 > 0x00002b30
3142: #         |#         |#               |#         |15e:lw    ||           >         
3143: #         |0x1016e   |16a:lw          |166:lw    |          ||0x00002b34 > 0x00002b34
3144: #         |#         |#               |#         |162:lw    ||           >         
3145: #         |0x10172   |16e:lw          |16a:lw    |          ||0x00002b38 > 0x00002b38

We would expect the CPI to be 1; however, as can be seen in the trace above, this is not the case and the backend has an occupancy of 2. The fix aims to achieve a CPI of 1 and modifies the file src/cpu/minor/execute.cc. After the commit limit has been reached, if the next instruction at the head of the inFlightInsts queue is a load then the load instruction is initiated.

The resulting behavior is then observed:

3088: #         |0x10146   |144:c_addi      |142:c_addi|140:c_addi||           >         
3089: #         |0x1014a   |146:lw          |144:c_addi|142:c_addi||           >         
3090: #         |0x1014e   |14a:lw          |146:lw    |144:c_addi||           >         
3091: #         |0x10152   |14e:lw          |14a:lw    |          ||0x00002b30 > 0x00002b30
3092: #         |0x10156   |152:lw          |14e:lw    |146:lw    ||0x00002b34 > 0x00002b34
3093: #         |0x1015a   |156:lw          |152:lw    |14a:lw    ||0x00002b38 > 0x00002b38
3094: #         |0x1015e   |15a:lw          |156:lw    |14e:lw    ||0x00002b3c > 0x00002b3c
3095: #         |0x10162   |15e:lw          |15a:lw    |152:lw    ||0x00002b40 > 0x00002b40
3096: #         |0x10166   |162:lw          |15e:lw    |156:lw    ||0x00002b44 > 0x00002b44
3097: #         |0x1016a   |166:lw          |162:lw    |15a:lw    ||0x00002b30 > 0x00002b30
3098: #         |0x1016e   |16a:lw          |166:lw    |15e:lw    ||0x00002b34 > 0x00002b34
3099: #         |0x10172   |16e:lw          |16a:lw    |162:lw    ||0x00002b38 > 0x00002b38

ayc62 avatar Aug 23 '24 20:08 ayc62

Hi @ayc62,

None of the commits in this pull request contains a Change-ID, which we require for any changes made to gem5. To automatically insert one, run the following:

f=.git/hooks/commit-msg
mkdir -p $f
curl -Lo $f https://gerrit-review.googlesource.com/tools/hooks/commit-msg
chmod +x $f

Then, amend the commit with git commit --amend --no-edit, and update your pull request. Thank you.

Additionally, we have a style guide for commit messages, and they must follow our guidelines. Please refer to a guide to contributing and search for "Committing" for more details.

ivanaamit avatar Aug 23 '24 21:08 ivanaamit

Just checking back on this pull request ... any input from the gem5 community if our analysis of this performance issue in the Minor CPU model is correct?

cbatten avatar Sep 21 '24 22:09 cbatten

Hi,

I'm not sure if I'm able to produce the same result above.

In my trace, the CPI for the loads is 1.

Screenshot 2024-09-22 at 12 51 29 AM

The workload I'm running is the same as the above one without the M5OP_INST.

This is the config that I used,

from gem5.components.boards.simple_board import SimpleBoard
from gem5.components.memory import SingleChannelDDR3_1600
from gem5.components.processors.cpu_types import CPUTypes
from gem5.components.processors.simple_processor import SimpleProcessor
from gem5.isas import ISA
from gem5.resources.resource import FileResource
from gem5.simulate.simulator import Simulator
from gem5.components.cachehierarchies.classic.private_l1_private_l2_cache_hierarchy import (
    PrivateL1PrivateL2CacheHierarchy,
)
from pathlib import Path

cache_hierarchy = PrivateL1PrivateL2CacheHierarchy(l1d_size="16MiB", l1i_size="16MiB", l2_size="256KiB")
memory = SingleChannelDDR3_1600(size="32MiB")
processor = SimpleProcessor(cpu_type=CPUTypes.MINOR, isa=ISA.RISCV, num_cores=1)
board = SimpleBoard(clk_freq="1GHz", processor=processor, memory=memory, cache_hierarchy=cache_hierarchy)
board.set_se_binary_workload(FileResource('/home/hn/gem5-1511/1511.bin'))
simulator = Simulator(board=board)
simulator.run()
print(
    "Exiting @ tick {} because {}.".format(
        simulator.get_current_tick(), simulator.get_last_exit_event_cause()
    )
)

hnpl avatar Sep 22 '24 08:09 hnpl

@ayc62 Hello, this PR is being closed due to inactivity, but please feel free to reopen it if you would like to continue working on it. Thank you!

erin-le avatar Feb 27 '25 20:02 erin-le