Doubts regarding the working of Olympia: issue queues and pipe targets.
I am going through the tutorial for the Olympia RISC-V performance modelling project, but I am having some difficulty understanding the structure of the cores, pipelines and execution units.
-
First of all, how are small_core, medium_core and big_core different? I checked their yaml definitions, and the first thing I noticed was that they are 2_wide, 3_wide and 8_wide respectively. What does the x in x_wide specifically mean here? Does it mean that an x_wide machine has a pipeline with x stages? Does it mean that the machine has x execution units (for example 6 ALUs and 2 branch units)? Or does it mean something else?
-
Then, going a few lines down in the yaml file, the pipe targets for each execution unit are set. What does this signify? My understanding is that it maps each type of instruction to the type of execution unit required to carry it out; for example, a simple addi instruction may be mapped to a simple "int" unit, and so on. Is my understanding correct?
-
Going further down, the number of units per queue is set. This is the code -
issue_queue_to_pipe_map:
[
["0"], # iq0 -> exe0
["1"], # iq1 -> exe1
["2"], # iq2 -> exe2
["3"], # iq3 -> exe3
]
What does "issue queue" here signify? What kind of a queue is this?
I am familiar with a 5-stage RISC pipeline, but I am having some difficulty understanding the pipeline stages as defined in the Sparta modelling framework. For example, what do terms like "issue queue" and "pipe targets" even mean? I went through the Sparta core example, but I still couldn't understand the terms mentioned above.
Can someone explain it, or share some resources to get a better understanding?
Hi @dragon540,
- The 2, 3, and 8 wide definitions refer to the fetch, decode, rename, dispatch, and retire widths, i.e. how many instructions can be fetched, decoded, and so on per cycle (a rough sketch of how this shows up in the core yaml files is the first snippet after this list).
- Yes, your assumption is correct. A target pipe is a type of instruction grouping, and the groupings include: integer, floating point, branch, vector, and load/store. Each instruction in the ISA maps to one of those groupings. Additional groupings can also be defined; for vector, for example, you can have vint, vlsu, and groupings for complex vector instructions.
- An issue queue is a queue used to hold instructions before the execution stage. In this design, issue queues hold instructions before they are fed into the execution units assigned to them. So we could have one issue queue that feeds into 2 ALUs, meaning every cycle 2 instructions can be popped out. The main reason we have issue queues is to be able to map certain target pipe instructions to certain ALUs. If, let's say, we have a bottleneck around integer operations, we might set up a dedicated integer ALU whose issue queue is only filled with integer operations, so it never gets blocked behind a slow divide operation. A minimal sketch of that idea is the second snippet after this list.
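To give a feel for the width parameters, here is roughly how a 2-wide core would express them in its yaml. The parameter names below are from memory and may not match the repo exactly, so treat this as an illustration and check the actual arches/*.yaml files for the real names:
top.cpu.core0:
  fetch.params.num_to_fetch:       2   # fetch width
  decode.params.num_to_decode:     2   # decode width
  rename.params.num_to_rename:     2   # rename width
  dispatch.params.num_to_dispatch: 2   # dispatch width
  rob.params.num_to_retire:        2   # retire width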
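And here is a minimal, made-up two-queue configuration in the same yaml format as the snippets in this thread (not taken from any real arches/*.yaml file), showing the dedicated integer ALU idea:
pipelines:
[
["int"],        # exe0 - dedicated integer ALU, never stuck behind a divide
["int", "div"]  # exe1 - handles both int and (slow) div operations
]
issue_queue_to_pipe_map:
[
["0"],          # iq0 -> exe0 : only plain int ops wait here
["1"]           # iq1 -> exe1 : int and div ops wait here
]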
Explaining the meaning of this:
pipelines:
[
["sys"], # exe0
["int", "div"], # exe1
["int", "mul"], # exe2
["int", "mul", "i2f", "cmov"], # exe3
["int"], # exe4
["int"], # exe5
["float", "faddsub", "fmac"], # exe6
["float", "f2i"], # exe7
["br"], # exe8
["br"], # exe9
["vint", "vset", "vdiv", "vmul"]
]
issue_queue_to_pipe_map:
[
["0", "1"], # iq0 -> exe0, exe1
["2", "3"], # iq1 -> exe2, exe3
["4", "5"], # iq2 -> exe4, exe5
["6", "7"], # iq3 -> exe6, exe7
["8", "9"], # iq4 -> exe8, exe9
["10"] # iq5 -> exe10
]
The above is for the "big_core" design. So if you take the first line of "issue_queue_to_pipe_map": ["0", "1"] means look at indexes 0 and 1 of the "pipelines" definition above, which state ["sys"] and ["int", "div"]. This means that issue queue 0 can take the target pipes sys, int, and div. It also says that execution unit 0 is for sys instructions and execution unit 1 is for int and div (when we say execution unit we mean ALU unit). This goes on for each line of the issue queue map, so for each issue queue we're defining what target pipes it holds and what target pipes each execution unit can operate on. All this complexity exists to better utilize resources and avoid bottlenecks: instead of having every ALU handle everything, we can define certain ALUs according to the workload profile we are optimizing for.
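To make the routing concrete, and assuming I'm reading the big_core mapping above correctly, a few example instructions would flow like this (written as comments in the same style as the yaml):
# div    : target pipe "div"     -> only exe1 lists "div"     -> exe1 belongs to iq0 -> dispatched to iq0, issued to exe1
# fadd   : target pipe "faddsub" -> only exe6 lists "faddsub" -> exe6 belongs to iq3 -> dispatched to iq3, issued to exe6
# branch : target pipe "br"      -> exe8 and exe9 list "br"   -> both belong to iq4  -> dispatched to iq4, issued to either one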
@klingaard please feel free to correct me if I got anything wrong.
I'm going to close this issue as resolved. If further discussion is required, please open a discussion and not an issue.