cva6
cva6 copied to clipboard
Add ALU to ALU bypass in superscalar mode
This PR introduce ALU to ALU bypass to allow dual issue of consecutive ALU instructions with a RAW dependency. The ALUs are now instantiated inside a wrapper module, which provides either a single ALU for the scalar configuration or two ALUs with bypass logic in the superscalar scenario. When two ALUs are instantiated, CPOP is carried out separately to avoid timing degradation.
At the moment the bypass is activated whenver SuperscalarEn=1. If a dedicated parameter is preferred to decouple the two features I can add it.
:x: failed run, report available here.
:x: failed run, report available here.
:x: failed run, report available here.
:x: failed run, report available here.
:x: failed run, report available here.
:x: failed run, report available here.
:x: failed run, report available here.
- [x] Make the ALU-ALU forwarding parametric
- [x] Rebase on top of latest master branch
:x: failed run, report available here.
:x: failed run, report available here.
:x: failed run, report available here.
:x: failed run, report available here.
:x: failed run, report available here.
:x: failed run, report available here.
:x: failed run, report available here.
@JeanRochCoulon would it be possible to understand what is failing in the FPGA flow in the CI? I tried to run an FPGA synth locally and it arrives at the end, but I cannot reproduce the CI setup one-to-one.
A file is not available, this causes the FPGA build fail... I am looking at it.
Before investing too much engineering into this PR, I would like to double check the performance gain. On Dhrystone, the performance gain is 1%, quite low. And what about the Coremark?
I tried to run CoreMark with the same setup as the CI with and without ALU bypass. Without ALU bypass:
CoreMark/MHz 1.0 : 4.07 / GCC13.2.0 -O3 -g -fno-tree-loop-distribute-patterns -static -mcmodel=medany -fvisibility=hidden -nostdlib -nostartfiles -lgcc -funroll-all-loops -ffunction-sections -fdata-sections -Wl,-gc-sections -falign-jumps=4 -falign-functions=16 / UNSPECIFIED() RATIOS:1
With ALU bypass:
CoreMark/MHz 1.0 : 4.24 / GCC13.2.0 -O3 -g -fno-tree-loop-distribute-patterns -static -mcmodel=medany -fvisibility=hidden -nostdlib -nostartfiles -lgcc -funroll-all-loops -ffunction-sections -fdata-sections -Wl,-gc-sections -falign-jumps=4 -falign-functions=16 / UNSPECIFIED() RATIOS:1
Thus, there is a CoreMark improvement of ~4.17%.
Thank you, we see Coremark results are highly better than Dhrystone. But I again fear that the RTL modification is big in relation to the performance increase. With @cathales we tried to find solution to increase again the performance based on your PR. Do you thing the coverage of the fusion of ALU+BRANCH would be possible, this PR should not be far from supporting this.
Thank you for the feedback. The ALU+BRANCH the fusion is possible, but it would likely lead to a long combinational path ALU2 (not branch capable) ->ALU (branch capable) -> BRANCH UNIT. I will need to run some synthesis if we want to go down this path. At the moment the only legal chaining is ALU->ALU2, so this path would not pop up.
Just to clarify, your concern is about excessive RTL changes in terms of changed lines of code or area overhead for limited performance gain?
:x: failed run, report available here.
If ALU2 + ALU + BRANCH does not impact critical path, this PR would be a great improvement for cva6 (to be confirmed on Coremark). If not, let's discuss.
Today code is fully verified in cv32a60x configuration (100% code and functional coverages). Modifying lines impact maturity status, and constraint verification team to continue their work. But if performance is at top, everything is possible !
Thanks for the explanation. I will look into this change and let you know how the PPA is affected.
:x: failed run, report available here.
:x: failed run, report available here.
:x: failed run, report available here.
:x: failed run, report available here.
:x: failed run, report available here.
Hi @JeanRochCoulon, @cathales. Sorry for the delay but I finally had some time to look into the suggested change to allow ALU->BRANCH UNIT.
The PR as of now only allows ALU->ALU forwarding, with a somewhat more complex issue logic to reorder instructions on the two alus to allow forwarding. The first ALU is the branch capable one, while the second one is the non-branch capable one.
The changes to allow ALU->BRANCH UNIT forwarding can be found here: https://github.com/ricted98/cva6/commit/c5d9dd03fcd25ebe1bf02f415ee807b107b216f2 (branch rt/alu-bypass-squashed). I am excluding JALR for the moment, as it would require additional muxing.
On CoreMark, I have observed a 1% improvement over the plain ALU-ALU bypass. However, the critical path is now on the ALU->ALU->BRANCH UNIT ->FLUSH path. I can give more detailed synthesis results in the upcoming hours.
Should I add the ALU->BRANCH UNIT commit to this PR as well despite the timing regression?