cva6 Add ALU to ALU bypass in superscalar mode

This PR introduce ALU to ALU bypass to allow dual issue of consecutive ALU instructions with a RAW dependency. The ALUs are now instantiated inside a wrapper module, which provides either a single ALU for the scalar configuration or two ALUs with bypass logic in the superscalar scenario. When two ALUs are instantiated, CPOP is carried out separately to avoid timing degradation.

immagine

At the moment the bypass is activated whenver SuperscalarEn=1. If a dedicated parameter is preferred to decouple the two features I can add it.

May 03 '25 13:05 ricted98

:x: failed run, report available here.

May 03 '25 13:05 github-actions[bot]

:x: failed run, report available here.

May 05 '25 10:05 github-actions[bot]

:x: failed run, report available here.

May 05 '25 10:05 github-actions[bot]

:x: failed run, report available here.

May 05 '25 10:05 github-actions[bot]

:x: failed run, report available here.

May 05 '25 10:05 github-actions[bot]

:x: failed run, report available here.

May 05 '25 17:05 github-actions[bot]

:x: failed run, report available here.

May 07 '25 07:05 github-actions[bot]

[x] Make the ALU-ALU forwarding parametric
[x] Rebase on top of latest master branch

May 14 '25 06:05 ricted98

:x: failed run, report available here.

May 14 '25 09:05 github-actions[bot]

:x: failed run, report available here.

May 16 '25 07:05 github-actions[bot]

:x: failed run, report available here.

May 16 '25 10:05 github-actions[bot]

:x: failed run, report available here.

May 16 '25 13:05 github-actions[bot]

:x: failed run, report available here.

May 16 '25 13:05 github-actions[bot]

:x: failed run, report available here.

May 19 '25 10:05 github-actions[bot]

:x: failed run, report available here.

May 19 '25 13:05 github-actions[bot]

@JeanRochCoulon would it be possible to understand what is failing in the FPGA flow in the CI? I tried to run an FPGA synth locally and it arrives at the end, but I cannot reproduce the CI setup one-to-one.

May 19 '25 15:05 ricted98

A file is not available, this causes the FPGA build fail... I am looking at it.

May 19 '25 21:05 JeanRochCoulon

Before investing too much engineering into this PR, I would like to double check the performance gain. On Dhrystone, the performance gain is 1%, quite low. And what about the Coremark?

May 19 '25 21:05 JeanRochCoulon

I tried to run CoreMark with the same setup as the CI with and without ALU bypass. Without ALU bypass:

CoreMark/MHz 1.0 : 4.07 / GCC13.2.0 -O3 -g -fno-tree-loop-distribute-patterns -static -mcmodel=medany -fvisibility=hidden -nostdlib -nostartfiles -lgcc -funroll-all-loops -ffunction-sections -fdata-sections -Wl,-gc-sections -falign-jumps=4 -falign-functions=16 / UNSPECIFIED() RATIOS:1

With ALU bypass:

CoreMark/MHz 1.0 : 4.24 / GCC13.2.0 -O3 -g -fno-tree-loop-distribute-patterns -static -mcmodel=medany -fvisibility=hidden -nostdlib -nostartfiles -lgcc -funroll-all-loops -ffunction-sections -fdata-sections -Wl,-gc-sections -falign-jumps=4 -falign-functions=16 / UNSPECIFIED() RATIOS:1

Thus, there is a CoreMark improvement of ~4.17%.

May 19 '25 21:05 ricted98

Thank you, we see Coremark results are highly better than Dhrystone. But I again fear that the RTL modification is big in relation to the performance increase. With @cathales we tried to find solution to increase again the performance based on your PR. Do you thing the coverage of the fusion of ALU+BRANCH would be possible, this PR should not be far from supporting this.

May 22 '25 08:05 JeanRochCoulon

Thank you for the feedback. The ALU+BRANCH the fusion is possible, but it would likely lead to a long combinational path ALU2 (not branch capable) ->ALU (branch capable) -> BRANCH UNIT. I will need to run some synthesis if we want to go down this path. At the moment the only legal chaining is ALU->ALU2, so this path would not pop up.

Just to clarify, your concern is about excessive RTL changes in terms of changed lines of code or area overhead for limited performance gain?

May 22 '25 08:05 ricted98

:x: failed run, report available here.

May 22 '25 09:05 github-actions[bot]

If ALU2 + ALU + BRANCH does not impact critical path, this PR would be a great improvement for cva6 (to be confirmed on Coremark). If not, let's discuss.

Today code is fully verified in cv32a60x configuration (100% code and functional coverages). Modifying lines impact maturity status, and constraint verification team to continue their work. But if performance is at top, everything is possible !

May 22 '25 12:05 JeanRochCoulon

Thanks for the explanation. I will look into this change and let you know how the PPA is affected.

May 22 '25 17:05 ricted98

:x: failed run, report available here.

May 22 '25 17:05 github-actions[bot]

:x: failed run, report available here.

May 28 '25 16:05 github-actions[bot]

:x: failed run, report available here.

Jun 18 '25 11:06 github-actions[bot]

:x: failed run, report available here.

Jun 18 '25 15:06 github-actions[bot]

:x: failed run, report available here.

Jun 19 '25 09:06 github-actions[bot]

Hi @JeanRochCoulon, @cathales. Sorry for the delay but I finally had some time to look into the suggested change to allow ALU->BRANCH UNIT.

The PR as of now only allows ALU->ALU forwarding, with a somewhat more complex issue logic to reorder instructions on the two alus to allow forwarding. The first ALU is the branch capable one, while the second one is the non-branch capable one.

The changes to allow ALU->BRANCH UNIT forwarding can be found here: https://github.com/ricted98/cva6/commit/c5d9dd03fcd25ebe1bf02f415ee807b107b216f2 (branch rt/alu-bypass-squashed). I am excluding JALR for the moment, as it would require additional muxing.

On CoreMark, I have observed a 1% improvement over the plain ALU-ALU bypass. However, the critical path is now on the ALU->ALU->BRANCH UNIT ->FLUSH path. I can give more detailed synthesis results in the upcoming hours.

Should I add the ALU->BRANCH UNIT commit to this PR as well despite the timing regression?

Jun 19 '25 10:06 ricted98

cva6 cva6 copied to clipboard

Add ALU to ALU bypass in superscalar mode

cva6
cva6 copied to clipboard