VexRiscv icon indicating copy to clipboard operation
VexRiscv copied to clipboard

FPGA synthesis of deep combinational logic (layered barrel shifter implementation)

Open jeras opened this issue 2 years ago • 4 comments

This is not a bug, I would just like to discuss some of my findings. I choose VexRiscv, since it seems to be the the most prominent FPGA implementation of RISC-V. I did not synthesize VexRiscv myself yet, mostly because it needs many Scala related dependencies and my Ubuntu is already messy.

I implemented a 2-stage RISC-V with CPI=1, almost all the logic is combinational, only about 32+5+5+1+14 registers are needed (XLEN for writeback buffer, 5 for destination GPR address, 1 for GPR write enable, 14 for PC, 5 writeback mux opcode'length). I am running synthesis for the Xilinx Artix (Vivado) and Intel/Altera Cyclone V (Quartus) devices using latest free vendor tools. Most of the following text will be about Xilinx, since I spent a bit more time there. I know the numbers of LUT and ALM can't be compared directly, so I will talk about relative gains due to a change. First I had to remove some convoluted control signal encoding choices and use native opcode/func3 encoding this got me from 900LUT to about 600LUT, I will not talk further about it here, it is a bit embarrassing.

I was still about 40% away from the target, so I had a look at the netlist (actually I looked at the hierarchy in Vivado). I noticed a lot of the combinational logic is pushed to the end of the stage by the synthesis tool. I expected to see something similar to my write back multiplexer, but instead I saw something much, much larger.

Apparently the main culprit was the barrel shifter. If written in the simple form (I already used bit reverse for left shifts):

assign out = in >> amount;

Than out[0] is implemented as a multiplexer for all bits from in[31:0]. The next bit out[1] is multiplexed from a one bit shorter vector in[31:1] and so on, so some kind of triangular structure is created. To me it appears like synthesis tools does not infer a dedicated structure for a barrel shifter, like there is an adder with carry propagation. Instead each output is calculated directly from inputs without reusing intermediate results.

This consumed a lot of LUTs and also affected timing (I am not sure whether it was due to congestion, or the ALU adder path was not given priority in the final multiplexing).

My first change was to add a keep attribute to the barrel shifter output, before it was multiplexed with arithmetic and logic operations at the output of the ALU. This provided some area and timing improvements.

Than I changed the synthesis configuration to optimize area instead of timing, this seems to avoid pushing combinational logic forward closer to registers, instead more of the code written serially is synthesized as is, more intermediate signals are kept. For a multistage CPU there are intermediate signals forced by registers, in my CPU I need a different mechanism.

In the next step I explicitly wrote the barrel shifter to only process one bit of amount[4:0] at a time, similar to how his is explained in the bit-manipulation extension documentation.

I found out that for Cyclone V gives best results when 1 bit of amount is used in each layer: https://github.com/jeras/rp32/blob/master/hdl/rtl/core/r5p_alu.sv#L291-L326 The same approach would probably be a good choice for LUT4 devices.

On Xilinx Arty with LUT6 a different approach taking 2 bit amount provides better results: https://github.com/jeras/rp32/blob/master/hdl/rtl/core/r5p_alu.sv#L291-L326 Well at least in some builds with extensive keep use this proved better than the above single amount bit structure, but in my latest builds the single amount bit version was better.

In the end I removed all keep attributes for Vivado, but kept one for Quartus. I will probably add a few back, but it is difficult (time consuming with a change/synthesize/check repeat loop) to figure out which placement would give the best results, some kind of AI tool would be needed here.

The point of this post would be:

  1. A layered barrel shifter gives better results than a trivial implementation!
  2. Adding keep attribute at strategic points might improve synthesis.
  3. I have to automate my build system to handle various parameter configurations.

jeras avatar Apr 10 '22 11:04 jeras

Hi ^^

Thanks for the issue :D

I noticed a lot of the combinational logic is pushed to the end of the stage by the synthesis tool.

Yes, i also noticed that VIvado was heavily modifying some combinatorial path compared to the way they where expressed in the HDL, sometime in some very dump ways. I had to add a few KEEP attributes in a few places to improve timings.

        val sel = in UInt(log2Up(width) bits)
        val din = in Bits(width bits)
        val dout = out(din >> sel)

So, it seems Vivado use luts as Mux 4->1, and implement a 32 bits shift into 3 full layers + some scrap (71 lut total) image

Seems that timing speaking it isn't bad, it may use a few lut too much ? But, it is a isolated test, would it be possible that vivado kind of merged some other parts of your design into the barrel shifter and messed it up badly ?

Do you remember how badly the barrel shifter was implemented by vivado in your test cases ?

Dolu1990 avatar Apr 11 '22 09:04 Dolu1990

I am unable to reproduce the advantage of 2-bit per layer barrel shifter fitting into lut6 using Vivado. Compared to the last time I have reduced utilization significantly. I will have to spend some time looking at the netlist.

I would like to point out another optimization that fits into LUT4. All logical operations (AND/OR/XOR) can be implemented inside a single XLEN long LUT4 array. Each LUT4 in the array would have the two GPR as inputs and 2 bits from func3 to select the logic function. On Quartus it was enough for me to write a separate signal:

// this can be implemented with a single LUT4
always_comb
unique case (ctl.i.alu.f3)
  // bitwise logical operations
  AND    : log_val = log_op1 & log_op2;
  OR     : log_val = log_op1 | log_op2;
  XOR    : log_val = log_op1 ^ log_op2;
  default: log_val = 'x;
endcase

On Vivado It worked after I changed optimization settings toward area. It seems you already did this, but do not read Scala well.

Timing wise it also make sense to mux logic results early, since paths are much shorter compared to the adder carry chain. Similarly a barrel shifter can have shorter paths compared to the adder, so it makes sense to use the most dense implementation.

jeras avatar Apr 11 '22 17:04 jeras

I am unable to reproduce the advantage of 2-bit per layer barrel shifter fitting into lut6 using Vivado.

To be clear, by 2 bits per layer i mean 2 bits of the shift amount resolved per layer => 4->1 muxes

I would like to point out another optimization that fits into LUT4.

Ahh right. I think what i tried to do with the VexRiscv implementation is to relay the ADD_SUB / SLT paths as much as possible, to improve timings https://github.com/SpinalHDL/VexRiscv/blob/master/src/main/scala/vexriscv/plugin/IntAluPlugin.scala#L96

In general, VexRiscv is more tunned toward FMax than Area, aswell than not realy tunned to fit a particular LUT input count :)

Dolu1990 avatar Apr 13 '22 08:04 Dolu1990

An interesting thing: in an FPGA it can -- counterintuitively -- be advantageous to use multiplier blocks to implement the barrel shifter. I have experimented with using two 16x16 → 32 multiplier blocks (each implementing a 16 → 31 shift) plus two layer of LUT logic to implement shift, rotate, sign extend, zero extend, and the single-bit bit manipulation (Zbs) instructions. However, my implementation was for another project and so is in SystemVerilog and not Scala/SpinalHDL, and I do yet have permission to release the code (working on it.)

Note that shift right, sign extend, zero extend, and bit extract are all just special cases of the general operation of bitfield extraction with or without sign extension. Shift left is equivalent to rotate with a subsequent mask.

The signed bitfield extraction operation requires a MUX to extract the sign bit (parallel with the rotator); in combination with a 0/1/pass/invert operator (a single 3-input LUT) this can produce zero for left shift or unsigned extract, or the target bit value for the bit manipulation instructions.

The final step is a 3-way MUX between the lower shifter, the upper shifter, and the modified sign bit. The lower/upper control bit is simply the MSB of the normalized shift count; the shifter vs sign bit MUX is, however, a bitmask that depends both on the shift count and the specific operation. However, this bitmask, too, can be calculated in parallel with the shifter array.

hpax avatar Feb 23 '23 17:02 hpax