09/10/2020: FPGAs are Magic (II)

Open Ravenslofty opened this issue 3 years ago • 2 comments

FPGAs Are Magic II

More boilerplate (but we're getting somewhere!)

This post covers the FPGA-flow boilerplate, as opposed to the ASIC-flow boilerplate of part I.

Our goal is to make the FPGA "good", but how do we quantify "good"?

We could measure the area of an FPGA design by multiplying the number of LUTs needed for it by the die area per LUT, giving us the die area needed to implement that design on the chip. Equally, we could measure the speed of an FPGA by calculating the critical-path delay of a design implemented for it.

I decided to target both: by multiplying the area of the design by its critical-path delay, you get a figure called the delay-area product (DAP), which you can minimise.
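As a rough illustration of that arithmetic, here is a minimal Python sketch. The LUT counts, per-LUT areas and delays are made-up placeholder numbers rather than measurements from this post, and `design_area` and `delay_area_product` are hypothetical helper names.

```python
def design_area(num_luts, area_per_lut):
    """Die area of a design: number of LUTs times the die area of one LUT."""
    return num_luts * area_per_lut


def delay_area_product(num_luts, area_per_lut, critical_path_ps):
    """Delay-area product: design area times critical-path delay."""
    return design_area(num_luts, area_per_lut) * critical_path_ps


# Placeholder numbers: many small LUTs versus fewer, larger LUTs.
many_small = delay_area_product(num_luts=1000, area_per_lut=1.0, critical_path_ps=12000)
few_large = delay_area_product(num_luts=400, area_per_lut=4.0, critical_path_ps=8000)

# The smaller product is the better trade-off under this metric.
print(many_small, few_large)
```

With these placeholder values the faster design still loses, because its extra area outweighs its shorter critical path.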

From Yosys' point of view, area statistics can be calculated from the stat (statistics) command; critical-path delay is a bit trickier. I opted to use the sta (static timing analysis) command from Yosys' eddie/sta branch, which re-uses timing information provided to ABC9 to calculate the critical path. Even though that branch is a little outdated (it was last rebased in May), it's modern enough for our purposes.
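To make "critical path" concrete, here is a minimal Python sketch of the idea: the longest chain of delays through a loop-free netlist. It is only an illustration and not a description of how `sta` is implemented. The netlist, the node names and every delay except the 1018 ps figure are made up, and `critical_path_delay` is a hypothetical helper.

```python
def critical_path_delay(fanin):
    """Longest input-to-output delay in a combinational netlist.

    `fanin` maps each node to {predecessor: delay in ps of the arc into the node}.
    Nodes that never appear as keys are treated as primary inputs arriving at 0.
    Assumes the netlist has no combinational loops.
    """
    arrival = {}

    def arrive(node):
        if node not in arrival:
            preds = fanin.get(node, {})
            arrival[node] = max(
                (arrive(pred) + delay for pred, delay in preds.items()),
                default=0,
            )
        return arrival[node]

    return max((arrive(node) for node in fanin), default=0)


# Made-up three-cell netlist; delays are in picoseconds.
netlist = {
    "n1": {"a": 300, "b": 500},
    "n2": {"n1": 400, "c": 1018},
    "out": {"n2": 200},
}
print(critical_path_delay(netlist))
```

Here the slow `c` input dominates: the path through `c` reaches `n2` later than the two-cell path through `n1`.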

Simulation models

Let's start off by writing simulation models of the LUT and DFF cells. I went up to LUT8s in my timing information, so here's a LUT8. We can simulate smaller LUTs by just tying inputs to constants and limiting the maximum LUT size ABC uses.

I'll call this file rain_sim.v. We'll need to reference it in the synthesis script.

// The `(* abc9_lut=N *)` informs ABC9 that this cell describes the timings for
// a LUT, and the `N` parameter informs ABC9 of its relative area.
// e.g. on a frangible LUT6, you might want to inform ABC9 that a LUT6 uses 
// twice as much relative area as a LUT3 because you can pack two LUT3s together.
(* abc9_lut=1 *)
module rain_lut(input A, B, C, D, E, F, G, H, output Q);

parameter INIT = 0;

// Here's the other important part, a specify block, which describes cell timings
// to ABC9 and sta.
//
// `(A => Q) = N` means "there is a combinational path between A and Q, which has
// a delay of N units". By Yosys convention, these units are picoseconds.
// Note that `=>` is a one-to-one relationship; both sides must be the same width
// (1 bit in this case).
//
// To give a small example, from the last post I calculated a 1.018ns delay between 
// input H and the output. That would look like `(H => Q) = 1018;`.
specify
    (A => Q) = 0; // Fill
    (B => Q) = 0; // these
    (C => Q) = 0; // with
    (D => Q) = 0; // your
    (E => Q) = 0; // timing
    (F => Q) = 0; // information
    (G => Q) = 0; // in
    (H => Q) = 0; // picoseconds
endspecify

// This is pessimistic when it comes to Verilog x-propagation, but it's simple.
assign Q = INIT >> {H, G, F, E, D, C, B, A};

endmodule


module rain_dff(input CLK, D, output reg Q);

// Flops also have timing information, but since we're focusing on LUTs right
// now, it's okay to keep the delays at zero, even though they're unrealistic.
//
// `(posedge CLK => (Q : D)) = N` means "there is a path from D to Q that 
// changes on the positive edge of CLK, which has a delay of N units".
// This is the delay between the clock edge rising and the flop output changing.
//
// `$setup(D, posedge CLK, N)` means "the input at D must be stable for at least
// N units before the positive edge of CLK".
// It is possible to have negative setup times (where the input at D can 
// stabilise after the clock edge), but ABC9 doesn't support these and they will
// be clamped to zero with a warning; it's best to just put the actual value in 
// a comment and leave the field at zero.
specify
    (posedge CLK => (Q : D)) = 0;
    $setup(D, posedge CLK, 0);
endspecify

always @(posedge CLK)
    Q <= D;

endmodule
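
To sanity-check the `assign Q = INIT >> {H, G, F, E, D, C, B, A};` line, here is a small Python model of the same lookup. It ignores Verilog x-propagation entirely, and `lut_eval` and `and2_on_lut8` are hypothetical names used only for this sketch. It also shows a smaller LUT being simulated on the LUT8 by tying the unused inputs to a constant (0 here).

```python
def lut_eval(init, inputs):
    """Model of rain_lut: `inputs` is [A, B, ..., H], with A as the index LSB."""
    index = 0
    for position, bit in enumerate(inputs):
        index |= (bit & 1) << position
    # Shift the truth table right by the index and keep the lowest bit.
    return (init >> index) & 1


# Truth table of a 2-input AND: only entry 3 (A=1, B=1) is set.
AND2_INIT = 0b1000


def and2_on_lut8(a, b):
    """Use the LUT8 as a LUT2 by tying inputs C..H to 0."""
    return lut_eval(AND2_INIT, [a, b, 0, 0, 0, 0, 0, 0])


for a in (0, 1):
    for b in (0, 1):
        print(a, b, and2_on_lut8(a, b))
```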

I named my FPGA "Rain" because Sky...Water...

You may also be wondering why I didn't use the term "fracturable LUT", and the answer is because "frangible" is funnier.

Mapping

ABC9 will produce a netlist which uses Yosys-internal $lut cells. We need to map those to our LUT model, using the Yosys techmap pass that takes in a Verilog file.

I'll call this file rain_map.v.

// This uses an obscure corner of the Verilog standard: escaped identifiers.
// An identifier starting with \ is parsed as an identifier until whitespace.
// This allows you to use special characters like $ and [].
module \$lut (A, Y);

parameter WIDTH = 0;
parameter LUT = 0;

// (* force_downto *) tells Yosys what to do in the event WIDTH is zero and this
// wire is [-1:0]: treat -1 as the MSB index and 0 as the LSB index, producing a
// zero-width wire that is otherwise unachievable in Verilog.
(* force_downto *)
input [WIDTH-1:0] A;
output Y;

// `_TECHMAP_REPLACE_` is a cell name specially recognised by `techmap`.
// It gives this cell the name and attributes of the cell it replaces, which is
// useful in 1:1 mappings like this.
//
// Likewise, `_TECHMAP_FAIL_` is a wire name specially recognised by `techmap`.
// When set to 1, the cell is skipped and this mapper has no effect, which means
// we don't have to map for all WIDTHs.
generate
    if (WIDTH == 1) begin
        rain_lut #(.INIT({128{LUT}})) _TECHMAP_REPLACE_ (.A(A[0]), .B(1), .C(1), .D(1), .E(1), .F(1), .G(1), .H(1), .Q(Y));
    end else
    if (WIDTH == 2) begin
        rain_lut #(.INIT({64{LUT}})) _TECHMAP_REPLACE_ (.A(A[0]), .B(A[1]), .C(1), .D(1), .E(1), .F(1), .G(1), .H(1), .Q(Y));
    end else
    if (WIDTH == 3) begin
        rain_lut #(.INIT({32{LUT}})) _TECHMAP_REPLACE_ (.A(A[0]), .B(A[1]), .C(A[2]), .D(1), .E(1), .F(1), .G(1), .H(1), .Q(Y));
    end else
    if (WIDTH == 4) begin
        rain_lut #(.INIT({16{LUT}})) _TECHMAP_REPLACE_ (.A(A[0]), .B(A[1]), .C(A[2]), .D(A[3]), .E(1), .F(1), .G(1), .H(1), .Q(Y));
    end else
    if (WIDTH == 5) begin
        rain_lut #(.INIT({8{LUT}})) _TECHMAP_REPLACE_ (.A(A[0]), .B(A[1]), .C(A[2]), .D(A[3]), .E(A[4]), .F(1), .G(1), .H(1), .Q(Y));
    end else
    if (WIDTH == 6) begin
        rain_lut #(.INIT({4{LUT}})) _TECHMAP_REPLACE_ (.A(A[0]), .B(A[1]), .C(A[2]), .D(A[3]), .E(A[4]), .F(A[5]), .G(1), .H(1), .Q(Y));
    end else
    if (WIDTH == 7) begin
        rain_lut #(.INIT({2{LUT}})) _TECHMAP_REPLACE_ (.A(A[0]), .B(A[1]), .C(A[2]), .D(A[3]), .E(A[4]), .F(A[5]), .G(A[6]), .H(1), .Q(Y));
    end else
    if (WIDTH == 8) begin
        rain_lut #(.INIT(LUT)) _TECHMAP_REPLACE_ (.A(A[0]), .B(A[1]), .C(A[2]), .D(A[3]), .E(A[4]), .F(A[5]), .G(A[6]), .H(A[7]), .Q(Y));
    end else
    begin
        wire _TECHMAP_FAIL_ = 1;
    end
endgenerate

endmodule


// Yosys' internal flop names get *complicated*, but this one means "a positive-
// edge clocked D flip flop".
module \$_DFF_P_ (input D, C, output Q);

rain_dff _TECHMAP_REPLACE_ (.CLK(C), .D(D), .Q(Q));

endmodule
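
The `{64{LUT}}`-style replication in the `\$lut` mapper works because repeating the small truth table makes the unused upper inputs irrelevant: whatever constant they are tied to, the lookup lands on a copy of the same table. Here is a minimal Python check of that claim for a 2-input XOR; `replicate` and `lut_eval` are hypothetical helpers, not part of Yosys.

```python
def replicate(lut, width, copies):
    """Python equivalent of Verilog `{copies{lut}}` for a (2**width)-bit table."""
    table_bits = 1 << width
    lut &= (1 << table_bits) - 1
    init = 0
    for copy in range(copies):
        init |= lut << (copy * table_bits)
    return init


def lut_eval(init, index):
    """Look up one truth-table entry."""
    return (init >> index) & 1


xor2 = 0b0110                                # 2-input XOR truth table
init = replicate(xor2, width=2, copies=64)   # 64 * 4 = 256 bits, as for WIDTH == 2

# For every setting of all 8 inputs, only the low 2 bits (A, B) affect the result.
unused_inputs_do_not_matter = all(
    lut_eval(init, index) == lut_eval(xor2, index & 0b11)
    for index in range(256)
)
print(unused_inputs_do_not_matter)
```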

Synthesis

Now we need another Yosys synthesis script, but this time for FPGAs instead of ASICs.

# Read the cells from the cell library, including their specify blocks, but treat
# them as black boxes.
read_verilog -lib -specify /path/to/rain_sim.v

# Perform generic synthesis for LUTs. Here, I'm targeting LUT8s, but you can change
# the -lut parameter to change that. More importantly, I'm stopping the `synth` pass
# partway through so we can manually set mapping options.
synth -auto-top -flatten -run :fine -lut 8

# Clean up the design.
opt -full

# Map the flops to the one flop type we currently have: an uninitialised D flip-flop.
# This will break on any design that requires initialised D flip-flops, or D latches, 
# but we'll get to that.
dfflegalize -cell $_DFF_P_ x

# Map the design to LUTs using ABC9.
# -maxlut tells ABC9 the size of the largest LUT in this architecture (8)
# -W is a fudge factor representing interconnect delay, encouraging ABC9 to pack LUTs
# more efficiently.
abc9 -maxlut 8 -W 2000

# Clean up the design again.
opt -full

# It's useful to know the relative sizes of LUTs produced in the design, but since 
# there is one LUT cell in the library (for the sake of brevity), we'll check the 
# width of the Yosys-internal $lut cells before mapping them to rain_lut to obtain
# this information.
stat -width

# Map the $lut cells from ABC9 into our rain_lut cells.
techmap -map rain_map.v

# Run static timing analysis based on the information in the specify blocks to 
# calculate the critical path. Since the timing information is in the rain_lut 
# cells, we need to techmap before running this.
sta
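
The script above hard-codes LUT8. One way to sweep LUT sizes is to generate the script from a template, as in this minimal Python sketch. It only emits the commands already shown above. `make_synth_script` is a hypothetical helper, and the per-size file names (`rain_sim_lut4.v` and so on) are assumptions about how you might store one timing model per LUT size. Actually running Yosys on each script is left out.

```python
def make_synth_script(lut_size, sim_path, map_path, wire_delay_ps=2000):
    """Return the synthesis script above, parameterised on the LUT size."""
    lines = [
        f"read_verilog -lib -specify {sim_path}",
        f"synth -auto-top -flatten -run :fine -lut {lut_size}",
        "opt -full",
        "dfflegalize -cell $_DFF_P_ x",
        f"abc9 -maxlut {lut_size} -W {wire_delay_ps}",
        "opt -full",
        "stat -width",
        f"techmap -map {map_path}",
        "sta",
    ]
    return "\n".join(lines) + "\n"


# One script per LUT size from LUT2 to LUT8 (file names are hypothetical).
scripts = {
    size: make_synth_script(size, f"rain_sim_lut{size}.v", "rain_map.v")
    for size in range(2, 9)
}
print(scripts[4])
```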

You'll probably want to write a script to vary the size of the LUT and change the timings as necessary. Then you can extract the relative LUT sizes from the output of stat and the critical path from sta. I picked a few of the benchmarks from the EPFL combinational benchmark suite and here are my results.

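| Size | Relative Speed | Relative Area | Relative Delay-Area |
|------|----------------|---------------|---------------------|
| LUT2 | 63.63% | 103.76% | 117.36% |
| LUT3 | 78.78% | 113.34% | 102.43% |
| LUT4 | 86.75% | 202.30% | 167.45% |
| LUT5 | 95.78% | 315.72% | 242.48% |
| LUT6 | 86.32% | 566.59% | 476.50% |
| LUT7 | 92.62% | 1008.38% | 828.96% |
| LUT8 | 68.87% | 1842.33% | 1954.12% |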

"Relative Speed" measures how fast the end result can go, so higher is better. We don't want to artificially hobble the chip by using a slow architecture.

"Relative Area" measures the total die area needed to implement that design, so lower is better. We don't want to spend an excessive amount of area for the architecture.

"Relative Delay-Area" is the delay-area product described earlier: it measures how fast the end result is for its area, and lower is better. We want a design which provides the best performance for its area.

From the data, implementing designs using LUT2s is slower and less efficient than LUT3s, because you need a lot of them for the same design. Conversely, using LUT8s is slower and less efficient because it's difficult to use all of a LUT8, and all the logic results in slow switching speeds. Let's discount them both.

The LUT5 is the fastest architecture here, with the LUT7 a close second. This is surprising to me, because commercial FPGAs use LUT6s and LUT4s, and I was expecting these to be more competitive.

The LUT2 is the smallest design, with the LUT3 close behind. This makes sense; they're the smallest LUTs, and going from LUT2 to LUT3 is a significant efficiency boost.

The lack of performance for the LUT2 is punished by the delay-area product, and so the LUT3 is the best there.
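
These three readings can be reproduced mechanically from the results above. The short Python sketch below copies the figures from the results, assuming higher is better for speed and lower is better for area and delay-area.

```python
# (relative speed %, relative area %, relative delay-area %) per LUT size,
# copied from the results above.
results = {
    "LUT2": (63.63, 103.76, 117.36),
    "LUT3": (78.78, 113.34, 102.43),
    "LUT4": (86.75, 202.30, 167.45),
    "LUT5": (95.78, 315.72, 242.48),
    "LUT6": (86.32, 566.59, 476.50),
    "LUT7": (92.62, 1008.38, 828.96),
    "LUT8": (68.87, 1842.33, 1954.12),
}

fastest = max(results, key=lambda size: results[size][0])
smallest = min(results, key=lambda size: results[size][1])
best_delay_area = min(results, key=lambda size: results[size][2])

print(fastest, smallest, best_delay_area)
```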

If I were going to implement an FPGA from that data, I would pick the LUT5. It performs the best, and I think the performance of smaller LUTs will diminish when routing costs come into play.

But there are some other factors to be explored; you can combine smaller LUTs with fast muxes to increase performance, and you can use multiple outputs on large LUTs to reduce area.

In the next post, I will explore these.

Ravenslofty avatar Oct 09 '20 22:10 Ravenslofty

Interesting stuff. There is, as you know, a rich literature on this topic, and this paper https://ieeexplore.ieee.org/document/1281800 seems to agree with your result (disclaimer: I have only read the abstract).

EDIT: I suspect LUT4 and LUT6 are preferred because 4:1 muxes map well to both, but are inefficient on LUT5.

tommythorn avatar Oct 10 '20 00:10 tommythorn

That article is actually for a bit later in the series! I'll cover that when I get to interconnect.

Ravenslofty avatar Oct 10 '20 09:10 Ravenslofty