25/09/2020: FPGAs are Magic (I).

Open Ravenslofty opened this issue 3 years ago • 1 comments

FPGAs Are Magic I

Any sufficiently advanced technology is indistinguishable from magic.

  • Arthur C. Clarke

FPGAs are magic.

On my Twitter account, I have been posting diagrams of the Logic Parcel in a Hercules Micro HME-M7 and some routing that gets signals into the Logic Parcel, and the reactions were not the best.

So let's build one!

My only real prior knowledge of FPGA architecture is cursory knowledge of the Lattice iCE40 and ECP5 and Xilinx 7 Series, and some more detailed knowledge of the Intel Cyclone V and Hercules HME-M7. I am relatively familiar with the Yosys toolchain, however, and I'll be using that as a synthesis tool.

Implementing a LUT and flop in nMigen

I'm going to skip a lot of the set-up, but I'm using nMigen, the 130nm SkyWater PDK, and OpenSTA.

To achieve anything, we need a LUT and a D flip-flop, and since I am a staunch advocate of ABC9, we need timings, too.

Let's imagine the most boring possible LUT with no carry logic, and the most boring possible D flip-flop with no init or resets.

Since we're experimenting, the LUT design needs to be fairly flexible. ASICs generally use latches instead of flops for storage, as they're smaller, so we'll use latches for our LUT storage.

from nmigen import *
from nmigen.back import verilog


class Lut(Elaboratable):
    def __init__(self, k):
        """Instantiates a `k`-input, single-output LUT"""
        assert k > 1

        self.k      = k

        self.inputs = Signal(k) # LUT read address
        self.output = Signal(1, reset_less=True) # LUT read data

        # We need a write port to (theoretically) load our LUT data from.
        self.lut_wd = Signal(2**k) # LUT write data
        self.lut_we = Signal(1)    # LUT write enable (negative-true)
    
    def elaborate(self, platform):
        m = Module()

        inputs = Signal(self.k, reset_less=True)
        latchout = Signal(2**self.k)

        for i in range(2**self.k):
            # This is the latch primitive for the SkyWater PDK; Yosys can't map this to the cell by itself.
            m.submodules += Instance(
                "sky130_fd_sc_hs__dlxtn_1",
                i_D=self.lut_wd[i],
                i_GATE_N=self.lut_we,
                o_Q=latchout[i]
            )

        # HACK: OpenSTA measures delay between synchronous endpoints, so we need to encase the LUT in flops to time it.
        m.d.sync += [
            inputs.eq(self.inputs),
            self.output.eq(latchout.bit_select(inputs, width=1)),
        ]

        return m


class Dff(Elaboratable):
    def __init__(self):
        """Instantiates a simple D flip-flop"""
        self.d = Signal(1)
        self.q = Signal(1)

    def elaborate(self, platform):
        m = Module()

        m.d.sync += self.q.eq(self.d)

        return m


# Dump LUT as Verilog
with open("lut.v", "w") as f:
    lut = Lut(4) # Adjust as appropriate.
    ports = [
        lut.inputs,
        lut.lut_wd,
        lut.lut_we,

        lut.output
    ]
    f.write(verilog.convert(lut, ports=ports))

# Dump DFF as Verilog
with open("dff.v", "w") as f:
    dff = Dff()
    ports = [
        dff.d,
        dff.q,
    ]
    f.write(verilog.convert(dff, ports=ports))
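
To make the intended behaviour concrete, here is a minimal pure-Python sketch of what the LUT computes, ignoring the latches and the timing flops. The function name `lut_read` and the XOR table are hypothetical illustrations, not part of the nMigen design; the shift-and-mask plays the role of `latchout.bit_select(inputs, width=1)`.

```python
def lut_read(table: int, inputs: int, k: int) -> int:
    """Return the bit of a 2**k-bit truth table selected by a k-bit input value."""
    assert 0 <= inputs < 2**k
    assert 0 <= table < 2**(2**k)
    return (table >> inputs) & 1


# Hypothetical example: a 2-input LUT programmed as XOR.
# Bit i of the table is the output when the inputs, read as an integer, equal i.
XOR2 = 0b0110
xor_outputs = [lut_read(XOR2, i, k=2) for i in range(4)]
print(xor_outputs)
```

In the hardware version, `table` corresponds to the 2**k latch outputs and the selection is done by a tree of multiplexers rather than a shift.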

Using Yosys for ASIC synthesis

Now we need Yosys to perform ASIC synthesis, and this process is...not very well documented, so here's my stab at it. I'm going to use the high-speed SkyWater cell library (sky130_fd_sc_hs), because this is a thought experiment and concerns like area and power usage don't matter to me right now.

I'm using a pretty conservative ABC synthesis script because I want ABC to use the multiplexer cells in the library. More aggressive synthesis tends to break up the structure of the mux.

# Read the cells from the cell library, but treat them as black boxes.
read_liberty -lib /path/to/skywater-pdk/libraries/sky130_fd_sc_hs/latest/timing/sky130_fd_sc_hs__tt_025C_1v80.lib

# Perform generic synthesis.
synth -auto-top -flatten

# Attempt to find cells that could be merged.
share -aggressive

# Clean up the design, removing dead cells and wires.
opt -purge

# Map flops in the design to the cell library.
dfflibmap -liberty /path/to/skywater-pdk/libraries/sky130_fd_sc_hs/latest/timing/sky130_fd_sc_hs__tt_025C_1v80.lib

# Map combinational cells in the design to the cell library, targeting smallest-possible delay.
abc -D 1 -liberty /path/to/skywater-pdk/libraries/sky130_fd_sc_hs/latest/timing/sky130_fd_sc_hs__tt_025C_1v80.lib

# Any undefined bits should be zeroes.
setundef -zero

# Break apart multi-bit wires into multiple single-bit wires.
splitnets

# Remove any wires that are dead because of that.
opt -purge

# Give cells better names 
autoname

# Pretty-print statistics about the design.
stat -liberty /path/to/skywater-pdk/libraries/sky130_fd_sc_hs/latest/timing/sky130_fd_sc_hs__tt_025C_1v80.lib

# Write the result, so that OpenSTA can use it.
write_verilog -noattr -noexpr -nohex -nodec netlist.v

Which prints something like this:

12. Printing statistics.

=== top ===

   Number of wires:                 32
   Number of wire bits:             50
   Number of public wires:          32
   Number of public wire bits:      50
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                 26
     sky130_fd_sc_hs__dfxtp_1        5
     sky130_fd_sc_hs__dlxtp_1       16
     sky130_fd_sc_hs__mux4_1         5

   Chip area for module '\top': 704.894400

Using OpenSTA for timing measurement

And then we need an OpenSTA script to print timing information about it.

If you missed the comment in the nMigen source, OpenSTA measures timing between synchronous endpoints, but LUTs are combinational, so we encase the LUT in flops to measure delays. This makes checking timing a bit messy.

# Basic unit setup.
set_cmd_units -time ps -power W -current mA -voltage V

# If we use a cell that doesn't exist in the library, fail.
set link_make_black_boxes 0

# We are using one timing corner only.
define_corners tt_025C_1v80
read_liberty -corner tt_025C_1v80 /path/to/skywater-pdk/libraries/sky130_fd_sc_hs/latest/timing/sky130_fd_sc_hs__tt_025C_1v80.lib

read_verilog netlist.v

# Instantiate the design.
link_design top

# Create a clock, but we don't actually care about its time period, or its delays to inputs and outputs.
create_clock -name clk -period 0 {clk}
set_input_delay -clock clk 0 {rst inputs lut_wa lut_wd lut_we}
set_output_delay -clock clk 0 {output}

# Then, calculate the timing delays from inputs to output.
report_checks -from "inputs\$1[0]_sky130_fd_sc_hs__dfxtp_1_Q" -to output_sky130_fd_sc_hs__dfxtp_1_Q -digits 3
report_checks -from "inputs\$1[1]_sky130_fd_sc_hs__dfxtp_1_Q" -to output_sky130_fd_sc_hs__dfxtp_1_Q -digits 3
report_checks -from "inputs\$1[2]_sky130_fd_sc_hs__dfxtp_1_Q" -to output_sky130_fd_sc_hs__dfxtp_1_Q -digits 3
report_checks -from "inputs\$1[3]_sky130_fd_sc_hs__dfxtp_1_Q" -to output_sky130_fd_sc_hs__dfxtp_1_Q -digits 3
report_checks -from "inputs\$1[4]_sky130_fd_sc_hs__dfxtp_1_Q" -to output_sky130_fd_sc_hs__dfxtp_1_Q -digits 3
report_checks -from "inputs\$1[5]_sky130_fd_sc_hs__dfxtp_1_Q" -to output_sky130_fd_sc_hs__dfxtp_1_Q -digits 3
report_checks -from "inputs\$1[6]_sky130_fd_sc_hs__dfxtp_1_Q" -to output_sky130_fd_sc_hs__dfxtp_1_Q -digits 3
report_checks -from "inputs\$1[7]_sky130_fd_sc_hs__dfxtp_1_Q" -to output_sky130_fd_sc_hs__dfxtp_1_Q -digits 3

exit

This will give you entries that look like this.

Startpoint: inputs$1[0]_sky130_fd_sc_hs__dfxtp_1_Q
            (rising edge-triggered flip-flop clocked by clk)
Endpoint: output_sky130_fd_sc_hs__dfxtp_1_Q
          (rising edge-triggered flip-flop clocked by clk)
Path Group: clk
Path Type: max

   Delay     Time   Description
-----------------------------------------------------------
   0.000    0.000   clock clk (rise edge)
   0.000    0.000   clock network delay (ideal)
   0.000    0.000 ^ inputs$1[0]_sky130_fd_sc_hs__dfxtp_1_Q/CLK (sky130_fd_sc_hs__dfxtp_1)
   2.430    2.430 ^ inputs$1[0]_sky130_fd_sc_hs__dfxtp_1_Q/Q (sky130_fd_sc_hs__dfxtp_1)
   0.371    2.800 v latchout[160]_sky130_fd_sc_hs__mux4_1_A0/X (sky130_fd_sc_hs__mux4_1)
   0.107    2.908 ^ latchout[160]_sky130_fd_sc_hs__mux4_1_A0_X_sky130_fd_sc_hs__o21ai_1_A2/Y (sky130_fd_sc_hs__o21ai_1)
   0.068    2.976 v $2_sky130_fd_sc_hs__mux4_1_X_A2_sky130_fd_sc_hs__mux4_1_X_A3_sky130_fd_sc_hs__o32ai_1_Y/Y (sky130_fd_sc_hs__o32ai_1)
   0.248    3.223 v $2_sky130_fd_sc_hs__mux4_1_X_A2_sky130_fd_sc_hs__mux4_1_X/X (sky130_fd_sc_hs__mux4_1)
   0.225    3.448 v $2_sky130_fd_sc_hs__mux4_1_X/X (sky130_fd_sc_hs__mux4_1)
   0.000    3.448 v output_sky130_fd_sc_hs__dfxtp_1_Q/D (sky130_fd_sc_hs__dfxtp_1)
            3.448   data arrival time

   0.000    0.000   clock clk (rise edge)
   0.000    0.000   clock network delay (ideal)
   0.000    0.000   clock reconvergence pessimism
            0.000 ^ output_sky130_fd_sc_hs__dfxtp_1_Q/CLK (sky130_fd_sc_hs__dfxtp_1)
  -0.129   -0.129   library setup time
           -0.129   data required time
-----------------------------------------------------------
           -0.129   data required time
           -3.448   data arrival time
-----------------------------------------------------------
           -3.577   slack (VIOLATED)

Here, OpenSTA is printing:

  • "arrival time", which is the time it takes for data to propagate through the network and stabilise.
  • "required time", which is the clock period minus the time an input needs to be stable before the clock edge for the flop to register it (the "setup time").
  • "slack", which is the required time minus the arrival time. If slack is positive, the design should run at this clock frequency; if slack is negative, it likely won't.
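
As a small worked example, the three quantities can be reproduced from the numbers in the report above. The helper names `required_time` and `slack` are hypothetical; the values (zero clock period, 0.129 setup time, 3.448 arrival time) are read off the OpenSTA output.

```python
def required_time(clock_period: float, setup_time: float) -> float:
    """Time by which data must be stable at the capturing flop."""
    return clock_period - setup_time


def slack(required: float, arrival: float) -> float:
    """Positive slack means the path meets timing; negative means it does not."""
    return required - arrival


# Values taken from the report: the clock period is zero on purpose.
required = required_time(clock_period=0.0, setup_time=0.129)
result = slack(required, arrival=3.448)
print(f"required {required:.3f}, slack {result:.3f}")
```

With a zero clock period the slack is always negative, which is why the "VIOLATED" marker in the report can be ignored here.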

We don't care about required time at all (which is why the clock period is zero), as the LUT is a combinational logic element. Arrival time is important, however, as it contains the actual timings for the LUT.

The OpenSTA script reports timings from each input to the LUT output, which is the data we'll need for ABC9, but each report also includes an annoying synchronous delay: flops naturally have a delay from the rising clock edge to when the output changes. This delay (the flop's clock-to-Q delay, which this post calls the flop "arrival time") needs to be excluded from the timings.

The flop arrival time information is this line in the output:

   2.430    2.430 ^ inputs$1[0]_sky130_fd_sc_hs__dfxtp_1_Q/Q (sky130_fd_sc_hs__dfxtp_1)

Which tells us that there is a 2.43 nanosecond delay between the clock edge (the /CLK entry above it) and the flop output (Q) changing.

The total arrival time is in this line in the output:

            3.448   data arrival time

Which tells us there is a 3.448 nanosecond delay between the clock edge and the LUT output changing.

To find the actual delay, we just subtract the flop arrival time from the total arrival time, to get 3.448ns - 2.43ns = 1.018ns input to output delay.
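
If there are many reports to process, the subtraction can be scripted. The following is a rough sketch, assuming the report layout shown above (delay, time, edge marker, pin name) and an ideal clock with zero network delay; the function name `lut_delay` and the abbreviated `REPORT` excerpt are illustrative only.

```python
# Abbreviated excerpt of the OpenSTA report shown above.
REPORT = """\
   0.000    0.000 ^ inputs$1[0]_sky130_fd_sc_hs__dfxtp_1_Q/CLK (sky130_fd_sc_hs__dfxtp_1)
   2.430    2.430 ^ inputs$1[0]_sky130_fd_sc_hs__dfxtp_1_Q/Q (sky130_fd_sc_hs__dfxtp_1)
   0.225    3.448 v $2_sky130_fd_sc_hs__mux4_1_X/X (sky130_fd_sc_hs__mux4_1)
            3.448   data arrival time
"""


def lut_delay(report: str) -> float:
    """Total arrival time minus the launching flop's clock-to-Q delay."""
    clk_to_q = None
    arrival = None
    for line in report.splitlines():
        fields = line.split()
        # The launching flop's output is the first pin whose name ends in "/Q".
        if clk_to_q is None and len(fields) >= 4 and fields[3].endswith("/Q"):
            clk_to_q = float(fields[0])
        # The first "data arrival time" line carries the total arrival time.
        elif arrival is None and line.strip().endswith("data arrival time"):
            arrival = float(fields[0])
    if clk_to_q is None or arrival is None:
        raise ValueError("report does not look like the expected OpenSTA output")
    return round(arrival - clk_to_q, 3)


print(lut_delay(REPORT))
```

The parsing is deliberately simple and would need adjusting if the launching flop were named differently or the clock network were not ideal.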

To put that into perspective, here are the slowest input-to-output LUT delays of some commercial FPGAs, as recorded in Yosys:

  • Xilinx 7 Series (LUT6, 28nm): 0.642ns
  • Intel Cyclone V (LUT6, 28nm): 0.602ns
  • Lattice ECP5 (LUT4, 40nm): 0.379ns
  • Lattice iCE40HX (LUT4, 40nm): 0.449ns
  • Lattice iCE40UP (LUT4, 40nm): 1.285ns
  • Gowin GW1N (LUT4, 55nm): 1.638ns

Rinse and repeat for the size of LUT you're interested in. Don't assume that the timings of the same LUT with an extra input look similar; the resulting network could have different delay characteristics.
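
When changing the LUT size, the list of `report_checks` commands has to match the number of inputs. Here is a minimal sketch that generates them for a `k`-input LUT, assuming the cell names follow the pattern in the script above; the names actually produced by `autoname` may differ between Yosys versions, so check them against your netlist. The function name `report_checks_commands` is hypothetical.

```python
from typing import List


def report_checks_commands(k: int, flop: str = "sky130_fd_sc_hs__dfxtp_1") -> List[str]:
    """Build one OpenSTA report_checks command per LUT input."""
    # Assumed naming pattern, copied from the script above.
    endpoint = f"output_{flop}_Q"
    commands = []
    for i in range(k):
        # The backslash keeps Tcl from treating "$1" as a variable reference.
        startpoint = f"inputs\\$1[{i}]_{flop}_Q"
        commands.append(f'report_checks -from "{startpoint}" -to {endpoint} -digits 3')
    return commands


commands = report_checks_commands(4)
print("\n".join(commands))
```

The printed lines can be pasted into the OpenSTA script in place of the hand-written ones.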

Alternatively, here are some timings I made earlier.

In part II, I'll be going into how these timings can be used in a Yosys FPGA flow to test and measure improvements.

Ravenslofty, Sep 26 '20 00:09

So I made a small update to this, changing from a dlxtp to a dlxtn, because it turns out that a cell with a negative-true enable is more efficient than one with a positive-true enable (which is converted to negative-true through a built-in inverter).

A software idiom is to not care about small efficiencies - the cell change only saves 3 units of area (um^2?) - but especially with larger LUTs you're going to be instantiating a lot of latches, and this change quickly adds up.
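
To see how quickly it adds up, here is a back-of-the-envelope sketch. It assumes the saving is a flat 3 area units per latch (the rough figure above, with the unit unconfirmed) and that a `k`-input LUT uses 2**k latches; the function name `latch_area_saving` is hypothetical.

```python
def latch_area_saving(k: int, saving_per_latch: float = 3.0) -> float:
    """Total area saved across the 2**k storage latches of a k-input LUT."""
    return (2**k) * saving_per_latch


# Savings for LUT sizes from 4 to 8 inputs, in the same (unconfirmed) area units.
for k in range(4, 9):
    print(f"LUT{k}: {2**k} latches, saving {latch_area_saving(k):.0f}")
```

Each extra input doubles the number of latches, so the saving doubles with it.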

This is not actually my first attempt at an FPGA; my very first attempt used a shift register for the LUT, and that was unnecessarily big.

Could this be improved, size-wise? I think doing so would require designing custom cells, but I'm not even going to attempt that.

Ravenslofty, Sep 27 '20 12:09