Feedback Directed Optimization
Feedback- or profile-directed optimizations (FDO) generally work by compiling a program once, profiling it at runtime, and then using the profiled information to optimize the program for the particular workload.
In hardware land, XLS uses an FDO approach: it takes a high-level program and conservatively pipelines it, runs it through the synthesis flow, collects information about slack on various paths, and re-pipelines the design to take advantage of the real results from synthesis.
I think we have a big opportunity to do this well with Calyx. Specifically, we can collect information about which sub-circuits cause congestion, lie on the critical path of the design, and so on, and use it to tune the various knobs exposed by the compiler. The big novelty win would be optimizing the control circuitry; most HLS compilers are incapable of optimizing the FSMs they generate to schedule the program. Calyx, on the other hand, makes it super easy: our FSM generation is compositional for dynamic programs, and we can easily generate more or fewer FSMs without affecting correctness.
Here are a couple of other knobs I think are worth thinking about:
- How many FSMs to allocate for a given sub-program?
- How do static promotion and schedule compaction affect the generated FSMs?
- How does reducing the expressed parallelism in a program enable better FSM generation (#1833)?
- Which signals are connected to a lot of drivers? This is particularly important and hard to debug because it is likely one of the control signals, like `go` or `fsm`, and it is not obvious how to optimize them.
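To make the knob-tuning idea concrete, here is a minimal sketch of an FDO driver loop. Everything in it is hypothetical: the knob names, the candidate values, and especially `score`, which is a stub standing in for "compile with these settings, run the design through synthesis, and parse the achieved worst slack from the report".

```python
import itertools

# Hypothetical knobs from the list above: how many FSMs to allocate for a
# sub-program, and whether to enable static promotion. In a real loop, `score`
# would invoke the Calyx compiler and the synthesis flow; here it is a stub
# that pretends slack peaks at 3 FSMs and improves with static promotion.
def score(num_fsms: int, static_promotion: bool) -> float:
    return -abs(num_fsms - 3) + (0.5 if static_promotion else 0.0)

def tune() -> tuple[int, bool]:
    """Exhaustively try knob settings and keep the best-scoring one."""
    candidates = itertools.product([1, 2, 3, 4], [False, True])
    return max(candidates, key=lambda c: score(*c))
```

An exhaustive sweep is only plausible for a handful of knobs; a real tuner would need to prune the search space or use the profile data to propose settings directly.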
The victory condition for this is being able to take a large set of benchmarks and transparently improve their resource usage and frequency characteristics. We should also take inspiration from projects like AutoBridge, which reason about problems with control signal generation in multi-FPGA designs.
CC @andrewb1999 @sampsyo @EclecticGriffin
This is a super exciting direction! It's obviously part of a much longer discussion, but here are a few high-level points to think through:
- This would have enormous researchy payoff if we had ≥2 different EDA backends. That is, if we could demonstrate that the same program needed to get optimized differently for two different FPGAs, or for an FPGA and an ASIC target. Merely parsing the output from two different such backends would be a heavy engineering lift, of course, but it could still be awesome.
- IMO, there is a lot to do on even the first half of this vision: ingest output from the EDA toolchain and attribute costs to source code. This is maybe not too hard for resource consumption… maybe we can figure out a general way to map a report's module-level resource consumption to Calyx components and then to groups. But it seems harder for timing, i.e., somehow attributing critical paths back to pieces of Calyx source code—I just don't yet know what that attribution would look like. This is not to say that it's impossible or too hard (it's not), just that this would be a big part of the work, even before we get to optimizations.
- Relatedly: I think there is a lot of payoff to be had in merely reporting this stuff to humans as well. Using the feedback in another round of optimizations is great, but just telling the user "here is where the critical paths were" seems immensely useful. So I think a project around this could be structured in two long-term steps: one about just reporting for humans, and one about using that to perform optimizations. We could decide that generating excellent, interpretable, actionable feedback for humans is not worth the effort/reward trade-off, but I do not think we will know that until we can at least do the Calyx-level attribution (which is required for optimizations also).
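One sketch of what the attribution step could look like: walk the hierarchical netlist path from a timing report and match successively shorter prefixes against a table mapping instance names to Calyx source locations. The table here is hypothetical — it assumes the compiler emits (or we reconstruct) a map from mangled instance names to components and groups; the entries are made up for illustration.

```python
# Hypothetical map from netlist instance prefixes to Calyx source locations,
# assumed to be emitted as metadata by the Calyx compiler.
instance_to_calyx = {
    "W_32": "component W_32",
    "W_32/G_3970": "group G_3970 (control FSM)",
    "W_32/SER_3": "component SER_3",
}

def attribute(netlist_path: str) -> str:
    """Return the most specific known Calyx location for a netlist path."""
    parts = netlist_path.split("/")
    # Try the longest prefix first, then progressively shorter ones.
    for i in range(len(parts), 0, -1):
        prefix = "/".join(parts[:i])
        if prefix in instance_to_calyx:
            return instance_to_calyx[prefix]
    return "<unattributed>"
```

For example, `attribute("W_32/G_3970/c0/state0/out_reg[0]/C")` would land on the `G_3970` entry, attributing the endpoint of the critical path to that group's FSM. The hard part this sketch elides is producing the map itself, since synthesis tools flatten and rename hierarchy.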
Note from looking at timing reports:
Max Delay Paths
--------------------------------------------------------------------------------------
Slack (MET) : 2.507ns (required time - arrival time)
Source: W_32/G_3970/c0/state0/out_reg[0]/C
(rising edge-triggered cell FDRE clocked by clk {rise@0.000ns fall@2.500ns period=5.000ns})
Destination: W_32/C_2/C_2/PC_2/CONV_15/M9_13/MUL_3/mul_uint8/U0/i_mult/gDSP.gDSP_only.iDSP/inferred_dsp.use_p_reg.p_reg_reg/DSP_A_B_DATA_INST/A[5]
(rising edge-triggered cell DSP_A_B_DATA clocked by clk {rise@0.000ns fall@2.500ns period=5.000ns})
Path Group: clk
Path Type: Setup (Max at Slow Process Corner)
Requirement: 5.000ns (clk rise@5.000ns - clk rise@0.000ns)
Data Path Delay: 2.160ns (logic 0.611ns (28.287%) route 1.549ns (71.713%))
Logic Levels: 4 (LUT3=1 LUT6=3)
Clock Path Skew: 0.009ns (DCD - SCD + CPR)
Destination Clock Delay (DCD): 0.044ns = ( 5.044 - 5.000 )
Source Clock Delay (SCD): 0.035ns
Clock Pessimism Removal (CPR): 0.000ns
Clock Uncertainty: 0.035ns ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
Total System Jitter (TSJ): 0.071ns
Total Input Jitter (TIJ): 0.000ns
Discrete Jitter (DJ): 0.000ns
Phase Error (PE): 0.000ns
Location Delay type Incr(ns) Path(ns) Netlist Resource(s)
------------------------------------------------------------------- -------------------
(clock clk rise edge) 0.000 0.000 r
0.000 0.000 r clk (IN)
net (fo=412, unset) 0.035 0.035 W_32/G_3970/c0/state0/clk
SLICE_X38Y125 FDRE r W_32/G_3970/c0/state0/out_reg[0]/C
------------------------------------------------------------------- -------------------
SLICE_X38Y125 FDRE (Prop_DFF_SLICEL_C_Q)
0.096 0.131 r W_32/G_3970/c0/state0/out_reg[0]/Q
net (fo=21, routed) 0.467 0.598 W_32/SER_3/G_3650/c0/state0/mul_uint8_i_67_0[0]
SLICE_X39Y124 LUT6 (Prop_E6LUT_SLICEM_I1_O)
0.101 0.699 r W_32/SER_3/G_3650/c0/state0/mul_uint8_i_77/O
net (fo=2, routed) 0.145 0.844 W_32/SER_3/G_3650/c0/state0/mul_uint8_i_77_n_0
SLICE_X39Y124 LUT3 (Prop_H5LUT_SLICEM_I0_O)
0.198 1.042 r W_32/SER_3/G_3650/c0/state0/mul_uint8_i_69/O
net (fo=8, routed) 0.465 1.507 W_32/SER_3/G_3650/c0/state0/mul_uint8_i_69_n_0
SLICE_X41Y123 LUT6 (Prop_F6LUT_SLICEL_I0_O)
0.177 1.684 r W_32/SER_3/G_3650/c0/state0/mul_uint8_i_26/O
net (fo=1, routed) 0.223 1.907 W_32/G_3970/c0/state0/prev_reg[5]_1
SLICE_X40Y122 LUT6 (Prop_B6LUT_SLICEM_I5_O)
0.039 1.946 r W_32/G_3970/c0/state0/mul_uint8_i_3/O
net (fo=2, routed) 0.249 2.195 W_32/C_2/C_2/PC_2/CONV_15/M9_13/MUL_3/mul_uint8/U0/i_mult/gDSP.gDSP_only.iDSP/inferred_dsp.use_p_reg.p_reg_reg/A[5]
DSP48E2_X3Y50 DSP_A_B_DATA r W_32/C_2/C_2/PC_2/CONV_15/M9_13/MUL_3/mul_uint8/U0/i_mult/gDSP.gDSP_only.iDSP/inferred_dsp.use_p_reg.p_reg_reg/DSP_A_B_DATA_INST/A[5]
------------------------------------------------------------------- -------------------
(clock clk rise edge) 5.000 5.000 r
0.000 5.000 r clk (IN)
net (fo=412, unset) 0.044 5.044 W_32/C_2/C_2/PC_2/CONV_15/M9_13/MUL_3/mul_uint8/U0/i_mult/gDSP.gDSP_only.iDSP/inferred_dsp.use_p_reg.p_reg_reg/CLK
DSP48E2_X3Y50 DSP_A_B_DATA r W_32/C_2/C_2/PC_2/CONV_15/M9_13/MUL_3/mul_uint8/U0/i_mult/gDSP.gDSP_only.iDSP/inferred_dsp.use_p_reg.p_reg_reg/DSP_A_B_DATA_INST/CLK
clock pessimism 0.000 5.044
clock uncertainty -0.035 5.009
DSP48E2_X3Y50 DSP_A_B_DATA (Setup_DSP_A_B_DATA_DSP48E2_CLK_A[5])
-0.307 4.702 W_32/C_2/C_2/PC_2/CONV_15/M9_13/MUL_3/mul_uint8/U0/i_mult/gDSP.gDSP_only.iDSP/inferred_dsp.use_p_reg.p_reg_reg/DSP_A_B_DATA_INST
-------------------------------------------------------------------
required time 4.702
arrival time -2.195
-------------------------------------------------------------------
slack 2.507
The fo=21 number here specifies that the fanout of W_32/G_3970/c0/state0/out_reg[0]/Q is 21. Again, this information should be extractable by a parser pretty easily. More information, like the split between the logic delay and the routing delay, can be understood by reading this ChatGPT transcript.
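As a proof of concept for "extractable by a parser", a couple of regexes already recover the fanout of every net on the path and the logic/route split from the report above. This is a sketch against the Vivado report format shown here, not a robust parser; real reports vary in whitespace and may contain multiple paths.

```python
import re

# Excerpt of the timing report above (whitespace compressed).
report = """\
net (fo=412, unset)  0.035  0.035  W_32/G_3970/c0/state0/clk
net (fo=21, routed)  0.467  0.598  W_32/SER_3/G_3650/c0/state0/mul_uint8_i_67_0[0]
Data Path Delay:  2.160ns  (logic 0.611ns (28.287%)  route 1.549ns (71.713%))
"""

# Map each net name to its fanout, taken from `net (fo=N, ...)` lines.
net_re = re.compile(r"net \(fo=(\d+), \w+\)\s+[\d.]+\s+[\d.]+\s+(\S+)")
fanouts = {name: int(fo) for fo, name in net_re.findall(report)}

# Pull the logic/route split out of the Data Path Delay summary line.
split_re = re.compile(r"logic ([\d.]+)ns .*? route ([\d.]+)ns")
logic_ns, route_ns = map(float, split_re.search(report).groups())
```

High-fanout nets whose names contain `fsm`/`state0` (like the fo=412 clock-domain net here) are exactly the control signals the first post suggests targeting, so even this much parsing yields a useful human-facing report.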