Feedback Directed Optimization
Feedback- or profile-directed optimizations (FDO) generally work by compiling a program once, profiling it at runtime, and then using the profiled information to optimize the program for the particular workload.
In hardware land, XLS uses an FDO approach: it takes a high-level program and conservatively pipelines it, runs it through the synthesis flow, collects information about slack on various paths, and re-pipelines the design to take advantage of the real results from synthesis.
I think we have a big opportunity to do this well with Calyx. Specifically, we can collect information about which sub-circuits cause congestion, lie on the critical path of the design, and so on, and use it to tune the various knobs exposed by the compiler. The big novelty win would be optimizing the control circuitry; most HLS compilers are incapable of optimizing the FSMs they generate to schedule the program. Calyx, on the other hand, makes it super easy: our FSM generation is compositional for dynamic programs, and we can easily generate more or fewer FSMs without affecting correctness.
Here are a couple of other knobs I think are worth thinking about:
- How many FSMs to allocate for a given sub-program?
- How do static promotion and schedule compaction affect the generated FSMs?
- How does reducing the expressed parallelism in a program enable better FSM generation (#1833)?
- Which signals are connected to a lot of drivers? This is particularly important and hard to debug because it is likely one of the control signals, like `go` or `fsm`, and it is not obvious how to optimize them.
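To make the knob-tuning idea concrete, here is a minimal sketch of an FDO driver loop. Everything in it is hypothetical: the knob names, the candidate values, and especially `score`, which is a stub standing in for "compile with these settings, run the design through synthesis, and parse the achieved worst slack from the report".

```python
import itertools

# Hypothetical knobs from the list above: how many FSMs to allocate for a
# sub-program, and whether to enable static promotion. In a real loop, `score`
# would invoke the Calyx compiler and the synthesis flow; here it is a stub
# that pretends slack peaks at 3 FSMs and improves with static promotion.
def score(num_fsms: int, static_promotion: bool) -> float:
    return -abs(num_fsms - 3) + (0.5 if static_promotion else 0.0)

def tune() -> tuple[int, bool]:
    """Exhaustively try knob settings and keep the best-scoring one."""
    candidates = itertools.product([1, 2, 3, 4], [False, True])
    return max(candidates, key=lambda c: score(*c))
```

An exhaustive sweep is only plausible for a handful of knobs; a real tuner would need to prune the search space or use the profile data to propose settings directly.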
The victory condition for this is being able to take a large set of benchmarks and transparently improve their resource usage and frequency characteristics. We should also take inspiration from projects like AutoBridge, which reason about problems with control signal generation in multi-FPGA designs.
CC @andrewb1999 @sampsyo @EclecticGriffin
This is a super exciting direction! It's obviously part of a much longer discussion, but here are a few high-level points to think through:
- This would have enormous researchy payoff if we had ≥2 different EDA backends. That is, if we could demonstrate that the same program needed to get optimized differently for two different FPGAs, or for an FPGA and an ASIC target. Merely parsing the output from two different such backends would be a heavy engineering lift, of course, but it could still be awesome.
- IMO, there is a lot to do on even the first half of this vision: ingest output from the EDA toolchain and attribute costs to source code. This is maybe not too hard for resource consumption… maybe we can figure out a general way to map a report's module-level resource consumption to Calyx components and then to groups. But it seems harder for timing, i.e., somehow attributing critical paths back to pieces of Calyx source code—I just don't yet know what that attribution would look like. This is not to say that it's impossible or too hard (it's not), just that this would be a big part of the work, even before we get to optimizations.
- Relatedly: I think there is a lot of payoff to be had in merely reporting this stuff to humans as well. Using the feedback in another round of optimizations is great, but just telling the user "here is where the critical paths were" seems immensely useful. So I think a project around this could be structured in two long-term steps: one about just reporting for humans, and one about using that to perform optimizations. We could decide that generating excellent, interpretable, actionable feedback for humans is not worth the effort/reward trade-off, but I do not think we will know that until we can at least do the Calyx-level attribution (which is required for optimizations also).
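One sketch of what the attribution step could look like: walk the hierarchical netlist path from a timing report and match successively shorter prefixes against a table mapping instance names to Calyx source locations. The table here is hypothetical — it assumes the compiler emits (or we reconstruct) a map from mangled instance names to components and groups; the entries are made up for illustration.

```python
# Hypothetical map from netlist instance prefixes to Calyx source locations,
# assumed to be emitted as metadata by the Calyx compiler.
instance_to_calyx = {
    "W_32": "component W_32",
    "W_32/G_3970": "group G_3970 (control FSM)",
    "W_32/SER_3": "component SER_3",
}

def attribute(netlist_path: str) -> str:
    """Return the most specific known Calyx location for a netlist path."""
    parts = netlist_path.split("/")
    # Try the longest prefix first, then progressively shorter ones.
    for i in range(len(parts), 0, -1):
        prefix = "/".join(parts[:i])
        if prefix in instance_to_calyx:
            return instance_to_calyx[prefix]
    return "<unattributed>"
```

For example, `attribute("W_32/G_3970/c0/state0/out_reg[0]/C")` would land on the `G_3970` entry, attributing the endpoint of the critical path to that group's FSM. The hard part this sketch elides is producing the map itself, since synthesis tools flatten and rename hierarchy.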
Note from looking at timing reports:
Max Delay Paths
--------------------------------------------------------------------------------------
Slack (MET) : 2.507ns (required time - arrival time)
Source: W_32/G_3970/c0/state0/out_reg[0]/C
(rising edge-triggered cell FDRE clocked by clk {rise@0.000ns fall@2.500ns period=5.000ns})
Destination: W_32/C_2/C_2/PC_2/CONV_15/M9_13/MUL_3/mul_uint8/U0/i_mult/gDSP.gDSP_only.iDSP/inferred_dsp.use_p_reg.p_reg_reg/DSP_A_B_DATA_INST/A[5]
(rising edge-triggered cell DSP_A_B_DATA clocked by clk {rise@0.000ns fall@2.500ns period=5.000ns})
Path Group: clk
Path Type: Setup (Max at Slow Process Corner)
Requirement: 5.000ns (clk rise@5.000ns - clk rise@0.000ns)
Data Path Delay: 2.160ns (logic 0.611ns (28.287%) route 1.549ns (71.713%))
Logic Levels: 4 (LUT3=1 LUT6=3)
Clock Path Skew: 0.009ns (DCD - SCD + CPR)
Destination Clock Delay (DCD): 0.044ns = ( 5.044 - 5.000 )
Source Clock Delay (SCD): 0.035ns
Clock Pessimism Removal (CPR): 0.000ns
Clock Uncertainty: 0.035ns ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
Total System Jitter (TSJ): 0.071ns
Total Input Jitter (TIJ): 0.000ns
Discrete Jitter (DJ): 0.000ns
Phase Error (PE): 0.000ns
Location Delay type Incr(ns) Path(ns) Netlist Resource(s)
------------------------------------------------------------------- -------------------
(clock clk rise edge) 0.000 0.000 r
0.000 0.000 r clk (IN)
net (fo=412, unset) 0.035 0.035 W_32/G_3970/c0/state0/clk
SLICE_X38Y125 FDRE r W_32/G_3970/c0/state0/out_reg[0]/C
------------------------------------------------------------------- -------------------
SLICE_X38Y125 FDRE (Prop_DFF_SLICEL_C_Q)
0.096 0.131 r W_32/G_3970/c0/state0/out_reg[0]/Q
net (fo=21, routed) 0.467 0.598 W_32/SER_3/G_3650/c0/state0/mul_uint8_i_67_0[0]
SLICE_X39Y124 LUT6 (Prop_E6LUT_SLICEM_I1_O)
0.101 0.699 r W_32/SER_3/G_3650/c0/state0/mul_uint8_i_77/O
net (fo=2, routed) 0.145 0.844 W_32/SER_3/G_3650/c0/state0/mul_uint8_i_77_n_0
SLICE_X39Y124 LUT3 (Prop_H5LUT_SLICEM_I0_O)
0.198 1.042 r W_32/SER_3/G_3650/c0/state0/mul_uint8_i_69/O
net (fo=8, routed) 0.465 1.507 W_32/SER_3/G_3650/c0/state0/mul_uint8_i_69_n_0
SLICE_X41Y123 LUT6 (Prop_F6LUT_SLICEL_I0_O)
0.177 1.684 r W_32/SER_3/G_3650/c0/state0/mul_uint8_i_26/O
net (fo=1, routed) 0.223 1.907 W_32/G_3970/c0/state0/prev_reg[5]_1
SLICE_X40Y122 LUT6 (Prop_B6LUT_SLICEM_I5_O)
0.039 1.946 r W_32/G_3970/c0/state0/mul_uint8_i_3/O
net (fo=2, routed) 0.249 2.195 W_32/C_2/C_2/PC_2/CONV_15/M9_13/MUL_3/mul_uint8/U0/i_mult/gDSP.gDSP_only.iDSP/inferred_dsp.use_p_reg.p_reg_reg/A[5]
DSP48E2_X3Y50 DSP_A_B_DATA r W_32/C_2/C_2/PC_2/CONV_15/M9_13/MUL_3/mul_uint8/U0/i_mult/gDSP.gDSP_only.iDSP/inferred_dsp.use_p_reg.p_reg_reg/DSP_A_B_DATA_INST/A[5]
------------------------------------------------------------------- -------------------
(clock clk rise edge) 5.000 5.000 r
0.000 5.000 r clk (IN)
net (fo=412, unset) 0.044 5.044 W_32/C_2/C_2/PC_2/CONV_15/M9_13/MUL_3/mul_uint8/U0/i_mult/gDSP.gDSP_only.iDSP/inferred_dsp.use_p_reg.p_reg_reg/CLK
DSP48E2_X3Y50 DSP_A_B_DATA r W_32/C_2/C_2/PC_2/CONV_15/M9_13/MUL_3/mul_uint8/U0/i_mult/gDSP.gDSP_only.iDSP/inferred_dsp.use_p_reg.p_reg_reg/DSP_A_B_DATA_INST/CLK
clock pessimism 0.000 5.044
clock uncertainty -0.035 5.009
DSP48E2_X3Y50 DSP_A_B_DATA (Setup_DSP_A_B_DATA_DSP48E2_CLK_A[5])
-0.307 4.702 W_32/C_2/C_2/PC_2/CONV_15/M9_13/MUL_3/mul_uint8/U0/i_mult/gDSP.gDSP_only.iDSP/inferred_dsp.use_p_reg.p_reg_reg/DSP_A_B_DATA_INST
-------------------------------------------------------------------
required time 4.702
arrival time -2.195
-------------------------------------------------------------------
slack 2.507
The fo=21 number here specifies that the fanout of W_32/G_3970/c0/state0/out_reg[0]/Q is 21. Again, this information should be extractable by a parser pretty easily. More information, like the split between the logic delay and the routing delay, can be understood by reading this ChatGPT transcript.
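As a proof of concept for "extractable by a parser", a couple of regexes already recover the fanout of every net on the path and the logic/route split from the report above. This is a sketch against the Vivado report format shown here, not a robust parser; real reports vary in whitespace and may contain multiple paths.

```python
import re

# Excerpt of the timing report above (whitespace compressed).
report = """\
net (fo=412, unset)  0.035  0.035  W_32/G_3970/c0/state0/clk
net (fo=21, routed)  0.467  0.598  W_32/SER_3/G_3650/c0/state0/mul_uint8_i_67_0[0]
Data Path Delay:  2.160ns  (logic 0.611ns (28.287%)  route 1.549ns (71.713%))
"""

# Map each net name to its fanout, taken from `net (fo=N, ...)` lines.
net_re = re.compile(r"net \(fo=(\d+), \w+\)\s+[\d.]+\s+[\d.]+\s+(\S+)")
fanouts = {name: int(fo) for fo, name in net_re.findall(report)}

# Pull the logic/route split out of the Data Path Delay summary line.
split_re = re.compile(r"logic ([\d.]+)ns .*? route ([\d.]+)ns")
logic_ns, route_ns = map(float, split_re.search(report).groups())
```

High-fanout nets whose names contain `fsm`/`state0` (like the fo=412 clock-domain net here) are exactly the control signals the first post suggests targeting, so even this much parsing yields a useful human-facing report.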