Reducing control fan-out
Control signals in Calyx programs often have high fan-out. For example, for a par statement with n children, the go signal of the group that implements the par has a fan-out of n.
With the ntt pipelines that execute a lot of simple groups in the same par block (and even larger systolic arrays), this will quickly become a problem.
A possible solution is trading off latency for fan-out by inserting registers to forward the control signals:
- Given a par block with n children, instantiate two control registers and connect the par's go signal to each register.
- Partition the children into two groups of n/2 control statements. Each half of the control statements gets its go signal from one of the two registers.
This slows down the par block by one cycle but reduces fan-out by a factor of two. In general, given a maximal fan-out of m (specified by an attribute, the target, or a compiler flag), this pass can spend about log_m n extra cycles to break up the control signal.
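To make the cycle count concrete, here is a minimal Python sketch of the partitioning idea. It is not part of the Calyx compiler. The names `extra_register_levels`, `split`, `build_tree`, and `depth` are hypothetical, and the sketch assumes every driver (the original go signal or a forwarding register) may feed at most m sinks. Under that assumption the number of extra levels is the smallest d with m^(d+1) >= n, which grows like log_m n.

```python
from typing import Any, List


def extra_register_levels(n: int, m: int) -> int:
    """Smallest d such that m**(d+1) >= n (assumes each driver feeds <= m sinks)."""
    if n < 1 or m < 2:
        raise ValueError("need n >= 1 and m >= 2")
    levels, reach = 0, m
    while reach < n:
        reach *= m      # one more register level multiplies the reach by m
        levels += 1
    return levels


def split(items: List[Any], parts: int) -> List[List[Any]]:
    """Split items into `parts` contiguous chunks whose sizes differ by at most one."""
    q, r = divmod(len(items), parts)
    out, start = [], 0
    for i in range(parts):
        size = q + (1 if i < r else 0)
        out.append(items[start:start + size])
        start += size
    return out


def build_tree(children: List[Any], m: int) -> List[Any]:
    """Nest children so that no level has more than m entries.

    Each nested list stands for one forwarding register and the children it drives.
    Leaves are assumed not to be lists themselves.
    """
    if m < 2:
        raise ValueError("need m >= 2")
    if len(children) <= m:
        return list(children)
    return [build_tree(chunk, m) for chunk in split(children, m)]


def depth(tree: List[Any]) -> int:
    """Number of nesting levels, i.e. extra cycles spent forwarding the signal."""
    if tree and isinstance(tree[0], list):
        return 1 + max(depth(sub) for sub in tree)
    return 0
```

For example, `build_tree(list(range(8)), 2)` returns `[[[0, 1], [2, 3]], [[4, 5], [6, 7]]]`, and both `depth` of that tree and `extra_register_levels(8, 2)` give 2.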
This is partly inspired by a conversation with @zhangzhiru on control pipelining. That problem is harder because they need to forward the signal within the context of pipelines. However, I think this could be a good basis for giving frontends or compiler toolchains a way to guarantee the synthesizability of Calyx designs.
This is a good idea IMO. Just a couple of disconnected thoughts:
- Add this to the list (with resource sharing & register minimization) of passes that ideally want some sort of technology-specific cost model as a heuristic guide. Especially if the "register tree" involved needs a configurable width & depth.
- Like some of those other passes, it would be nice to have a way to assess the need for it by measuring something about the unoptimized design. Are there convenient ways to find bad fan-outs in a netlist and to blame synthesis/timing failures on them? (I don't know the answer to this.)
Experiment to see if fan-out problems can be fixed using this:
- Write a generator for high fan-out programs (with a variable to control the fan-out)
- Run synthesis with increasing values of that variable until the design fails timing
- See if increasing nesting fixes this
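A generator for the first step could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the function names (`chunk`, `render_par`, `generate`), the choice of one `std_reg` write per group, and the emitted text, which is modeled on Calyx syntax but has not been checked against any particular compiler version. Here `n` controls the fan-out of the par block, and `max_fanout` optionally nests the children into sub-par blocks for the third step.

```python
from typing import List, Optional


def chunk(items: List[str], parts: int) -> List[List[str]]:
    """Split items into `parts` contiguous chunks whose sizes differ by at most one."""
    q, r = divmod(len(items), parts)
    out, start = [], 0
    for i in range(parts):
        size = q + (1 if i < r else 0)
        out.append(items[start:start + size])
        start += size
    return out


def render_par(groups: List[str], max_fanout: Optional[int]) -> str:
    """Render a flat par block, or a nested one when max_fanout is set and exceeded."""
    if max_fanout is None or len(groups) <= max_fanout:
        return "par { " + " ".join(g + ";" for g in groups) + " }"
    inner = " ".join(render_par(c, max_fanout) for c in chunk(groups, max_fanout))
    return "par { " + inner + " }"


def generate(n: int, max_fanout: Optional[int] = None) -> str:
    """Emit a program with n simple groups run in one par block (assumed syntax)."""
    if n < 1:
        raise ValueError("need n >= 1")
    if max_fanout is not None and max_fanout < 2:
        raise ValueError("max_fanout must be at least 2")
    cells = "\n".join(f"    r{i} = std_reg(32);" for i in range(n))
    groups = "\n".join(
        f"    group g{i} {{\n"
        f"      r{i}.in = 32'd{i};\n"
        f"      r{i}.write_en = 1'd1;\n"
        f"      g{i}[done] = r{i}.done;\n"
        f"    }}"
        for i in range(n)
    )
    ctrl = render_par([f"g{i}" for i in range(n)], max_fanout)
    return (
        'import "primitives/core.futil";\n'
        "component main() -> () {\n"
        f"  cells {{\n{cells}\n  }}\n"
        f"  wires {{\n{groups}\n  }}\n"
        f"  control {{\n    {ctrl}\n  }}\n"
        "}\n"
    )
```

Calling `generate(4)` produces a control block containing `par { g0; g1; g2; g3; }`, while `generate(4, max_fanout=2)` produces `par { par { g0; g1; } par { g2; g3; } }`. The nested form only changes the generated hardware if the compiler does not flatten it again, so collapse-control would need to be disabled for the comparison.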
We also realized that we don't need a whole new pass to do this. We can just disable/undo the effect of collapse-control and let the compilation for par { par { ... }; par { ... } } generate the additional structure.