Inliner revamp
The component inlining pass has a bunch of limitations:
- Cannot inline instances that are invoked multiple times with different arguments
- Cannot inline `invoke`-`with` clauses
These come from the original implementation (https://github.com/cucapra/calyx/pull/829), which aimed to reduce the amount of IR copying. Toward the end, the copying was needed anyway, so the current implementation makes the wrong trade-off.
We should revamp it to copy things as much as needed to support the two cases above. Specifically, we should make use of the `ir::Rewriter` infrastructure to compute how to rewrite the groups and the control program.
@bcarlet mentioned the potential for implementing more heuristics to automate the decision of inlining more in the future as well. @andrewb1999 mentioned that one of the big missing optimizations in HLS tools is managing FSM sizes. Both of these are unlocked by reimplementing the inliner to be more general.
Just as a reminder of why making inlining work more generally is a good idea: it makes all the other optimizations much more powerful. For example, see the results from @calebmkim's paper:
Fully inlining + resource sharing reduces usage by half!
(Edited to a more tricky example) Trying to implement this, and I'm trying to think of possible edge cases.
Consider the following example:
```
component child (in: 32) -> (out: 32) {
  cells {
    add = std_add(32);
  }
  wires {
    // continuous assignments assigning to add.left and add.right
    add.left = in;
    add.right = in;
  }
  ...
}

component main () {
  cells {
    child_instance = child();
  }
  ...
  control {
    invoke child_instance(in = 32'd2)();
    invoke child_instance(in = 32'd3)();
  }
}
```
How would we inline these `invoke child_instance` statements? This is similar to an `invoke child_instance()() with comb_group` where the `comb_group` performs the addition (we've just moved the comb group to be continuous assignments inside of the `child` component).
The problem with this scenario is that we need `add.left = 32'd2; add.right = 32'd2;` to be active exactly for the control of the `child_instance`. To do this, we would need something like a generalized `with` statement (i.e., you can attach `with` to any control statement).
Yes, this is tricky! If we think about the kind of circuit that is generated when everything is compiled down, we'll end up muxing the inputs to the `in` signal of `child`. Maybe the trick here is that when inlining continuous assignments from a component, we need to generate some muxing logic along with the right signaling infrastructure.
Bigger picture question: should Calyx allow groups to abstractly refer to the control signals that will eventually be used to implement the surrounding FSMs? Is there a nice and composable way of doing that?
Thanks for sketching this up, @calebmkim.
To take your discussion one step farther (envisioning what this would look like with the much-maligned "general `with`" statement)… first, correct me if I'm wrong, but don't continuous assignments have the "magical" property that they are always active, even when the component that contains them isn't running (i.e., its `go` hasn't yet been set to high)? If that's the case, for continuous assignments in particular, it suffices to just copy them from the callee to the caller. So the inlined version could look like this:
```
component main () {
  cells {
    child_instance_add = std_add(32);
    child_instance_in = std_wire(32);
  }
  wires {
    child_instance_add.left = child_instance_in.out;
    child_instance_add.right = child_instance_in.out;
    comb group child_instance_invoke1 {
      child_instance_in.in = 32'd2;
    }
    comb group child_instance_invoke2 {
      child_instance_in.in = 32'd3;
    }
  }
  control {
    with child_instance_invoke1 {
      // copy of child control
    }
    with child_instance_invoke2 {
      // identical copy of child control
    }
  }
}
```
Is that right? Maybe this is exactly the same as your proposal, differing only in that I generated a `std_wire` for the input port to make the copying of the continuous assignments a little simpler.
Anyway, it's pretty weird to imagine how to "fake" this (skip the fictional "general `with`" and get the behavior we are all imagining without it). @rachitnigam's suggested approach, to somehow expose the control signals directly to the structural part of the program, could perhaps work but would require a creative idea w/r/t how to do this in general… maybe one way of looking at this is that port assignments in `invoke` are already working quite a bit like a generalized `with` and it is hard to get this functionality without `invoke`.
An extremely dumb idea would be to create a 1-bit register for each invocation that holds 1 when the invocation is active… this would waste a cycle to set this register, but it would maybe be a stopgap?????
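A rough sketch of that stopgap after inlining (all names hypothetical; `invoke1_active` is the 1-bit activity register for the first invocation, gating the copied continuous assignments):

```
cells {
  child_instance_add = std_add(32);
  invoke1_active = std_reg(1);
}
wires {
  // continuous assignments, gated on the activity register
  child_instance_add.left = invoke1_active.out ? 32'd2;
  child_instance_add.right = invoke1_active.out ? 32'd2;
  group set_active {
    invoke1_active.in = 1'd1;
    invoke1_active.write_en = 1'd1;
    set_active[done] = invoke1_active.done;
  }
  group clear_active {
    invoke1_active.in = 1'd0;
    invoke1_active.write_en = 1'd1;
    clear_active[done] = invoke1_active.done;
  }
}
control {
  seq {
    set_active;   // extra cycle to set the register
    // ... inlined copy of child's control ...
    clear_active; // extra cycle to clear it
  }
}
```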
> correct me if I'm wrong, but don't continuous assignments have the "magical" property that they are always active, even when the component that contains them isn't running (i.e., its `go` hasn't yet been set to high)?
Yes, you're right (so it was a bit misleading for me to compare it to an `invoke with comb group` statement, since they're not exactly the same. I suppose my point was that this example presents a similar challenge to what inlining an `invoke with comb group` statement would do.)
> Is that right? Maybe this is exactly the same as your proposal, differing only in that I generated a `std_wire` for the input port to make the copying of the continuous assignments a little simpler.
Yes, we're on the same page. Your example is pretty much exactly how I would imagine inlining this example if we were to somehow suddenly support a generalized `with` operator.
> An extremely dumb idea would be to create a 1-bit register for each invocation that holds 1 when the invocation is active… this would waste a cycle to set this register, but it would maybe be a stopgap?????
This sounds like it should work-- but it seems like it would have to waste two cycles, since we would have to set the register back to 0 after the invocation, right? Also, if we were to do this, we could use this trick to solve the problem for `invoke with comb group` statements too (the `comb` group assignments can be moved into continuous assignments that are only triggered during the invocation).
However, there's a separate question of whether we would even want to implement something like this, given this latency penalty.
Here's one proposal: we could do some really simple check before doing any inlining: if there is only ever one set of invocation arguments for a given instance (and never any `invoke with comb`), then we do not need to write to the register (we can just inline using the current technique). If there are multiple "sets" of invocation arguments (or `comb` groups), then we will use this "write to a 1-bit register" technique. Obviously, this is still a stopgap, but maybe an ok one?
> Anyway, it's pretty weird to imagine how to "fake" this (skip the fictional "general `with`" and get the behavior we are all imagining without it).
"Faking" it sounds interesting but vague right now; I might spend some time thinking about what it would look like concretely--it may turn out that it's not possible.
Yeah, you're absolutely right about wasting another cycle to set the special 1-bit register to 0. And it's a good idea to special-case the one-invocation case to avoid needing this…
I fully support more thinking about how to "fake" the "general `with`", although I totally agree there's a good chance it's impossible. Maybe it's time to reconsider the general `with` again…
After thinking about it, the general `with` statement seems like it is (probably?) the most natural way to solve our problem.
Here is a summary of my thinking:
- We need some way of being able to activate assignments for exactly some control block.
- We run into problems with dynamic control, since the compiler is free to spend as much time as it wants in between groups, for example. We still need to activate the assignments during this "in between" time, even if no groups are currently executing. Therefore, in order for these assignments to be active, they must be continuous assignments. The problem with continuous assignments is that they are too coarse-grained to let us activate assignments for exactly a control block. The only solution I could think of is continuous assignments that are something like this:
```
child_instance_in.in = // magical guard that is active for a specific control stmt ? 32'd2;
child_instance_in.in = // magical guard that is active for a different specific control stmt ? 32'd3;
```
But it seems to me that this is basically the same thing as a generalized `with` operator.
This leads to the question: what are the downsides of a generalized `with` operator?
One is that we are going to have to think about how it affects all of our existing optimizations (e.g., liveness analysis) and we will have to deal with it in future analyses. What are the other downsides?
Synchronous Meeting Summary
Why this problem matters
- Inlining is a very important optimization, since it is what makes other optimizations possible (e.g., the graphs at beginning of thread). So it is worth it to try to make it really aggressive.
- The 2-cycle latency penalty is not a viable long-term solution for inlining.
Solution to this problem
The solution we have come up with is to explore the general `with`. The reason why is the following: we need some way for assignments to be active exactly for some control (e.g., `child_instance_in.in = 32'd2` for the above design). These assignments either have to be placed in the general `with` or in continuous assignments (some sort of asymmetric `par` is also another solution, but general `with` seems to be a simpler solution).
The main downside to a general `with` is that it makes analysis more difficult because it introduces a bunch of conflicts in assignments. But continuous assignments do this too, and they introduce even more conflicts! In other words, general `with` is actually more informative for analyses than continuous assignments.
Next Steps
- Look at which passes/analyses have to change with a generalized `with`. I'll report back on whether these changes are all doable.
- If we determine it is doable, then the next step is to try and implement it. It should be simple (hopefully) to implement-- you just guard the `comb` group assignments with whichever FSM states the control is active in (we do this in tdcc). For example, `child_instance_in.in = fsm.out >= 3 & fsm.out <= 7 ? 32'd2`.
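Spelled out for both invocations (hypothetical FSM state ranges; assuming tdcc-style compilation assigned states 3–7 to the first inlined copy of the child's control and 8–12 to the second), the comb group assignments would become guarded continuous assignments like:

```
// hypothetical state ranges for the two inlined copies of child's control
child_instance_in.in = fsm.out >= 3 & fsm.out <= 7 ? 32'd2;
child_instance_in.in = fsm.out >= 8 & fsm.out <= 12 ? 32'd3;
```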
(Note that this entire problem is only for dynamic `invoke`s. Static invokes are easy to inline, since you can use a `static par` that activates the continuous assignments for exactly the length of time you want them to.)
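As a sketch of that static case (hypothetical names; assuming the child's control promotes to a 5-cycle static schedule), the inlined version could drive the input with a static group running in lockstep:

```
wires {
  // holds the first invocation's argument for exactly the child's
  // (assumed) 5-cycle latency
  static<5> group drive_in_invoke1 {
    child_instance_in.in = 32'd2;
  }
}
control {
  static par {
    inlined_child_control1; // inlined static control of child, 5 cycles
    drive_in_invoke1;       // active for the same 5 cycles
  }
}
```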
After looking through the codebase for a bit, I think a generalized `with` would be doable. From a quick glance, the things that need to change are the following.
- Domination Analysis: I think `with comb_group { body }` would be treated somewhat similarly to how we treat `par` blocks (i.e., we can almost treat it like `par { comb_group; body }`: neither `comb_group` nor anything in `body` dominates the other, but both are guaranteed to execute, which makes them different from something like `if` branches).
- Live Range Analysis: similar to Domination Analysis, we can treat them kind of like `par` blocks.
- Compaction: the function that builds the dependency graph for a given `seq` takes a `continuous_assigns` parameter, which helps conservatively insert dependencies that respect continuous assigns: for a `seq` that exists within (possibly nested) `with comb_group` statements, we just have to add the assignments of `comb_group` to the `continuous_assigns`.
- `dead-assignment-removal` will get more complicated, since we have to look at the control instead of locally reasoning about groups. (We can always "cheat" a little bit if we wanted to by being overly conservative and treating all comb group assigns like continuous assignments.)
There are probably some other minor things that need to change, but I think it's definitely doable if we want to take a shot.
Thanks for the analysis @calebmkim! We should also define some optimizations that shrink the scope of `with` before these optimizations kick in so as to reduce its overall impact. A couple of rewrites that come to mind:
- If a piece of the control program is not transitively affected (this is important because it includes continuous assignments) by a `with`, it can be moved out of the `with` scope.
  - For `with c { x; y; z }`, if `y` is not affected by `c`, then we rewrite it to `with c { x }; y; with c { z }`.
- If we have `with c { x }` where `x` is a group, we can inline the combinational assignments into the group.

Reducing the scope of `with` guards is good because it also, hopefully, enables us to generate simpler guards for the resulting programs?
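The scope-shrinking rewrite, sketched on a concrete control fragment (using the hypothetical generalized `with`; assume `y` reads nothing driven by the comb group `c`):

```
// before: c's assignments are active for all of x, y, and z
with c {
  seq { x; y; z; }
}

// after: shrink c's scope to just the statements that need it
seq {
  with c { x; }
  y;
  with c { z; }
}
```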
Yep, this all makes sense-- it seems like optimizations to reduce the scope of `with` would help.
> If we have `with c { x }` where `x` is a group, we can inline the combinational assignments into the group.

I'm not quite sure this would necessarily help generate simpler guards (it will probably make the program easier to analyze)---wouldn't the assignments in `c` have the same guard either way (just guarded by the fsm state corresponding to `x`)?
Another Question
Another question: would we want to support a `static with comb_group { body }` statement?
One argument towards "no" is the following: we can just use a `static par { body; comb_group }`.
Also, during `static-promotion`, we can just directly promote `with comb_group { body }` -> `static par { body; comb_group }`.
This allows us to avoid the extra work of supporting an extra operator.
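That promotion, sketched (hypothetical generalized `with`; assume `body` was promoted to a static group `body_s` with latency `n`, and the comb group `cg` to a `static<n>` group `cg_s`):

```
// before static-promotion
with cg {
  body;
}

// after static-promotion: both arms run for exactly n cycles
static par {
  body_s;
  cg_s;
}
```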
On the other hand, there is a weird asymmetry to this, and it's probably not too much work to just add a `static with` operator.
`par<n> { group; comb group }` is pretty weird. I see why this works (because you know exactly how long `group` runs for). Do we actually need a `static with` instead of just allowing `with` to exist in either context?
Maybe we need to redesign the data structures a bit so that we don't stratify `static` and non-`static` control ops and then just error out when non-static is nested in static.