Condensing seq blocks
Discussed this in a meeting w/ @rachitnigam today. Was told to tag @sampsyo as well.
The basic idea for this pass is that, for each child C of a Seq, we try to find a safe place inside a Par block to insert C so that the overall schedule gets shorter. Concretely, this means looking for a Par block where we know at least one child is "waiting" for a sibling to finish executing, and inserting C into a Seq with that "waiting" child. Of course, we have to be careful about each child's dependencies when doing this.
For example, suppose we know that C1 takes 10 cycles, that C2, C3, and C4 each take 2 cycles, and that we have:
```
seq {
  par { C1; C2 }
  C3;
  C4
}
```
Currently, we know C2 will be "waiting" for 8 cycles, since it takes 2 cycles versus C1's 10. If we know that C3 and C4 depend on C2 but not C1, then we can transform this into:
par { C1; seq { C2; C3; C4 } }
Or, if we know that C4 depends only on C2, C3 depends only on C1, and that C4 does not depend on C3, then we can turn this into:
par { seq { C1; C3 }; seq { C2; C4 } }
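To make the "waiting" math concrete, here's a rough Python sketch (not the Calyx IR; the Leaf/Seq/Par classes and the latency/slack helpers are just stand-ins I made up) that models static latencies and computes how long each par child idles:

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Leaf:
    name: str
    latency: int  # statically known cycle count


@dataclass
class Seq:
    children: List["Node"] = field(default_factory=list)


@dataclass
class Par:
    children: List["Node"] = field(default_factory=list)


Node = Union[Leaf, Seq, Par]


def latency(node: Node) -> int:
    """Cycles for a control node: Seq adds its children, Par takes the max."""
    if isinstance(node, Leaf):
        return node.latency
    if isinstance(node, Seq):
        return sum(latency(c) for c in node.children)
    return max((latency(c) for c in node.children), default=0)


def slack(par: Par) -> dict:
    """How many cycles each Par child spends waiting for its slowest sibling."""
    total = latency(par)
    return {c.name: total - latency(c) for c in par.children if isinstance(c, Leaf)}


# The original program: seq { par {C1; C2} C3; C4 } with the latencies above.
before = Seq([Par([Leaf("C1", 10), Leaf("C2", 2)]), Leaf("C3", 2), Leaf("C4", 2)])
# The first rewrite: par { C1; seq {C2; C3; C4} }.
after = Par([Leaf("C1", 10), Seq([Leaf("C2", 2), Leaf("C3", 2), Leaf("C4", 2)])])

print(slack(before.children[0]))        # {'C1': 0, 'C2': 8} -> C2 waits 8 cycles
print(latency(before), latency(after))  # 14 10
```

A real pass would presumably read these latencies off the static abstractions rather than a toy tree like this.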
So, if we want to move a child of a Seq block "back" into a previous Par block, then we have to check which of the previous statements it depends on.
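Here's roughly what that dependency check could look like, assuming we have read/write sets for each child (the sets below are invented just to match the first example; a real pass would get them from the compiler's existing analyses):

```python
def depends_on(a, b):
    """a, b are (reads, writes) pairs; True on a RAW, WAR, or WAW conflict."""
    (a_reads, a_writes), (b_reads, b_writes) = a, b
    return bool((a_writes & b_reads) or (a_reads & b_writes) or (a_writes & b_writes))


# Made-up sets matching the first example's assumptions:
# C3 and C4 depend on C2 (via z) but not on C1.
stmts = {
    "C1": ({"x"}, {"y"}),
    "C2": ({"x"}, {"z"}),
    "C3": ({"z"}, {"w"}),
    "C4": ({"z"}, {"v"}),
}

# To move C3 or C4 "back" into par {C1; C2}, find which par children they depend on.
for moved in ("C3", "C4"):
    blockers = [prev for prev in ("C1", "C2") if depends_on(stmts[prev], stmts[moved])]
    print(moved, "depends on", blockers)
# Both depend only on ['C2'], so they can be sequenced after C2
# while C1 keeps running in parallel.
```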
Alternatively, if we have:
seq { C1; C2; par { C3; C4 } }
If we know that C3 depends on C2 (but C4 does not), and that both C3 and C4 depend on C1, then we could transform this into:
```
seq {
  C1;
  par { seq { C2; C3 }; C4 }
}
```
In this scenario, we are moving C2 "forward" into a future Par block, so we have to check which of the future statements depend on it.
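Same idea in the "forward" direction, again with made-up read/write sets chosen to match this example: partition the later par's children into the ones that must come after C2 and the ones that can sit beside it.

```python
def depends_on(a, b):
    (a_reads, a_writes), (b_reads, b_writes) = a, b
    return bool((a_writes & b_reads) or (a_reads & b_writes) or (a_writes & b_writes))


stmts = {
    "C2": ({"a"}, {"b"}),       # writes b
    "C3": ({"a", "b"}, {"c"}),  # reads b  -> depends on C2
    "C4": ({"a"}, {"d"}),       # never touches b -> independent of C2
}

after_c2 = [n for n in ("C3", "C4") if depends_on(stmts["C2"], stmts[n])]
beside_c2 = [n for n in ("C3", "C4") if not depends_on(stmts["C2"], stmts[n])]
print(after_c2, beside_c2)  # ['C3'] ['C4']  ->  seq { C1; par { seq {C2; C3}; C4 } }
```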
Thinking about it, even for something as simple as
seq { C1; C2 }
if we knew C1 and C2 did not depend on each other, then we could get:
par { C1; C2 }
although I'm not sure whether this transformation is already implemented elsewhere.
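For that fully simplified version, a sketch might look like this (again hypothetical: depends_on and seq_to_par are names I'm making up, and real read/write sets would come from the compiler's analyses):

```python
from itertools import combinations


def depends_on(a, b):
    (a_reads, a_writes), (b_reads, b_writes) = a, b
    return bool((a_writes & b_reads) or (a_reads & b_writes) or (a_writes & b_writes))


def seq_to_par(children):
    """children: list of (name, (reads, writes)).
    Rewrite the seq as a par only if every pair of children is independent
    (the conflict check above is symmetric, so one direction suffices)."""
    independent = all(
        not depends_on(a, b) for (_, a), (_, b) in combinations(children, 2)
    )
    kind = "par" if independent else "seq"
    return f"{kind} {{ " + "; ".join(name for name, _ in children) + " }"


print(seq_to_par([("C1", ({"x"}, {"y"})), ("C2", ({"z"}, {"w"}))]))  # par { C1; C2 }
```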
The advantage of this pass would be to save time. In the first example, the cycle count would go from max(10, 2) + 2 + 2 = 14 down to max(10, 2 + 2 + 2) = 10.
Very interesting! I like the idea of going for the more general par/seq rearrangement as a medium-term goal. At the end, though, you mention a drastically simplified version, which I think could work well as a good first step (and which will require solving a good chunk of the technical problems anyway): just transforming seq into par when it's safe to do so. (This would be the opposite of the existing par_to_seq pass.)
Focusing on this basic version first will let us figure out the dependency issues in a simpler setting, and it will provide a foundation to build on for dealing with fancier par/seq nests. IMO that latter phase would do well to be driven by real benchmarks so we know we're addressing real patterns that come up in practice.
Just a passing thought: there seems to be some general philosophy here about transforming programs so that groups are localized to the parallel threads that affect subsequent groups. In some ways, all programs can be thought of as a bunch of parallel threads, only some of which affect each other. I suspect that once we have synchronized parallelism, we can transform all programs to remove nested parallelism and synchronize only when needed.
@calebmkim was this implemented in the last PR you opened?
No, the most recent PR changes LiveRangeAnalysis and MinimizeRegs so that the MinimizeRegs pass does what the previous MinimizeRegs and ResourceSharing passes both did.
I meant the one before it
Oh, sorry. But in that case the answer is again no: that pass only dealt with transforming par blocks.
@calebmkim @paili0628 do you think this would be a good pass to show off with the new static abstractions? If so, let's add it to the new static milestone.
Yeah, I think this could be helpful, but perhaps not as urgent (we can wait until after we have the frontends settled), since I'm not sure how often this optimization will actually be useful.
Got it! It seems to fall into the bucket of passes that discover parallelism in your program, so maybe it should be wrapped into the broader umbrella of a complex control optimization pass that we've talked about on and off.
Subsumed by #1722