calyx
calyx copied to clipboard
Smart seq split
Similar to duplication, the goal with splitting a seq block is ultimately to reduce fanout from the one register that normally controls a sequential schedule. Initially, we tried to transform:
seq { A; B; C; D; E; F; }
into:
seq {
@new_fsm seq {A; B; C;}
@new_fsm seq {D; E; F;}
}
The children schedules generated control registers like so:
@generated fsm0 = std_reg(2);
@generated fsm1 = std_reg(2);
which was the goal, except, with this, the following groups and assignments would be generated:
group tdcc0 {
A[go] = ...
...
fsm0.in = fsm0.out == 2'd0 & A[done] ? 2'd1; // line 1
...
}
group tdcc1 {
D[go] = ...
...
D[go] = fsm1.in = fsm0.out == 2'd0 & D[done] ? 2'd1; // line 2
...
}
Lines 1 and 2 are pretty similar; they each check the current state of the FSM register and ensure some other group has finished, and then they each update their register's value with a new value (the same value for each register!). Once we got synthesis results from this "new-fsm-insertion" method, we saw that WS decreased and LUT usage increased; we suspected it had something to do with the fact that we were duplicating the logic to transition FSM states, since both fsm0 and fsm1 have identical transition conditions (up to the names of the groups they are controlling) and new values.
So, we decided to open up an option in TDCC that lets a seq block be controlled by a parent register, a child register, and duplicated versions of each register (that agree with their respective original at each cycle). We can hopefully, therefore, get the benefits of reducing fan-out, while also ensuring that we reuse logic wherever we can. Here's what the tdcc group will look like for the above control block:
group tdcc {
A[go] = !A[done] & parent0 == 1'd0 & fsm0 == 2'd0;
... // notice how the enable queries are split among the two sets of registers. that's for fan-out reduction
F[g] = !F[done] & parent1 = 1'd1 & fsm1 = 2'd2;
... // notice how transition logic is shared and not duplicated
fsm0.in = A[done] & parent0 == 1'd0 & fsm0 == 2'd0 ? 2'd1;
fsm1.in = A[done] & parent0 == 1'd0 & fsm0 == 2'd0 ? 2'd1;
... // transition logic for parents isn't shown
}
In short, the idea is duplication, but with an emphasis on making sure the registers reuse logic to update themselves. Benchmarking + synthesis results are in progress.
@parthsarkar17 what is the status of this PR and what is blocking it from being merged?
@rachitnigam, @calebmkim and I were deciding whether this splitting technique was "better" than duplication or not, given that they were pretty similar. We were also considering separating the idea of offloading onto a child register from the idea of creating two, identical children and parent registers (which is what this branch's code does).
Originally, the idea would be that for seq schedules, "smart splitting" would always be better than duplication. But, that benefit isn't clear right now (but there are benefits). For now, I think what could be good is to introduce another tdcc option for "offloading with duplication" and then merge this branch. I'll get on that
@parthsarkar17, that makes sense! I think we should perhaps consider making a page in docs.calyxir.org that describes the various options and links to the experiments? My long-term worry is that all of these options will be impossible to understand without the context of specific experiments and rationale around them.
Yeah, agreed, it seems like it can quickly get out of control. That's a good plan! I'll get on it
Thanks for the PR + detailed explanation. One high level question: it seems like this PR is doing two things to splitting:
- It is duplicating the registers (both the parent and children)
- it is using the same register(s) for both children, AFAICT
I'm wondering where the benefit is coming from? Is it (1), (2), or a combination of both?
To explain what I mean, I've attached a drawing to show how I understand of each thing (please correct if I'm misunderstanding): each color is a register (if a color appears twice, then that means the same register is being used): this is building off your example, lmk if it makes sense.
Before merging this, I think having a comparison of performance for (1) vs. (2) vs. (3) vs. (4) would be helpful. Sorry for being so picky; I just want to avoid the problem you guys have already identified: these options are becoming too difficult to understand without specific context; it's also getting difficult to understand where the benefits in performance are coming from.
Lmk if my drawings make sense, or if you want to have a synchronous discussion.
@parthsarkar17 are we still planning to merge this PR?
This pull request has not seen activity in 14 days and is being marked as stale. If you're continuing work on this, please reply and state how to get this PR in a mergeable state or what issues it is blocked on. If the PR is not ready for review, please mark it as a draft. The PR will be closed in 7 days if there is no further activity.
This stale PR is being closed because it has not seen any activity in 7 days. If you're planning to continue work on it, please reopen it and mention how to make progress on it.
@parthsarkar17 marking this as a draft so StaleBot doesn't close it for now
This pull request has not seen activity in 14 days and is being marked as stale. If you're continuing work on this, please reply and state how to get this PR in a mergeable state or what issues it is blocked on. If the PR is not ready for review, please mark it as a draft. The PR will be closed in 7 days if there is no further activity.
@parthsarkar17 thoughts ?
This is probably too far behind to be worth merging. If it works better, I just copied the changes locally and can close this PR.