calyx Problems with static FSMs

The top-down-st pass attempts to generate one big, top-level static FSM for programs. However, currently, it has some outright bugs in the presence of conditionals and parallel threads.

Conditional & Loops

For the following program:

zero;
@bound(2) while cond.out {
  if cond0.out { one; } else { three; four }
}

zero, one, and four are the predecessor states for one and three. However, the FSM generated does not have the correct transitions. Assuming each group takes one cycle, it executes one or three in cycle 1 but fails to add a transition to cycle 2 in the case the true branch is executed.

Parallel Threads

The FSM generated currently only transition forward because loops are fully unrolled. However, when attempting to #939, I realized that having any form of back edges will break the FSM for the rest of the threads since they might be expecting different transitions.

Solutions

I've been working on a PR that enforces the following:

if always takes the max(tru, fal) cycles to execute. This may make static timing worse in the case where one branch is much longer. However, it allows rest of the control operations to not worry about having multiple preceding control edges.
par, in presence of while loops, requires separate FSMs because loops require back edges in the control graph.

Mar 07 '22 18:03 rachitnigam

Heads up @mikeurbach: if the CIRCT stuff is generating conditionals with static annotations, it likely to be wrong IMO

Mar 07 '22 18:03 rachitnigam

One thought I had about if: if we are able to handle back edges in the control graph (for par, in the presence of while), could a similar approach handle inbalanced static ifs? I guess I'm trying to think about this idea of separate FSMs, and if it will also help with if.

Another thought I had along these lines: what if by default, every static control node had a dedicated FSM. Sharing FSMs could be an optimization that is done when safe and advantageous.

Anyway, curious what your current thinking is on this issue. I would like to help a) make the compiler correct, and b) allow it to output more "efficient" hardware when static information is known.

Mar 09 '22 21:03 mikeurbach

So, the previous version of this pass actually did exactly that. If you’re curious, take a look at the static_timing.rs file under passes in the source tree.

The problem, or rather, the question for me was could top down static timing generate much fewer FSMs and better overall circuits. I guess the current implementation shows that there is a trade off to be made here and we can’t just get smaller FSMs for free

Mar 09 '22 21:03 rachitnigam

I think we might be able to get away with one FSM per par thread while handling back edges and if edges without balancing. This is the same restriction tdcc works with.

Mar 09 '22 21:03 rachitnigam

So, the previous version of this pass actually did exactly that. If you’re curious, take a look at the static_timing.rs file under passes in the source tree.

The problem, or rather, the question for me was could top down static timing generate much fewer FSMs and better overall circuits. I guess the current implementation shows that there is a trade off to be made here and we can’t just get smaller FSMs for free

Awesome, good to hear there is prior work here, and I am not way off base.

I think we might be able to get away with one FSM per par thread while handling back edges and if edges without balancing. This is the same restriction tdcc works with.

That makes intuitive sense to me!

Mar 09 '22 21:03 mikeurbach

@rachitnigam I thought about this some more, and I think one possible point in the design space is to keep seq and par as-is (count up through the child states), and split off the FSMs for if and while.

For if, it might be possible to use one FSM counter, sized based on the larger child. It could count up to the appropriate value depending on which child is selected at runtime. I think this brings up the prior discussion about when the condition is expected to be valid. But we should be able to decide and then count up appropriately. We could also potentially generate separate FSMs for each child, and select one to count up at runtime. To join the children, wait for the selected counter to reach the limit.

For while, I think we can move the current logic into a separate FSM, and join it by waiting for the counter to reach the bound. I think this gets a little trickier, since the ifs might be taking different numbers of cycles at runtime.

I think this starts to blur the "static" lines, because we are sensitive to the latency of the different arms of if ops, and I don't see a way around that with simple counters that increase. So I guess I'm suggesting to outline chunks of the control program that can be a trivial counter that increments by 1, and treat the connections between such counters with some latency insensitivity.

Also- I have started thinking about "static timing" in terms of counters like this. If that mental model is making sense, we can probably update how we store state in the Schedule to be closer to this idea.

Let me know if you have any thoughts or work in progress on this, I might be able to take a crack at it this weekend.

Mar 11 '22 03:03 mikeurbach

Hey @mikeurbach, I've been doodling on a similar line of thought. Each one of these design points has a different trade-off in the kinds of programs it efficiently compiles. Just to summarize, our options are:

Different FSMs for each par thread and internal FSMs can skip transitions and use backedges. This makes sense when each "thread" is a big control program with non-linear control flow but wastes resources when the threads all have simple control flow.
One global FSM with non-linear control defining separate counters. This makes sense in the opposite case to (1)–when threads a simple, this will do well; when threads a complicated, it will generate too many such counters and possibly use more space than "one FSM per thread" model.
Forced Balanced FSMs (#943), back edges for while, separate FSM for par: Given that we have to separate out FSMs for par, (1) seems strictly better.

I think either (1) or (2) could be a good starting point. In the ideal case, we would have a heuristic that automatically switches between the two based on some metric of "thread complexity" in par blocks. If you envision control programs to remain mostly simple within par blocks, then go for (2). Otherwise (1) might simplify debugging and explainability of generated FSMs.

Mar 11 '22 14:03 rachitnigam

If you're going to work on it this weekend, keep an eye out on https://github.com/cucapra/calyx/pull/955 since it'll make compile-empty an optional pass for tdcc. I think another benefit of (1) is that you can mostly structure it the same as tdcc's implementation.

Mar 11 '22 14:03 rachitnigam

Makes sense, I do like the conceptual simplicity of the 1 FSM per thread idea. I am mainly thinking about pipelined loops, which right now are represented with something like while { par { state1; state2; etc. } }. I haven't gotten to loops with II > 1, but in that case, we'd have some guards like if state1_en { state1 }, but I don't foresee the control program getting much more complex within the thread for each pipeline stage. Anyway, I am in agreement that it would be good to trade off between different approaches.

A couple questions: for (1), you mentioned different FSMs, and for (2), you mentioned one global FSM, but with different counters. In my head, an FSM is a counter... what is the distinction that makes (1) have conceptually different FSMs, while (2) has one conceptual FSM? Also, for (1) were you imagining handling loops the way we currently do (i.e. unrolling), or is there something more to it?

Let me know if you make any more progress on your doodling and I'll do the same.

P.S. these questions about different versus global FSMs, and how to represent an FSM are really fun to hash out. Looking a bit down the road, I'm thinking about how we could represent these using CIRCTs (completely untested) FSM dialect. Once we've got one point in the design space nailed down in the Rust compiler, I want to try porting this to CIRCT to lean on the new FSM dialect and work out any kinks in its representations over there.

Mar 11 '22 16:03 mikeurbach

You’re right! Counters = FSM. I was just trying to use different terms to distinguish the two approaches. I think we can reasonably say that in (1), the number of FSMs are a function of the number of threads while in (2) they are a function of the number of control operations.

even in (1), when the threads are exactly one group, we’ll have no FSMs generated for them since we can directly use the global FSM to count up their states. Threads only need FSMs when they do non incrementing transitions.

I agree :). Hashing this out has been fun and I was also reading them FSM dialect Outline you wrote for compile control a few months ago to see if we can abstract some of these details. We might need some more complex notion of nested or multi state FSMs to encode these trade offs using one abstractions but figuring it out would be fun. Petri-Nets might serve as a better abstraction for these par thread and control handling FSMs

Mar 12 '22 13:03 rachitnigam

calyx calyx copied to clipboard

Problems with static FSMs

Conditional & Loops

Parallel Threads

Solutions

calyx
calyx copied to clipboard