Rethinking Static Compilation

Open calebmkim opened this issue 7 months ago • 2 comments

(Some of this post is duplicated from my lab notebook entry).

Recap of Problem

Suppose we have the following control structure (assume all groups take 1 cycle):

static seq {A; B; static repeat 10 { <body with latency N> } C; D; }

FSMs are currently implemented like this:

parent FSM: 0->1->2->3->4 // continue to count   -> 2 + (10*N) + 1 -> etc. 
child FSM:        0->1->.....->N->0->1->2.... // will repeat 10 times

As you can see, because the parent does not stop counting when it offloads to the child, it can drastically increase the size of the parent register.

I implemented FSMs so that the parent pauses when it offloads to the child:

parent FSM: 0->1->2  // parent pauses instead of counting  -> 3 -> 4 etc. 
child FSM:        0->1->.....->N->0 // child will repeat 10 times 
iteration FSM:    0->0->0->.....->1  // iteration FSM holds the number of iterations that have occured.

Note that, unlike with the existing compilation technique, we need an iteration FSM to count the number of iterations of the repeat loop (the existing compilation technique doesn't need to do this: it can look at the parent FSM to know when to stop).

Results

Doing this new technique has led to some pretty good resource usage and worst slack results (compare the red line with the orange bars on these graphs: https://github.com/orgs/calyxir/discussions/2202#discussioncomment-10038812). But the tl;dr is that if we intelligently handle static-par blocks (see below) we can get results that are pretty clearly better than the existing compilation technique.

Challenge: compiling `static par` (warning: lots of low-level details)

One of the really nice things about compiling the existing compilation technique is that static pars are really easy to compile. Suppose we have the following control structure:

static par {
  static seq { <something that takes 5 cycles> static repeat 5 {<body takes 10 cycles>} }
  static seq { <something that takes 22 cycles> static repeat 10 {<body takes 5 cycles>} } 
}

The FSM will look like the following:

parent FSM: 0->1->2...5->6->7->.....................22->23->24->....................68->69->70
child FSM             0->1->...5->0->1->.... // repeats 10 times 
child FSM                                            0->1->...10->0->1->.... // repeats 5 times

The nice thing is that each child FSM easily knows exactly when to start since it can refer to the parent FSM. For example, the first child needs to start on cycle 5, so it starts when the parent FSM equals 5. The second child needs to start on cycle 22, so it starts when the parent FSM equals 22.

Using the new technique:

parent FSM: 0->1->2...5... // parent stops at 5 
child FSM             0->1->...5->0->1->.... // repeats 10 times
iteration FSM         0->0->0->...1 // iteration counts the iteration number
  
child FSM             // I need to start on cycle 22... when do I start?

The challenge is knowing when the second child should start (and this problem obviously generalizes to the third, fourth, etc. children if they exist too). You can't just tell the second child to start when the FSM==22, since that isn't actually the 22nd cycle (becasue of the FSM stopping at 5 to offload to its children).

The way I see it, there are two options. At first, I thought of trying to go for option 1, but then quickly switched to option 2.

Option 1: having a "reference" thread.

(This is NOT my chosen implementation, so you should skip this if you only care about what I actually implemented and not my decisions for why I wanted to implement things the way they are.)

Perhaps the first instinct is try a compilation technique that is most similar to the existing technique (i.e., having a single FSM that all children can read from). The group of {parent FSM, child FSM, iteration FSM} can serve the same function as the previous parent thread. So for our example, the second child FSM will be activated from cycles 22 through 70. Unlike the existing technique, in which the guard is a simple 22 <= parent fsm < 70 the guard is going to be a bit more complicated.

We need to translate the query %[22:70] into an fsm guard. We'll start with

parent_fsm >= 6 | <some other condition>

parent_fsm >= 6 covers cycles 55->70.
But we still need to fill out <some other condition> cover cycles 22->55. This corresponds to when the parent fsm is offloading to the child FSM.

parent_fsm >= 6 | parent_fsm == 5 & <some condition involving the child/iterator>

Now we've covered when the parent_fsm is offloading, but we still need <some condition involving the child/iterator> to capture the start condition exaclty at cycle 22 (corresponding to cycle 17 of the child). Let's add that:

parent_fsm >= 6 | parent_fsm == 5 & ((iteration_fsm == 3 && child_fsm >=2) | iteration_fsm >=4 )

Basically, we need the condition to start being true on the 2nd cycle 4th iteration (i.e., when iteration_fsm==3 and child_fsm==2), and continue to be true for the remaining iterations (i.e., when iteration_fsm >=4).

Obviously this type of complicated guard will hurt resource usage and probably worst slack as well. And if the loop is nested deeper than the single loop in this example, this guard can get even more complicated.

One way to mitigate this would be to having a one bit register guard_reg that sets itself high when parent_fsm == 5 & ((iteration_fsm == 3 && child_fsm ==1) (child_fsm ==1 because it takes one cycle to write to the register, and we want it to become high when child_fsm==2) and then guarding with guard_reg.out instead... but I'm not sure if this would help or not.

I ended up not implementing things using this technique a) because I thought it would be too difficult to implement and b) it might not even be the best in terms of resources/worst slack. But I think it's possible that this actually may be the best choice.

Option 2: creating a new parent FSM for each thread

Basically, we create a new parent FSM for each thread in the static par block. Similar to how we compile dynamic par blocks, where we basically fire off each thread independently, and then wait for them all to finish.

Kind of like this:

parent FSM: 0->1->2...5... // parent stops at 5 
child FSM             0->1->...5->0->1->.... // repeats 10 times
iteration FSM         0->0->0->...1 // iteration counts the iteration number
  
parent FSM: 0->1->2->3....22... // parent stops at 22 
child FSM                  0->1->...10->0->1->.... // repeats 5 times
iteration FSM              0->0->0......1 // iteration counts the iteration number

This will obviously increases register usage, but I'm hopeful that it will decrease logic (LUT) usage since the guards should be simpler.

Also, this is easier to implement since "fire off each thread and wait for them to finish" can be implemented using the same offloading abstraction used to offload the static repeat bodies (see next section).

What IR abstractions are useful for implementing this?

This FSM generation can get tricky, so you clearly need some sort of abstraction to help you (for example, none of my examples so far even discussed nested loops, which adds another layer of complication). I settled on a tree structure (although I don't know if that's the best abstraction). This makes handling nested loops much easier (one layer of nesting in the loop <-> one layer in the tree).

Each node in the tree has a body latency and number of repeats associated with it, and each child represents an FSM that the parent FSM has offloaded execution to.

For example, if we had the following control:

static seq {<take 10 cycles>; static repeat 10 {  <body latency 5> } <take 10 cycles>; static repeat 5 {  <body latency 10> }}

We'll get the following tree: Screenshot 2024-07-13 at 11 01 22 AM

What abstraction should we use for `static par`?

The one weird thing about the abstraction is that each child can be a 1) a single tree or 2) a group of trees (this is for static par blocks). The specific designation of "a group of trees" is so that I can distinguish between children that need to execute at the same time vs. children that do not execute at the same time.

So for example if you had:

static seq {<take 10 cycles>; static par {static repeat 10 {<body latency 5> } static repeat 5 {  <body latency 10> }}}

The tree would look like: Screenshot 2024-07-13 at 11 01 33 AM

Code

The code is on the static-repeat-compilation branch: https://github.com/calyxir/calyx/tree/static-repeat-compilation. It is very messy and not at all readable. Honestly I would not recommend trying to read the code. I will work on improving readability this week.

Jul 13 '24 15:07 calebmkim

calyx calyx copied to clipboard

Rethinking Static Compilation

Recap of Problem

Results

Challenge: compiling static par (warning: lots of low-level details)

Option 1: having a "reference" thread.

Option 2: creating a new parent FSM for each thread

What IR abstractions are useful for implementing this?

What abstraction should we use for static par?

Code

calyx
calyx copied to clipboard

Challenge: compiling `static par` (warning: lots of low-level details)

What abstraction should we use for `static par`?