calyx Reduce `par` threads when latency penalty is known

Open rachitnigam opened this issue 1 year ago • 1 comments

When tdcc compiles a par block, it allocates a new FSM for each thread:

par { A; B; C; }

Each sub-program gets its own FSM so that threads can make progress independently. However, we sometimes have programs that look like this:

par {
  while cond { B };
  upd_reg;
}

In this case, upd_reg takes 1 cycle and the loop may take thousands. Regardless, we still allocate a whole new FSM for upd_reg. It would be better to just transform this into a seq instead and use exactly one FSM. The challenge is that, in general, we don't know how long a "simple looking" control program will take; after all, a loop is compiled into a group at some point.

Instead, we should use the newly added @promote_static(n) attribute to detect when a group that wasn't upgraded to a static island takes only a small fraction of the cycle time of the other threads, and then sequence it with one of those threads. We can expose a compiler knob to decide what the exact fraction should be, but the upshot is that this will let us reduce the number of FSMs we allocate.
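A rough sketch of the heuristic, in Python rather than the compiler's own code. Everything here is hypothetical and for illustration only: the `Thread` model, the `reschedule_par` function, the `max_fraction` knob, and especially `assumed_unknown_latency`, which stands in for "this thread probably runs for a long time" when no latency is known. The idea is to pick the longest thread as a host and fold in any thread whose known latency fits within a small fraction of the host's latency.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Thread:
    """One child of a par block (hypothetical model, not a Calyx IR type)."""
    name: str
    # Latency in cycles when known, e.g. from @promote_static(n).
    # None means the compiler has no estimate (such as the while loop).
    latency: Optional[int] = None


def reschedule_par(
    threads: List[Thread],
    max_fraction: float = 0.01,            # hypothetical compiler knob
    assumed_unknown_latency: int = 1000,   # assumption: unknown means "long"
) -> List[List[str]]:
    """Return groups of thread names; each group would become one seq (one FSM)."""
    if not threads:
        return []

    def effective(t: Thread) -> int:
        return t.latency if t.latency is not None else assumed_unknown_latency

    # The host is the thread expected to run longest.
    host = max(threads, key=effective)
    # Total extra cycles we are willing to add to the host.
    budget = max_fraction * effective(host)

    # Only threads with a known latency may be folded into the host.
    candidates = sorted(
        (t for t in threads if t is not host and t.latency is not None),
        key=lambda t: t.latency,
    )
    merged = []
    for t in candidates:
        if t.latency <= budget:
            merged.append(t)
            budget -= t.latency

    merged_ids = {id(t) for t in merged}
    groups = [[host.name] + [t.name for t in merged]]
    groups += [
        [t.name] for t in threads if t is not host and id(t) not in merged_ids
    ]
    return groups


# The example from this issue: the loop has no known latency, upd_reg takes 1 cycle.
print(reschedule_par([Thread("loop"), Thread("upd_reg", 1)]))
```

With the defaults, the two-thread example collapses into a single group, so only one FSM would be needed. A real pass would also have to decide where in the host's seq the short group goes and check that sequencing preserves the program's behavior; this sketch ignores both.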

It also occurs to me that the static inlining pass should annotate the generated group with a @promote_static(n) attribute so that this information can be used to reschedule par threads.

Dec 27 '23 00:12 rachitnigam

@calyxir/static-calyx this is another example of how the static extensions help the overall compiler pipeline. We should implement this for the camera-ready and brag about the resource benefits we get.

Dec 27 '23 00:12 rachitnigam
