[SCFToCalyx] Lower SCF Parallel Op To Calyx
This patch lowers the scf.parallel op to Calyx by invoking multiple calyx.components in parallel. I'm not including test files for now because some required features are not upstream yet.
But I can give an example, say we have:
module {
func.func @main() {
%c2 = arith.constant 2 : index
%c1 = arith.constant 1 : index
%c3 = arith.constant 3 : index
%c0 = arith.constant 0 : index
%alloc = memref.alloc() : memref<6xi32>
%alloc_1 = memref.alloc() : memref<6xi32>
scf.parallel (%arg2, %arg3) = (%c0, %c0) to (%c3, %c2) step (%c2, %c1) {
%4 = arith.shli %arg3, %c2 : index
%5 = arith.addi %4, %arg2 : index
%6 = memref.load %alloc_1[%5] : memref<6xi32>
%7 = arith.shli %arg2, %c1 : index
%8 = arith.addi %7, %arg3 : index
memref.store %6, %alloc[%8] : memref<6xi32>
scf.reduce
}
return
}
}
The output is:
module attributes {calyx.entrypoint = "main"} {
calyx.component @main(%clk: i1 {clk}, %reset: i1 {reset}, %go: i1 {go}) -> (%done: i1 {done}) {
%c2_i32 = hw.constant 2 : i32
%c1_i32 = hw.constant 1 : i32
%c0_i32 = hw.constant 0 : i32
%mem_1.addr0, %mem_1.write_data, %mem_1.write_en, %mem_1.clk, %mem_1.reset, %mem_1.read_data, %mem_1.done = calyx.memory @mem_1 <[6] x 32> [3] {external = true} : i3, i32, i1, i1, i1, i32, i1
%mem_0.addr0, %mem_0.write_data, %mem_0.write_en, %mem_0.clk, %mem_0.reset, %mem_0.read_data, %mem_0.done = calyx.memory @mem_0 <[6] x 32> [3] {external = true} : i3, i32, i1, i1, i1, i32, i1
%par_func_0_3_instance.in0, %par_func_0_3_instance.in1, %par_func_0_3_instance.in2, %par_func_0_3_instance.in4, %par_func_0_3_instance.clk, %par_func_0_3_instance.reset, %par_func_0_3_instance.go, %par_func_0_3_instance.done = calyx.instance @par_func_0_3_instance of @par_func_0_3 : i32, i32, i32, i32, i1, i1, i1, i1
%par_func_0_2_instance.in0, %par_func_0_2_instance.in1, %par_func_0_2_instance.in2, %par_func_0_2_instance.in4, %par_func_0_2_instance.clk, %par_func_0_2_instance.reset, %par_func_0_2_instance.go, %par_func_0_2_instance.done = calyx.instance @par_func_0_2_instance of @par_func_0_2 : i32, i32, i32, i32, i1, i1, i1, i1
%par_func_0_1_instance.in0, %par_func_0_1_instance.in1, %par_func_0_1_instance.in2, %par_func_0_1_instance.in4, %par_func_0_1_instance.clk, %par_func_0_1_instance.reset, %par_func_0_1_instance.go, %par_func_0_1_instance.done = calyx.instance @par_func_0_1_instance of @par_func_0_1 : i32, i32, i32, i32, i1, i1, i1, i1
%par_func_0_0_instance.in0, %par_func_0_0_instance.in1, %par_func_0_0_instance.in2, %par_func_0_0_instance.in4, %par_func_0_0_instance.clk, %par_func_0_0_instance.reset, %par_func_0_0_instance.go, %par_func_0_0_instance.done = calyx.instance @par_func_0_0_instance of @par_func_0_0 : i32, i32, i32, i32, i1, i1, i1, i1
calyx.wires {
}
calyx.control {
calyx.seq {
calyx.par {
calyx.seq {
calyx.invoke @par_func_0_0_instance[arg_mem_0 = mem_1, arg_mem_1 = mem_0](%par_func_0_0_instance.in0 = %c0_i32, %par_func_0_0_instance.in1 = %c2_i32, %par_func_0_0_instance.in2 = %c0_i32, %par_func_0_0_instance.in4 = %c1_i32) -> (i32, i32, i32, i32)
}
calyx.seq {
calyx.invoke @par_func_0_1_instance[arg_mem_0 = mem_1, arg_mem_1 = mem_0](%par_func_0_1_instance.in0 = %c1_i32, %par_func_0_1_instance.in1 = %c2_i32, %par_func_0_1_instance.in2 = %c0_i32, %par_func_0_1_instance.in4 = %c1_i32) -> (i32, i32, i32, i32)
}
calyx.seq {
calyx.invoke @par_func_0_2_instance[arg_mem_0 = mem_1, arg_mem_1 = mem_0](%par_func_0_2_instance.in0 = %c0_i32, %par_func_0_2_instance.in1 = %c2_i32, %par_func_0_2_instance.in2 = %c2_i32, %par_func_0_2_instance.in4 = %c1_i32) -> (i32, i32, i32, i32)
}
calyx.seq {
calyx.invoke @par_func_0_3_instance[arg_mem_0 = mem_1, arg_mem_1 = mem_0](%par_func_0_3_instance.in0 = %c1_i32, %par_func_0_3_instance.in1 = %c2_i32, %par_func_0_3_instance.in2 = %c2_i32, %par_func_0_3_instance.in4 = %c1_i32) -> (i32, i32, i32, i32)
}
}
}
}
} {toplevel}
calyx.component @par_func_0_0(%in0: i32, %in1: i32, %in2: i32, %in4: i32, %clk: i1 {clk}, %reset: i1 {reset}, %go: i1 {go}) -> (%done: i1 {done}) {
%true = hw.constant true
%std_slice_1.in, %std_slice_1.out = calyx.std_slice @std_slice_1 : i32, i3
%std_slice_0.in, %std_slice_0.out = calyx.std_slice @std_slice_0 : i32, i3
%std_add_1.left, %std_add_1.right, %std_add_1.out = calyx.std_add @std_add_1 : i32, i32, i32
%std_lsh_1.left, %std_lsh_1.right, %std_lsh_1.out = calyx.std_lsh @std_lsh_1 : i32, i32, i32
%load_0_reg.in, %load_0_reg.write_en, %load_0_reg.clk, %load_0_reg.reset, %load_0_reg.out, %load_0_reg.done = calyx.register @load_0_reg : i32, i1, i1, i1, i32, i1
%std_add_0.left, %std_add_0.right, %std_add_0.out = calyx.std_add @std_add_0 : i32, i32, i32
%std_lsh_0.left, %std_lsh_0.right, %std_lsh_0.out = calyx.std_lsh @std_lsh_0 : i32, i32, i32
%arg_mem_1.addr0, %arg_mem_1.write_data, %arg_mem_1.write_en, %arg_mem_1.clk, %arg_mem_1.reset, %arg_mem_1.read_data, %arg_mem_1.done = calyx.memory @arg_mem_1 <[6] x 32> [3] : i3, i32, i1, i1, i1, i32, i1
%arg_mem_0.addr0, %arg_mem_0.write_data, %arg_mem_0.write_en, %arg_mem_0.clk, %arg_mem_0.reset, %arg_mem_0.read_data, %arg_mem_0.done = calyx.memory @arg_mem_0 <[6] x 32> [3] : i3, i32, i1, i1, i1, i32, i1
calyx.wires {
calyx.group @bb0_2 {
calyx.assign %std_slice_1.in = %std_add_0.out : i32
calyx.assign %arg_mem_0.addr0 = %std_slice_1.out : i3
calyx.assign %load_0_reg.in = %arg_mem_0.read_data : i32
calyx.assign %load_0_reg.write_en = %true : i1
calyx.assign %std_add_0.left = %std_lsh_0.out : i32
calyx.assign %std_lsh_0.left = %in0 : i32
calyx.assign %std_lsh_0.right = %in1 : i32
calyx.assign %std_add_0.right = %in2 : i32
calyx.group_done %load_0_reg.done : i1
}
calyx.group @bb0_5 {
calyx.assign %std_slice_0.in = %std_add_1.out : i32
calyx.assign %arg_mem_1.addr0 = %std_slice_0.out : i3
calyx.assign %arg_mem_1.write_data = %load_0_reg.out : i32
calyx.assign %arg_mem_1.write_en = %true : i1
calyx.assign %std_add_1.left = %std_lsh_1.out : i32
calyx.assign %std_lsh_1.left = %in2 : i32
calyx.assign %std_lsh_1.right = %in4 : i32
calyx.assign %std_add_1.right = %in0 : i32
calyx.group_done %arg_mem_1.done : i1
}
}
calyx.control {
calyx.seq {
calyx.seq {
calyx.enable @bb0_2
calyx.enable @bb0_5
}
}
}
}
calyx.component @par_func_0_1(%in0: i32, %in1: i32, %in2: i32, %in4: i32, %clk: i1 {clk}, %reset: i1 {reset}, %go: i1 {go}) -> (%done: i1 {done}) {
// same idea
}
calyx.component @par_func_0_2(%in0: i32, %in1: i32, %in2: i32, %in4: i32, %clk: i1 {clk}, %reset: i1 {reset}, %go: i1 {go}) -> (%done: i1 {done}) {
// same idea
}
calyx.component @par_func_0_3(%in0: i32, %in1: i32, %in2: i32, %in4: i32, %clk: i1 {clk}, %reset: i1 {reset}, %go: i1 {go}) -> (%done: i1 {done}) {
// same idea
}
The design choice is worth discussing:
Re. why I created multiple FuncOps for the body of scf.parallel: to invoke something in parallel in Calyx using `par`, there are two candidates: `calyx.group` and `calyx.component`. The former is not expressive enough to hold the body of `scf.parallel`; the latter is, and since we know `FuncOp` will eventually be lowered to `calyx.component`, we can simply invoke the components in parallel in the `control` section, merely changing the operands passed to each component.
Question:
Can we invoke the same calyx.component together? If so, I can only create one @par_func instance instead of multiple ones.
@jiahanxie353 this patch is marked as a draft. Is that intentional? It would be better to review it once it is in a state where the design is mostly fixed and on the path to being merged.
I have a few C++/coding related comments, but I'll postpone these since you're in draft mode.
So currently, you've created N components that have the same M cells, and are subsequently invoked in parallel in some main component, i.e.,
component main() -> () {
cells {
A0 = A0(); A1 = A1(); A2 = A2(); ...; An = An();
}
control {
par { invoke A0; invoke A1; invoke A2; ...; invoke An; }
}
}
component A0() -> () { ... }
...
component An() -> () { ... }
The alternative is what you've questioned,
Can we invoke the same calyx.component together? If so, I can only create one @par_func instance instead of multiple ones.
i.e., create a component that takes as input M cells. This should be possible, but would require wiring the correct inputs and outputs.
component main() -> () {
cells {
A0 = A(); A1 = A(); ... ; An = A();
}
control {
par { invoke A0(...); invoke A1(...); ...; invoke An(...); }
}
}
component A() -> () { ... }
this patch is marked as a draft. Is that intentional? It would be better to review it once it is in a state where the design is mostly fixed and on the path to being merged.
I marked it as a draft for two reasons:
- The test cases can only be added once the other PRs are merged, since some required features are still missing;
- The high-level design is not fixed, and it would be great to get feedback from reviewers first. Once we have settled on a design choice (for instance, how many components to create, etc.), we can move it out of draft. That said, the implementation of the current design is solid.
This should be possible, but would require wiring the correct inputs and outputs.
Makes sense!
Discussion needed:
After translating to Calyx, I ran into the issue of a Calyx memory's write_en being driven by multiple sources, the root cause of which is memref.store %6, %alloc[%8] : memref<6xi32>.
This is because I'm passing the same memory by reference to different components. Under scf.parallel's semantics, we are certain that different memref.stores won't write to the same memory address. However, when lowering to Calyx, we lose this information and are left with
calyx.group @bb0_5 {
....
calyx.assign %arg_mem_1.write_en = %true : i1
....
}
across different components. And since the signal has multiple drivers, we can't lower it to Verilog.
Any thoughts?
Okay, the reasons for marking it a draft make sense @jiahanxie353!
In the scf.parallel's semantics, we are certain that we won't be storing to the same memory address by different memref.stores.
It is not sufficient to show that we are going to write to different memory locations. The problem is that a hardware memory can only support a single read or write per clock cycle (because it is a single-ported memory). For cases like this, you need to "bank" the memory into separate parts so that the accesses go to entirely different physical memories. First, I'd recommend reading the Dahlia paper to get a sense of what is going on. Next, we'd need to think more carefully about how this should be supported. @andrewb1999's work is adding this capability in a more general fashion, but we might need to bank memories somehow.
Also tagging @ethanuppal since he's been thinking about similar problems in his language.
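For intuition, here is a minimal sketch of the round-robin banking arithmetic that splits one logical memory into single-ported banks (illustration only; the `toBanked` helper and `numBanks` parameter are made up and not part of the patch):

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical helper: map a flat address of one logical memory onto
// `numBanks` physical banks, interleaving elements round-robin.
struct BankedAddress {
  std::size_t bank;   // which physical memory is accessed
  std::size_t offset; // address within that bank
};

BankedAddress toBanked(std::size_t flatAddr, std::size_t numBanks) {
  return {flatAddr % numBanks, flatAddr / numBanks};
}

int main() {
  // A memref<6xi32> split across 2 banks of memref<3xi32>:
  // addresses 0, 2, 4 land in bank 0; addresses 1, 3, 5 land in bank 1.
  for (std::size_t addr = 0; addr < 6; ++addr) {
    BankedAddress b = toBanked(addr, 2);
    std::printf("addr %zu -> bank %zu, offset %zu\n", addr, b.bank, b.offset);
  }
  return 0;
}
```

Two accesses can then happen in the same cycle only if they map to different banks.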
Is there a general algorithm to determine an "optimal" bank size for each memory?
Consider a somewhat made-up example with appropriate complexity:
module {
func.func @main(%buffer : memref<16xi32>) {
%idx_one = arith.constant 1 : index
%idx_two = arith.constant 2 : index
%idx_three = arith.constant 3 : index
%idx_four = arith.constant 4 : index
%idx_five = arith.constant 5 : index
%one = arith.constant 1 : i32
%six = arith.constant 6 : i32
%val = arith.addi %one, %six : i32
%step1 = arith.constant 1 : index
%step2 = arith.constant 2 : index
%lb1 = arith.constant 0 : index
%lb2 = arith.constant 1 : index
%ub1 = arith.constant 5 : index
%ub2 = arith.constant 4 : index
%mem1 = memref.alloc() : memref<25xi32>
scf.parallel (%iv1, %iv2) = (%lb1, %lb2) to (%ub1, %ub2) step (%step1, %step2) {
%mul1 = arith.muli %iv1, %idx_five : index
%load_idx1 = arith.addi %mul1, %idx_four : index
%load_val1 = memref.load %mem1[%load_idx1] : memref<25xi32>
%mul2 = arith.muli %iv1, %idx_two : index
%load_idx2 = arith.addi %mul2, %idx_three : index
%load_val2 = memref.load %mem1[%load_idx2] : memref<25xi32>
%sum1 = arith.addi %load_val1, %load_val2 : i32
%mul3 = arith.muli %iv2, %idx_one : index
%load_idx3 = arith.addi %mul3, %idx_one : index
%load_val3 = memref.load %mem1[%load_idx3] : memref<25xi32>
%mul4 = arith.muli %iv1, %idx_four : index
%load_idx4 = arith.addi %mul4, %idx_two : index
%load_val4 = memref.load %buffer[%load_idx4] : memref<16xi32>
%sum2 = arith.addi %load_val3, %load_val4 : i32
%store_idx = arith.muli %iv2, %idx_three : index
%store_val = arith.addi %sum1, %sum2 : i32
memref.store %store_val, %buffer[%store_idx] : memref<16xi32>
}
return
}
}
And I have generated the memory access patterns:
| Memory access | (iv1=0, iv2=1) | (0, 3) | (1, 1) | (1, 3) | (2, 1) | (2, 3) | (3, 1) | (3, 3) | (4, 1) | (4, 3) |
|---|---|---|---|---|---|---|---|---|---|---|
| load mem1[5*iv1+4] | 4 | 4 | 9 | 9 | 14 | 14 | 19 | 19 | 24 | 24 |
| load mem1[2*iv1+3] | 3 | 3 | 5 | 5 | 7 | 7 | 9 | 9 | 11 | 11 |
| load mem1[3*iv2+1] | 4 | 10 | 4 | 10 | 4 | 10 | 4 | 10 | 4 | 10 |
| load buffer[4*iv1+2] | 2 | 2 | 6 | 6 | 10 | 10 | 14 | 14 | 18 | 18 |
| store buffer[3*iv2] | 3 | 9 | 3 | 9 | 3 | 9 | 3 | 9 | 3 | 9 |
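For reference, the table above can be reproduced with a small standalone sketch (illustration only, not part of the patch) that enumerates the iteration space and evaluates the address expressions:

```cpp
#include <cstdio>

int main() {
  // Iteration space of the scf.parallel above:
  // iv1 in [0, 5) step 1, iv2 in [1, 4) step 2.
  for (int iv1 = 0; iv1 < 5; iv1 += 1) {
    for (int iv2 = 1; iv2 < 4; iv2 += 2) {
      std::printf("(iv1=%d, iv2=%d): load mem1[%d], load mem1[%d], "
                  "load mem1[%d], load buffer[%d], store buffer[%d]\n",
                  iv1, iv2,
                  5 * iv1 + 4,  // %load_idx1
                  2 * iv1 + 3,  // %load_idx2
                  3 * iv2 + 1,  // %load_idx3
                  4 * iv1 + 2,  // %load_idx4
                  3 * iv2);     // %store_idx
    }
  }
  return 0;
}
```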
And I realize that, for this example, according to the documentation:
Semantically we require that the iteration space can be iterated in any order, and the loop body can be executed in parallel. If there are data races, the behavior is undefined.
Since (iv1=0, iv2=1) and (iv1=0, iv2=3) both access mem1's 4th and 3rd elements, there is a potential conflict and we cannot get any parallelism here. Granted, this example is completely artificial and I may be over-complicating it. But if it were to happen, should we reject this kind of access pattern in the first place?
And if we take a step back to consider a somewhat simplified access pattern with only one induction variable:
| Memory access | iv1=0 | iv1=1 | iv1=2 | iv1=3 | iv1=4 |
|---|---|---|---|---|---|
| load mem1[5*iv1+4] | 4 | 9 | 14 | 19 | 24 |
| load mem1[2*iv1+3] | 3 | 5 | 7 | 9 | 11 |
What is the algorithm for determining the banking factor?
If we were to use a brute-force algorithm, we would search from `bankingFactor` equal to the number of parallel accesses (5 in the example above) down to 1.
Let's arbitrarily take `bankingFactor = 4`. We find that:
- Group 1:
| Memory access | iv1=0 |
|---|---|
| load mem1[5*iv1+4] | 4 |
| load mem1[2*iv1+3] | 3 |
consumes two banks: the first bank (4 % 4 = 0) and the fourth bank (3 % 4 = 3).
- Group 2:
| Memory access | iv1=1 |
|---|---|
| load mem1[5*iv1+4] | 9 |
| load mem1[2*iv1+3] | 5 |
consumes one bank, the second bank (9 % 4 = 1, 5 % 4 = 1). Groups 1 and 2 can be executed in parallel since they do not share any bank:
par {
component1 {
iv1 = 0;
load mem1[5*iv1+4];
load mem1[2*iv1+3];
}
component2 {
iv1 = 1;
load mem1[5*iv1+4];
load mem1[2*iv1+3];
}
}
- Group 3:
| Memory access | iv1=2 |
|---|---|
| load mem1[5*iv1+4] | 14 |
| load mem1[2*iv1+3] | 7 |
consumes two banks: the third (14 % 4 = 2) and the fourth (7 % 4 = 3). So Group 3 cannot be executed in parallel with Group 1, but it can be executed in parallel with Group 2.
- Group 4:
| Memory access | iv1=3 |
|---|---|
| load mem1[5*iv1+4] | 19 |
| load mem1[2*iv1+3] | 9 |
consumes two banks: the fourth (19 % 4 = 3) and the second (9 % 4 = 1). So Group 4 conflicts with Group 1 (both use the fourth bank), Group 2 (both use the second bank), and Group 3 (both use the fourth bank).
- Group 5:
| Memory access | iv1=4 |
|---|---|
| load mem1[5*iv1+4] | 24 |
| load mem1[2*iv1+3] | 11 |
consumes two banks: the first (24 % 4 = 0) and the fourth (11 % 4 = 3). So Group 5 conflicts with Groups 1, 3, and 4 (they all use the fourth bank). Group 4 in particular conflicts with every other group, so it is forced to execute in sequence, isolated from the others; and so on with further analysis.
Putting the different groups together and determining which ones can be executed in parallel almost sounds like a graph-coloring problem. Is there any general way to do it?
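To make that concrete, here is a rough sketch of the bookkeeping described above (my own illustration, not the algorithm in this patch): assign each access to bank `addr % bankingFactor`, build a conflict graph between groups that share a bank, and greedily color it so that groups with the same color could go into one `calyx.par`:

```cpp
#include <cstdio>
#include <set>
#include <vector>

int main() {
  const unsigned bankingFactor = 4;
  // One entry per group (iv1 = 0..4); each holds the addresses it accesses.
  std::vector<std::vector<unsigned>> groups = {
      {4, 3}, {9, 5}, {14, 7}, {19, 9}, {24, 11}};

  // Bank set per group: a group "consumes" bank (addr % bankingFactor).
  std::vector<std::set<unsigned>> banks(groups.size());
  for (unsigned g = 0; g < groups.size(); ++g)
    for (unsigned addr : groups[g])
      banks[g].insert(addr % bankingFactor);

  // Two groups conflict if their bank sets intersect.
  auto conflicts = [&](unsigned a, unsigned b) {
    for (unsigned bank : banks[a])
      if (banks[b].count(bank))
        return true;
    return false;
  };

  // Greedy coloring: groups with the same color never share a bank and could
  // therefore be invoked in the same calyx.par.
  std::vector<unsigned> color(groups.size(), 0);
  for (unsigned g = 0; g < groups.size(); ++g) {
    std::set<unsigned> used;
    for (unsigned h = 0; h < g; ++h)
      if (conflicts(g, h))
        used.insert(color[h]);
    while (used.count(color[g]))
      ++color[g];
    std::printf("group %u -> color %u\n", g + 1, color[g]);
  }
  return 0;
}
```

Greedy coloring is only a heuristic, of course; it gives an upper bound on the number of sequential waves, not necessarily the optimum.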
- In addition, we might move on to other `bankingFactor`s and check whether we can achieve higher parallelism.
- Worst case, if the work described above is too tedious, or if no banking factor is feasible, we fall back to a banking factor of 1, meaning there will be no parallelism at all: there will only be one `component` in the `par` group, and everything inside `scf.parallel` will be unrolled and executed in sequence inside that `component`.
Any suggestions on how to tackle this?
You can look at the TraceBank paper for details on how to automatically determine good banking factors. I think an optimal solution exists, but for large programs you might have to use heuristics.
I second TraceBank.
Whoo, finally finished implementing everything! It's a lot trickier than I thought.
Please let me know what you think!
Hey @jiahanxie353, I'm going to focus on the implementation first. There are >1,300 new lines of code here; is it possible to break this up into multiple PRs? Second, can you give some insight behind the approach you've taken?
Thanks, @cgyurgyik !
is it possible to break this up into multiple PRs
Sure, I'll try!
can you give some insight behind the approach you've taken
Sure! My strategy is to:
- Run partial evaluation on each `scf.parallelOp` by:
  - manually computing the static value of each induction variable (needed for running the TraceBank algorithm, since it needs the memory trace values, which have to be known at compile time), and
  - copying the body block and inserting the new blocks into the `scf.parallelOp`'s body region, where each block corresponds to one induction-variable combination.
- For example, if we have:
module {
func.func @main() {
%c2 = arith.constant 2 : index
%c1 = arith.constant 1 : index
%c3 = arith.constant 3 : index
%c0 = arith.constant 0 : index
%alloc = memref.alloc() : memref<6xi32>
%alloc_1 = memref.alloc() : memref<6xi32>
scf.parallel (%arg2, %arg3) = (%c0, %c0) to (%c3, %c2) step (%c2, %c1) {
%4 = arith.shli %arg3, %c2 : index
%5 = arith.addi %4, %arg2 : index
%6 = memref.load %alloc_1[%5] : memref<6xi32>
%7 = arith.shli %arg2, %c1 : index
%8 = arith.addi %7, %arg3 : index
memref.store %6, %alloc[%8] : memref<6xi32>
scf.reduce
}
return
}
}
it will have 4 blocks in total, corresponding to induction variables (0, 0), (0, 1), (2, 0), (2, 1), respectively. Then, I will copy the block and turn scf.parallel into:
scf.parallel (%arg2, %arg3) = (%c0, %c0) to (%c3, %c2) step (%c2, %c1) {
^bb0:
%induction_var1, %induction_var2 = 0, 0
%4 = arith.shli %induction_var2, %c2 : index
%5 = arith.addi %4, %induction_var1 : index
%6 = memref.load %alloc_1[%5] : memref<6xi32>
%7 = arith.shli %induction_var1, %c1 : index
%8 = arith.addi %7, %induction_var2 : index
memref.store %6, %alloc[%8] : memref<6xi32>
scf.reduce
^bb1:
%induction_var1, %induction_var2 = 0, 1
%4 = arith.shli %induction_var2, %c2 : index
%5 = arith.addi %4, %induction_var1 : index
%6 = memref.load %alloc_1[%5] : memref<6xi32>
%7 = arith.shli %induction_var1, %c1 : index
%8 = arith.addi %7, %induction_var2 : index
memref.store %6, %alloc[%8] : memref<6xi32>
^bb2:
similar...
^bb3:
similar...
}
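The induction-variable combinations that drive this block duplication are just the Cartesian product of the (lower bound, upper bound, step) ranges. A hedged sketch of that enumeration (illustration only; the actual pass works on the MLIR ops directly):

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// One (lower bound, upper bound, step) triple per induction variable
// of the scf.parallel.
struct Range {
  int lb, ub, step;
};

int main() {
  std::vector<Range> ranges = {{0, 3, 2}, {0, 2, 1}}; // (%arg2, %arg3)

  // Cartesian product of the ranges; each combination becomes one copied
  // block (^bb0, ^bb1, ...) with the induction variables constant-folded.
  std::vector<std::vector<int>> combos = {{}};
  for (const Range &r : ranges) {
    std::vector<std::vector<int>> next;
    for (const auto &prefix : combos)
      for (int iv = r.lb; iv < r.ub; iv += r.step) {
        auto extended = prefix;
        extended.push_back(iv);
        next.push_back(extended);
      }
    combos = std::move(next);
  }

  // Prints (0, 0), (0, 1), (2, 0), (2, 1): the four blocks above.
  for (std::size_t i = 0; i < combos.size(); ++i)
    std::printf("^bb%zu: (%d, %d)\n", i, combos[i][0], combos[i][1]);
  return 0;
}
```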
- Verify that each memory read/write can indeed run in parallel:
  - The algorithm I used is the TraceBank algorithm. Basically, we group all addresses that need to be accessed in parallel into different "steps", and for each "step" we need to make sure there are no memory access conflicts.
  - For example, if we have:
scf.parallel (%arg2) = (%c0) to (%c2) step (%c1) {
^bb0:
%induction_var = 0
memref.store %1, %alloc[%induction_var] : memref<2xi32>
scf.reduce
^bb1:
%induction_var = 1
memref.store %1, %alloc[%induction_var] : memref<2xi32>
scf.reduce
}
it will raise an error because we can't access %alloc in parallel even though we are indexing using different addresses (0 in bb0 and 1 in bb1);
- To this end, we need to separate the memories into banks, and we give users full control over how many banks each memory gets by passing in a `json` file. If the number of available banks is sufficient, we split the original memory into the corresponding banks in a round-robin way; if not, we raise an error saying there are not enough banks for parallel access.
- The details of the algorithm can be found here. To summarize, there are two main steps: (1) `findMasksBits`, which finds all mask bits (mask bits "mask" the memory trace/memory access values) that result in zero address conflicts; (2) `mapMaskIDsToBanks`, which maps the memory trace values (it is called `mapMaskIDsTo...` because we use the mask bits produced in the previous step to mask the memory trace values, resulting in more condensed, shortened values) to their corresponding banks. The main data structures I used are `ConflictGraph` and `CliqueGraph`. `ConflictGraph` computes potential memory access conflicts via graph coloring (using a maximal clique as a lower bound on the number of colors); `CliqueGraph` finds maximal cliques in a graph. It's used in two places:
  - during the test of graph colorability, as mentioned above;
  - and in a second, very interesting case, where we need to find maximal independent sets to identify the "steps" in the TraceBank algorithm:
    - By definition, a "step" is a list of addresses that need to be accessed in parallel. It's intuitive to map the concept of a "step" to a block inside `scf.parallel`'s region, since each block runs in parallel. Unfortunately, this didn't work out for me.
    - The caveat is when we have nested `scf.parallel`s, such as @rachitnigam's Dahlia matmul example. My criterion for parallelizable blocks is that they are not in the same child block of a common ancestor `scf.parallel` (code here).
    - For example, if we have:
scf.parallel0{
^bb0_0:
scf.parallel0_0 {
^bb0_0_0:
op1;
op2;
^bb0_0_1:
op3;
op4;
}
^bb0_1:
scf.parallel0_1 {
^bb0_1_0:
op5;
op6;
^bb0_1_1:
op7;
op8;
}
op9
}
op1 cannot run in parallel with op2 because they share the closest common ancestor scf.parallel0_0, which has two child blocks, bb0_0_0 and bb0_0_1, and op1 and op2 happen to be in the same child block. But op1 can run in parallel with op3 or op4. op1 can also run in parallel with op5, since they share the closest common ancestor scf.parallel0 and op1 is in its child block bb0_0 while op5 is in bb0_1 (which makes sense if we look at the semantics). Similarly, op1 can run in parallel with op9, but op5 cannot. So some blocks can be executed in parallel while others cannot. I try to find all sets of blocks that can run in parallel by modeling every block as a node in a graph, connecting non-parallelizable blocks with edges, and then reducing the problem to finding maximal independent sets. Interestingly, finding maximal independent sets is the dual of finding maximal cliques, as long as we turn the adjacency list/matrix into its complement.
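As a small illustration of that duality (my own sketch, unrelated to the actual `CliqueGraph` implementation): complementing the adjacency relation turns "find maximal independent sets of the conflict graph" into "find maximal cliques", so a clique finder can be reused unchanged.

```cpp
#include <cstdio>
#include <vector>

// Naive helper: grow one maximal clique of `adj` starting from `seed`.
// (A real implementation would enumerate all maximal cliques, e.g. with
// Bron-Kerbosch; this only demonstrates the complement trick.)
std::vector<int> growMaximalClique(const std::vector<std::vector<bool>> &adj,
                                   int seed) {
  std::vector<int> clique = {seed};
  int n = adj.size();
  for (int v = 0; v < n; ++v) {
    if (v == seed)
      continue;
    bool connectedToAll = true;
    for (int u : clique)
      if (!adj[u][v])
        connectedToAll = false;
    if (connectedToAll)
      clique.push_back(v);
  }
  return clique;
}

int main() {
  // Conflict graph: an edge means "these two blocks may NOT run in parallel".
  // Hypothetical 4-block example: blocks 0-1 conflict and blocks 2-3 conflict.
  int n = 4;
  std::vector<std::vector<bool>> conflict(n, std::vector<bool>(n, false));
  conflict[0][1] = conflict[1][0] = true;
  conflict[2][3] = conflict[3][2] = true;

  // Complement: an edge now means "these two blocks CAN run in parallel",
  // so a maximal clique here is a maximal independent set of the original.
  std::vector<std::vector<bool>> parallelizable(n, std::vector<bool>(n, false));
  for (int u = 0; u < n; ++u)
    for (int v = 0; v < n; ++v)
      parallelizable[u][v] = (u != v) && !conflict[u][v];

  for (int seed = 0; seed < n; ++seed) {
    std::vector<int> step = growMaximalClique(parallelizable, seed);
    std::printf("seed %d -> parallel set:", seed);
    for (int v : step)
      std::printf(" %d", v);
    std::printf("\n");
  }
  return 0;
}
```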
- After identifying the parallelizable blocks, I pick operations out of those blocks to generate "step"s (code here). For example, `{op1, op3, op5, op8}` is a valid step; `{op1, op4, op9}` is also a valid step; etc.
- Then, after verification, it's just a series of transformations to divide the original memories into banks based on the mapped mask IDs (code starts here).
- Finally, we build the control for `scf.parallel`: each `scf.parallel` corresponds to a `calyx.par`, and each block inside `scf.parallel` becomes a `calyx.seq` nested inside the `calyx.par`.
Please let me know if you have any question and/or suggestion, thanks!
Very cool! Thanks for the detailed explanation, it definitely clears things up. It would be great if you could split this up into several PRs, e.g., leave the nested par for a future PR. In general, I think the approach looks good, but I am concerned about introducing an exponential algorithm (Bron-Kerbosch maximal clique enumeration) into a compiler setting. At the very least, this should probably not be turned on by default.
This breaks down into https://github.com/llvm/circt/pull/7804 and https://github.com/llvm/circt/pull/7830.