calyx `group2seq` can make programs slower

Open ayakayorihiro opened this issue 5 months ago • 1 comments

I ran into an interesting phenomenon with the group2seq pass when running the profiler pipeline (validation --> compile-invoke --> profiler-instrumentation --> pre-opt ...) on some sample programs. It turns out that running compile-invoke before group2seq can, in certain situations, produce faster programs than -p all (where group2seq runs before compile-invoke), because it effectively disables group2seq. This means that the optimization pass group2seq can actually pessimize programs instead of optimizing them.

The minimized example program contains a subcomponent (child_component) that contains a single group (reading in a value from a ref-ed memory):

    group read {
      arg_mem.addr0 = 1'b0;
      arg_mem.content_en = 1'd1;
      load_1_reg.in = arg_mem.read_data;
      load_1_reg.write_en = arg_mem.done;
      read[done] = load_1_reg.done;
    }

When compile-invoke is run before the optimization passes, the ports of arg_mem get converted to wires (e.g. arg_mem.addr0 --> arg_mem_addr0), which prevents group2seq from splitting the group into two parts: (left is normal compilation, right is when compile-invoke is run before opt)

So in normal compilation we are now dealing with two groups under a seq, whereas in the compile-invoked version we have a single group. The end_spl_read group on the left gets converted to an invoke of the register load_1_reg. Unfortunately, the seq in the left/normal version of the code doesn't get converted to static during static-inference and static-promotion :disappointed: I think this is once again because the ref-ed memory ports are converted to wires, which static inference can't reason about, so beg_spl_read is not promotable.

Because the control retains

seq { beg_spl_read; invoke0; }

by the time we get to TDCC, we effectively end up with an FSM with 3 states (beg_spl_read --> invoke0 --> done), which requires 2 extra cycles before we get to the end. When run, this version of the program takes 4 cycles. The version with compile-invoke before optimization, however, still retains all the logic in one group, so it doesn't need an FSM. When run, this version of the program takes 2 cycles.
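
To make the arithmetic above concrete, here is a toy Python accounting of the two versions. It is only a sketch, not how TDCC actually computes anything: it assumes the work inside the original `read` group takes 2 cycles in total, that a seq gets one FSM state per child plus a final done state, and that each FSM transition adds one cycle. The function names are made up for illustration.

```python
# Toy cycle accounting for the two compilations described in this issue.
# Assumptions (illustrative only, not taken from the Calyx compiler):
#   * the work inside the original `read` group takes 2 cycles in total
#   * a seq gets one FSM state per child plus a final `done` state
#   * each transition between FSM states adds one cycle
#   * a single group needs no FSM at all


def fsm_states(control):
    """Number of FSM states for a list of sequentially enabled groups."""
    return len(control) + 1 if len(control) > 1 else 0


def total_cycles(control, work_cycles):
    """Work cycles plus one extra cycle per FSM transition."""
    extra = max(fsm_states(control) - 1, 0)
    return work_cycles + extra


WORK = 2  # memory read followed by register write

split = ["beg_spl_read", "invoke0"]  # normal compilation (-p all)
single = ["read"]                    # compile-invoke before group2seq

print(total_cycles(split, WORK))   # 4
print(total_cycles(single, WORK))  # 2
```

Under these assumptions the split version pays for 2 FSM transitions that the single-group version avoids, which matches the 4 vs. 2 cycles observed.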

Here is the minimized program (m.txt can be renamed to m.futil and m.json can be renamed to m.data if desired) in case anyone wants to play around with it! m.txt m.json

Jun 04 '25 16:06 ayakayorihiro

Note: As @sampsyo suggested, the ideal solution is to make group2seq never pessimize programs by updating TDCC (or some other downstream pass) to react to situations like this, but we should wait until the TDCC FSM update is fully done!

Jun 04 '25 16:06 ayakayorihiro
