calyx icon indicating copy to clipboard operation
calyx copied to clipboard

Large, multi-component calyx programs crash during wire-inlining pass

Open susan-garry opened this issue 2 years ago • 3 comments

As in the title. It appears to be a stack overflow issue.

The files I'm using: DRB1.futil.txt DRB1.data.txt

The command I'm using: fud e --to dat --through verilog -s verilog.data DRB1.data DRB1.futil

The debugger's error message (excluding verilog code):

[fud] DEBUG: futil.run_futil(<_io.BufferedReader name='DRB1.futil'>)
[fud] DEBUG: /scratch/susan/calyx/target/debug/futil -l /scratch/susan/calyx -b\
 verilog
[fud] DEBUG: Stdin is `<class '_io.BufferedReader'>`
[WARN  calyx::pass_manager] cell-share: 431198ms
[WARN  calyx::pass_manager] static-par-conv: 18496ms
[WARN  calyx::pass_manager] tdcc: 11343ms
[WARN  calyx::pass_manager] comb-prop: 16894ms
[WARN  calyx::pass_manager] dead-cell-removal: 32220ms
[WARN  calyx::pass_manager] wire-inliner: 13918ms

thread 'main' has overflowed its stack
fatal runtime error: stack overflow
Aborted (core dumped)

The release code's error message (excluding verilog code):

[fud] ERROR: `/opt/verilator-v4.220/bin/verilator -cc --trace /tmp/tmpdlplnbix \
--exe /scratch/susan/calyx/fud/sim/testbench.cpp --build --top-module main --Md\
ir /tmp/tmphaaq4qxa>&2' failed:
=====STDERR=====
%Error: /tmp/tmpdlplnbix:1130400:107771: Too many preprocessor tokens on a line\
 (>40000); perhaps recursive `define
%Error: /tmp/tmpdlplnbix:885322:28: memory exhausted
885322 | msp2495_go_out ? 12'd2496 :
    |              ^
%Error: Cannot continue

At first I thought this could be an issue with trying to have too many commands run in parallel, but I get the exact same error message when I restrict the design to just 10 threads (which works successfully on smaller files). Here is one such program: DRB1-10pes.futil.txt

Any idea what might be causing this? Is it something I can fix by editing the .futil files?

susan-garry avatar Feb 13 '23 20:02 susan-garry

Thanks for the detailed repro instructions! We should totally get to the bottom of this. Here are some notes on attempting to understand the problem.

To summarize what's going on, the underlying problem is (vaguely) that the generated code is too big. Concretely, the symptom manifests in two ways:

  • When Calyx is built in debug mode, it crashes with a stack overflow (probably in Verilog emission, but it's not clear).
  • When Calyx is built in release mode, the Calyx compiler itself finishes just fine, but it emits Verilog code that Verilator cannot compile. That is, the Calyx compiler successfully generates Verilog; then, Verilator opens up that Verilog program and says "no way; the lines here are too big."

Proper solutions here include stuff like #1364, which may help the Calyx compiler generate Verilog that Verilator can actually compile. I think there may be other, deeper scalability issues we should think carefully about, so I want to emphasize that #1364 may not be the end of the story here.

Commands to Reproduce

This command isolates the Calyx compiler invocation:

cargo run --release -- -b verilog DRB1.futil > DRB1.sv

That's for running the release version. On my laptop, this takes 2:08 to produce the Verilog file. (Debug mode takes much longer in order to crash.) The only long pass that the compiler complains about is cell-share:

[WARN  calyx::pass_manager] cell-share: 70887ms

…that is, a bit more than half the compilation time is in that pass.

Here's a Verilator command that reproduces its crash:

verilator --trace DRB1.sv --cc -fno-inline

Again on my laptop, this takes 2:21 to eventually crash.

Tracking Down the Root Cause

While we also want Calyx to be able to handle huge programs like this, at the same time we would like to know if we can change the original code to avoid the problem even now. Understanding this root cause would both help us work around the issue right now and help identify the problem in Calyx itself. So let's start with the Verilog and trace things back to the relevant code.

On my most recent run, here's how Verilator indicated the code it doesn't like:

%Error: DRB1.sv:1542396:107771: Too many preprocessor tokens on a line (>40000); perhaps recursive `define
%Error: DRB1.sv:1212899:28: memory exhausted
1212899 |  msp2495_go_out ? 13'd2496 :
        |                            ^
%Error: Cannot continue

As a side note, the Verilog file is 1.5 million lines long, and this message indicates one that's 1.2 million lines into the file. Anyway, looking at that line, it's part of a very long ternary-expression cascade we're emitting:

...
 msp2493_go_out ? 13'd2494 :
 msp2494_go_out ? 13'd2495 :
 msp2495_go_out ? 13'd2496 :
 msp2496_go_out ? 13'd2497 :
 msp2497_go_out ? 13'd2498 :
...

It's surprisingly hard to navigate this file to find out where the big expression begins, but vim's :LogiPat command helped. The top and bottom of the expression look like:

assign depth_output_addr0 =
 msp_go_out ? 13'd0 :
 msp0_go_out ? 13'd1 :
 msp1_go_out ? 13'd2 :
 msp2_go_out ? 13'd3 :
...
 msp4950_go_out ? 13'd4951 :
 msp4951_go_out ? 13'd4952 :
 msp4952_go_out ? 13'd4953 :
 msp4953_go_out ? 13'd4954 : 13'd0;

In purely hardware terms, this expression is acting as a one-hot-to-binary encoder. It takes 4955 one-bit signals, expecting only one of them to be high, and produces a 13-bit binary number indicating which signal was high.

Anyway, going back to the original Calyx code, this comes from a series of groups that store into the depth_output memory:

    group store_depth0 {
      depth_output.addr0 = 13'd0;
      depth_output.write_data = depth1.out;
      depth_output.write_en = 1'd1;
      store_depth0[done] = depth_output.done;
    }
    group store_depth1 {
      depth_output.addr0 = 13'd1;
      depth_output.write_data = depth2.out;
      depth_output.write_en = 1'd1;
      store_depth1[done] = depth_output.done;
    }
...
    group store_depth4954 {
      depth_output.addr0 = 13'd4954;
      depth_output.write_data = depth4955.out;
      depth_output.write_en = 1'd1;
      store_depth4954[done] = depth_output.done;
    }

Those groups are created here: https://github.com/cucapra/pollen/blob/655c80b1f04dc76ca20617e82f201e9d76fc339b/pollen/depth/calyx_depth.py#L182

Namely, Pollen creates max_nodes of these groups: https://github.com/cucapra/pollen/blob/655c80b1f04dc76ca20617e82f201e9d76fc339b/pollen/depth/calyx_depth.py#L117

Conclusions for Pollen

So, here's one perhaps-obvious takeaway: Reducing max_nodes is one way to avoid the problem for now. That's not all that satisfying, but it does identify the knob that makes the difference between runnable and crash-inducing Pollen programs.

A longer-term recommendation is that time-multiplexing the hardware will help. By that, I mean that if we can make the amount of hardware (the number of groups) not proportional to max_nodes, then we won't need 4955 different groups to write into the hardware: we can have a fixed number $N$ that get each get reused $4955/N$ times.

Conclusions for Calyx

Unfortunately, the problem is in the Verilog backend, so splitting conditions like in #1364 will probably not help, and nor will disabling merge-assigns. Namely, the Verilog backend is what is currently responsible for coalescing many assignments into Verilog ternary expressions. The comment in the code is extremely clear: https://github.com/cucapra/calyx/blob/16be3a7d3c0afdc5b09c3fb7ad1f567012fc2f03/src/backend/verilog.rs#L475-L488

So the problem is not big assignment conditions in Calyx code; it's that the Verilog backend is taking many small assignments and turning them into a single huge line of Verilog. So instead, the correct fix here is changing the Verilog backend to break assignment groups into multiple lines. Perhaps by using if cascades instead of ternary operators—I'm not sure what the best alternative is.

HTH!

sampsyo avatar Feb 21 '23 19:02 sampsyo

Just wanted to also add a reverse pointer to #1370, which would probably be a very easy change that would make the Calyx side of things go much faster (and possibly prevent the crash in debug mode).

sampsyo avatar Feb 22 '23 15:02 sampsyo

Discussed the possibility of the following rewrites with @priyasrikumar:

x = a ? 1 :
    b ? 2 :
    c ? 3 :
    d ? 4 : 0;

// =>

p = a ? 1 :
    b ? 2 : 0;
q = c ? 3 :
    d ? 4 : 0;
x = a | b ? p :
    c | d ? q : 0;

// =>

p = ...;
q = ...;
c1 = a | b;
c2 = c | d;
x = c1 ? p :
    c2 ? q : 0;

The transformations trade-off the size of the ternary expression for bigger guards which can themselves be reduced by #1364.

rachitnigam avatar Feb 24 '23 19:02 rachitnigam