Toplevel `ref` cells interfere with static promotion
Branching off of #1603: one blocker for deprecating `@external` and adopting `ref`'s synthesis mode everywhere is that currently the version of Verilog generated by synthesis mode (externalizing all memories) incurs a performance cost.
Below is a comparison between (1) Calyx's current simulation mode, which emits `readmemh`/`writememh` in the generated Verilog (Calyx), and (2) using `ref` to externalize memories (CalyxRef). `Time [ms]` is the average Verilator simulation time in milliseconds, and `Cycles` is the number of cycles taken by the simulation. The cycle counts for CalyxRef are about 1.5-2x higher than those for Calyx, and that affects the simulation time as well.

This performance drop is most likely because of additional FSMs generated by the Calyx compiler to keep track of memory updates.
"Reproduction"/Trying out CalyxRef
You can run Calyx programs in CalyxRef by using fud2
from the fud2-refmem
branch:
```
cd <CALYX_REPO>
git pull
git checkout fud2-refmem
cargo build
( cd fud2 ; cargo build )
```
Once you have `fud2` set up, you can run CalyxRef by adding `--through calyx-to-verilog-refmem` to your `fud2` simulation command. For example,

```
fud2 examples/tutorial/language-tutorial-iterate.futil --to dat --through calyx-to-verilog-refmem -s sim.data=`pwd`/examples/tutorial/data.json
```

should give you:
```
{
  "cycles": 76,
  "memories": {
    "mem": [
      42
    ]
  }
}
```
whereas running Calyx without externalized memories (Calyx):

```
fud2 examples/tutorial/language-tutorial-iterate.futil --to dat --through calyx-to-verilog -s sim.data=`pwd`/examples/tutorial/data.json
```

gives you:
```
{
  "cycles": 37,
  "memories": {
    "mem": [
      42
    ]
  }
}
```
Thanks for the explanation! I'm not sure if there is anything to do about this. The `readmemh` and `writememh` functions hide the latency to read and write IO memories, and `ref` is just making it explicit. If you want to report only the compute part, you can subtract out the time spent reading and writing.

Anyways, this goes to show that real accelerators need to optimize more than just compute; memory movement often dominates!
Indeed; thanks, @ayakayorihiro, for the detailed reproduction instructions! I could even reproduce the effect for the simpler `language-tutorial-compute.futil`: 3 cycles vs. 5 cycles. I admit I got a little nerdsniped by this. I don't think it should be slower, and I wanted to know why it was. Relevantly, regarding @rachitnigam's comment:

> The `readmemh` and `writememh` functions hide the latency to read and write IO memories and `ref` is just making it explicit.
I don't think this is the case here. `readmemh` and `writememh` exist in both treatments. And the `ref` shouldn't add any cycles, I think: we're accessing a `comb_mem_d1` either locally or by reference, in the testbench. This makes me think something is wrong with the way we're compiling the `ref` in general, and it might not have to do with the testbench approach.
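For concreteness, here is a rough sketch of the two top-level declaration styles being compared (the cell name and `comb_mem_d1` parameters are illustrative, not copied from the tutorial program):

```
cells {
  // Current simulation mode: the memory lives inside the component, marked
  // @external, and the generated Verilog contains the readmemh/writememh
  // logic for it.
  @external mem = comb_mem_d1(32, 1, 1);
}
```

```
cells {
  // CalyxRef: the memory is declared by reference, instantiated in the
  // testbench, and passed in; the testbench owns readmemh/writememh.
  ref mem = comb_mem_d1(32, 1, 1);
}
```

In both cases the component body accesses `mem` the same way, which is why the extra cycles were surprising.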
Here are two VCDs I dumped from Icarus simulation under the two treatments: normal.vcd.txt refmem.vcd.txt

The thing to notice in the waveforms is that the `ref` version has two FSM registers, `fsm` and `fsm0`, while the "normal" one gets just one.
I dug into the way these two are compiled (going pass-by-pass through the compilation pipeline). Critically, it looks like static promotion failed for the `ref` version. Its control after promotion is:

```
seq {
  static<2> seq {
    read0;
    upd0;
  }
  write;
}
```
Whereas the "normal" version yields:

```
static<3> seq {
  read0;
  upd0;
  write0;
}
```
(To reproduce this, try

```
cargo run -- -p infer-data-path -p compile-invoke -p static-inference -p static-promotion examples/tutorial/language-tutorial-compute.futil
```

and add `-p external-to-ref` at the beginning to see the difference.)
In the end, the problem is that the program's `write` group is not inferred static when the underlying memory is a `ref`. That group looks like this, FWIW:

```
group write {
  mem.addr0 = 1'b0;
  mem.write_en = 1'b1;
  mem.write_data = val.out;
  write[done] = mem.done;
}
```
(The `read` group is always inferred static, probably because it's reading combinationally from a `comb_mem_d1`.)
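If inference did succeed here, the group would presumably be promoted to a one-cycle static group, roughly like this (a sketch based on the `static<3> seq` above, not actual compiler output):

```
static<1> group write {
  mem.addr0 = 1'b0;
  mem.write_en = 1'b1;
  mem.write_data = val.out;
  // no write[done]: the static<1> latency annotation replaces the done signal
}
```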
Anyway, it seems pretty clear what's going on: top-level `ref` memories, unlike top-level `@external` memories, are interfering with static inference. That seems like the bug to fix here.
Wow, thanks a lot for looking into this, @sampsyo! It's good to know that the problem was a bug rather than something fundamental about externalizing memories. If we can get this fixed, then there seem to be no big obstacles to always externalizing memories via `ref`?
Indeed, that would be awesome! Then we would have only one testbench where the `readmemh`/`writememh`s live, and we could remove a lot of intelligence from the compiler itself that currently deals with emitting them…