
In some benchmarks, resource usage differs significantly between the Yosys+ODIN flow and the ODIN-only flow

Open aman26kbm opened this issue 1 year ago • 9 comments

Expected Behaviour

The resource usage between the two flows should be similar.

Current Behaviour

For the CLSTM benchmark, resource usage differs greatly when we run with Yosys+ODIN compared to running ODIN only.

(screenshot: clstm resource-usage comparison)

The benchmark is: vtr-verilog-to-routing/clstm_like.small.v at master · aman26kbm/vtr-verilog-to-routing (github.com)

The arch file is: vtr-verilog-to-routing/k6FracN10LB_mem20K_complexDSP_customSB_22nm.xml at master · aman26kbm/vtr-verilog-to-routing (github.com)

We are running run_vtr_task with a fixed channel width of 300.

Update from Seyed on a thread: I found the problem related to the clstm benchmark; it actually comes from the Yosys coarse-grained synthesis commands. We lost some DFFs in submodules because they were optimized out, as the script requested. I will provide comprehensive information on the related thread; however, we need to re-run all benchmarks with the new script.

Possible Solution

Steps to Reproduce

Use the arch file and design file linked above

Context

Your Environment

  • VTR revision used:
  • Operating System and version:
  • Compiler version:

aman26kbm avatar Jul 24 '22 18:07 aman26kbm

@aman26kbm - thanks for raising this issue.

As I mentioned earlier, the Yosys-generated coarse-grained BLIF files miss a few internal connections in deep submodules (compared to the top module). I found that this issue can be resolved by adding the `expose -evert-dff` command before flattening the netlist.

Based on the Yosys manual (page 118), the `expose -evert-dff` command turns flip-flops into sets of inputs and outputs for a given module. With this command, the number of memory blocks matches between the Yosys standalone and Odin-II standalone synthesizers, but the number of input and output pins changes considerably.

I tried running the `expose -evert-dff` command on all submodules before flattening the design, then flattening everything into the top module; however, the result remained the same. Indeed, the command only resolves the issue when it is applied to all modules, including the top module. Exposing every DFF as a set of input/output pins is not desirable, as it greatly inflates the number of input/output pins.
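For reference, here is a minimal sketch of where `expose -evert-dff` would sit in a coarse-grained Yosys script. The command order and file names are illustrative only, not the actual VTR script:

```
read_verilog clstm_like.small.v
hierarchy -check -top <top_module>
proc; opt

# expose -evert-dff turns every DFF in the selected modules into
# module input/output ports; as noted above, it only helps when
# applied to all modules, including the top, which inflates pin count
expose -evert-dff

flatten
opt
write_blif coarse_grained.blif
```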

@alirezazd was assigned to this issue last week. He has been working on the Yosys coarse-grained script to figure out how we can keep the missing DFFs connected during coarse-grained synthesis while leaving the number of inputs and outputs unchanged. @alirezazd, please provide updates on this thread regularly, as we are on a tight deadline.

sdamghan avatar Jul 24 '22 19:07 sdamghan

Thanks for the update, Seyed. Appreciate it.

aman26kbm avatar Jul 24 '22 20:07 aman26kbm

Hi Seyed, I compared the resource usage between the odin-only and odin+yosys flows for other benchmarks, and I see one case with a significant difference: the spmv benchmark (https://github.com/aman26kbm/vtr-verilog-to-routing/blob/master/vtr_flow/benchmarks/verilog/koios/spmv.v).

ODIN-only shows 32 multipliers, but Yosys+ODIN shows only 5. The difference appears in vpr.out, not in odin.out. I am attaching tarballs with the results from both cases. Please take a look.

yosodin_spmv.tar.gz odin_spmv.tar.gz

aman26kbm avatar Jul 27 '22 03:07 aman26kbm

Hey Seyed, I looked at the spmv design in some more detail. Here are some observations that may help debug this issue.

The number of adders, multipliers, and RAMs in odin.out matches between the odin-only and odin+yosys flows. But in vpr.out, the counts of adders, multipliers, and RAMs in the odin+yosys flow are lower than in the odin-only flow.

Looking at the warnings in vpr.out, many nets are identified as constant-zero generators. That turns out to be because those nets are indeed tied to 0 (a `.names` with a constant-0 driver) in the pre-vpr.blif file. Tracing back, I see that this happens during the abc stage.

The BLIF file after synthesis (in either the odin-only or odin+yosys flow) has those nets connected to something meaningful; the ones I looked at were outputs of flip-flops. But in the BLIF file after abc, they are tied to 0.
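For illustration, this is what such a net looks like in BLIF before and after the optimization. The net name `net_a` and signal names are hypothetical, chosen only to show the shape of the two constructs:

```
# before abc: net_a is the output of a rising-edge flip-flop
.latch d_in net_a re clk 0

# after abc: the same net appears as a constant-0 generator
# (a .names with an empty cover evaluates to constant 0)
.names net_a
```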

I then tried to identify which part of the design the problem comes from. I can recreate the problem when I use the "bvb" module as the top, so you can use that for debugging.

Also, please use the latest version of the design (https://github.com/aman26kbm/vtr-verilog-to-routing/blob/master/vtr_flow/benchmarks/verilog/koios/spmv.v), because I fixed some minor issues that were causing warnings, and removed some other things that I thought could be contributing (like async resets).

aman26kbm avatar Aug 07 '22 19:08 aman26kbm

Great @aman26kbm - these are helpful findings; let us go through the details. I will provide more updates once we find the root cause.

Update: @aman26kbm - I double-checked the design and ran both the Yosys+Odin-II and Odin-II synthesizers with different top modules chosen from spmv. Based on my observations, the BVB and Big_Channel modules result in the same number of memory hard blocks with both synthesizers. However, the fetcher module does not; it ends up with 12 fewer memory blocks using Yosys+Odin-II (mem-slices: 120) than with Odin-II (mem-slices: 132).

The results I obtained for all modules are as follows. It would be good if you could look at the fetcher module while I investigate the DSP blocks further. I saw that many signals in fetcher were removed by Yosys during the Yosys+Odin-II elaboration phase. Since we previously had some optimized-out memory blocks that were driving DSP blocks, I think these removed signals are worth checking. For reference, please look at elaboration.yosys.log in the Top_Fetcher_Yosys+Odin-II directory.

| Synthesizer | Top Module | VPR Snippet |
| --- | --- | --- |
| Odin-II | BVB | Top_BVB_Odin-II.tar.gz |
| Yosys+Odin-II | BVB | Top_BVB_Yosys+Odin-II.tar.gz |
| Odin-II | Fetcher | Top_Fetcher_Odin-II.tar.gz |
| Yosys+Odin-II | Fetcher | Top_Fetcher_Yosys+Odin-II.tar.gz |
| Odin-II | Big_Channel | Top_BigChannel_Odin-II.tar.gz |
| Yosys+Odin-II | Big_Channel | Top_BigChannel_Yosys+Odin-II.tar.gz |
| Odin-II | SPMV | Top_SPMV_Odin-II.tar.gz |
| Yosys+Odin-II | SPMV | Top_SPMV_Yosys+Odin-II.tar.gz |

sdamghan avatar Aug 08 '22 11:08 sdamghan

Somehow I didn't get a notification for the update.

Anyway, I looked at the attached files for the case where Fetcher is defined as the top module. The difference in memory usage seems to be due only to different packing; the number of memory slices is the same. Pasting some contents below:

ODIN only, in odin.out:

```
Total Logical Memory Blocks = 99
Total Logical Memory bits = 399360
Max Memory Width = 8
Max Memory Depth = 16384

Number of <MEMORY> node:                  960
```

In vpr.out:

```
memory               : 132
 mem_2048x10_sp      : 36
  memory_slice       : 192
 mem_1024x20_dp      : 96
  memory_slice       : 768
```

Yosys+ODIN, in odin.out:

```
Total Logical Memory Blocks = 99
Total Logical Memory bits = 399360
Max Memory Width = 8
Max Memory Depth = 16384

Number of <MEMORY> node:                  960
```

In vpr.out:

```
memory               : 120
 mem_2048x10_sp      : 24
  memory_slice       : 192
 mem_1024x20_dp      : 96
  memory_slice       : 768
```
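The packing claim can be cross-checked with the figures in the vpr.out snippets above: both flows use 960 memory slices in total, and only the number of physical memory blocks they are packed into differs (132 vs 120). A small sketch of that arithmetic:

```python
# Cross-check (editor's sketch): both flows pack the same number of
# memory slices, just into different numbers of physical blocks.
# Figures are copied from the vpr.out snippets above.
odin_only = {
    "mem_2048x10_sp": (36, 192),  # (physical blocks, memory slices)
    "mem_1024x20_dp": (96, 768),
}
yosys_odin = {
    "mem_2048x10_sp": (24, 192),
    "mem_1024x20_dp": (96, 768),
}

def totals(flow):
    """Sum physical blocks and memory slices over all block types."""
    blocks = sum(b for b, _ in flow.values())
    slices = sum(s for _, s in flow.values())
    return blocks, slices

print(totals(odin_only))   # (132, 960)
print(totals(yosys_odin))  # (120, 960)
```

So the slice counts match (960 on both sides) and only the block packing differs, consistent with the observation above.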

aman26kbm avatar Aug 10 '22 05:08 aman26kbm

A few other things to mention that could be relevant:

  1. There are some ROMs in the design; they are RAMs with write_enable and data_in tied to 0. Do ODIN and ODIN+Yosys behave differently for ROMs? In another design, Andrew saw that Quartus optimized the block away because it assumed everything was 0, since the input was 0.
  2. There is one multiplier in the Channel module, which is instantiated 32 times (in a module called Big_Channel), leading to an expected multiplier count of 32. When I run with Big_Channel as the top, both cases (odin and odin+yosys) show 32 multipliers.

aman26kbm avatar Aug 10 '22 05:08 aman26kbm

I converted the ROMs into RAMs (brought addr, write_en, and din out to the top), but I still see the same behavior. That is, with Yosys+ODIN we see only 5 multipliers, but with ODIN we see 32. :(

aman26kbm avatar Aug 10 '22 06:08 aman26kbm

> A few other things to mention that could be relevant:
>
>   1. There are some ROMs in the design. They are RAMs with write_enable and data_in tied to 0. Do ODIN and ODIN+Yosys have different behavior for ROMs? In another design, Andrew saw that Quartus was optimizing out the block because it assumed everything was 0 because the input was 0.
>   2. There is 1 multiplier in the module Channel, which then gets instantiated 32 times (module called Big_Channel) leading to an expected multiplier count of 32. When I run with the Big_Channel as top, then I see both case (odin and odin+yosys) have 32 multipliers.

There is no optimization or special behaviour for ROMs in either Odin-II or Yosys+Odin-II; both synthesizers treat memory blocks exactly as the user defined them in the HDL file. For example, the following code results, under both Odin-II and Yosys+Odin-II, in a single-port RAM whose enable and data signals are connected to gnd:

```verilog
module spram_instance (
    address,
    value_in,
    we,
    clock,
    value_out
);

    parameter WIDTH = 16;  // Bit width
    parameter DEPTH = 8;   // Bit depth

    /* Input Declaration */
    input  [DEPTH-1:0] address;
    input  [WIDTH-1:0] value_in;
    input              we;
    input              clock;

    /* Output Declaration */
    output [WIDTH-1:0] value_out;

    defparam inst1.ADDR_WIDTH = DEPTH;
    defparam inst1.DATA_WIDTH = WIDTH;
    single_port_ram inst1 (
        .addr ( address ),
        .data ( 0 ),
        .we   ( 0 ),
        .clk  ( clock ),
        .out  ( value_out )
    );

endmodule
```

sdamghan avatar Aug 10 '22 14:08 sdamghan