finn RTL ConvolutionInputGenerator

Overview

Adds an RTL Convolution Input Generator / Sliding Window Generator (SWG) that aims to replace all 16 "ConvolutionInputGenerator_*" function variants in finn-hlslib, extend functionality, and improve resource & latency efficiency.

Depending on the folding configuration, HDL code for one of two implementation styles is generated via (System-)Verilog templates.

Out width <= in width: Use an addressable cyclic buffer ("default").
Out width > in width: Use a combination of shift registers and cyclic line buffers ("parallel").

Parallelism is controlled via three attributes: SIMD, parallel_window, and M. This results in the following operating modes:

SIMD	parallel_window	M	MMVin	MMVout	Impl. Style	Notes
C	True	M	M	M*K	parallel	Future PR alongside MVAU/VVAU extension
C	True	1	1	K	parallel
C	False	1	1	1	default
<C	False	1	1	1	default

(where C = #channels, K = k_h*k_w = #kernel elements, and MMV = #samples in parallel)
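The mode table above can be read as a small lookup from (SIMD, parallel_window, M) to implementation style and MMV. The following is a minimal sketch of that mapping for illustration only; the names `SwgMode` and `swg_mode` are hypothetical and are not part of FINN, and the validity checks simply mirror the rows listed in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwgMode:
    """Hypothetical container for one row of the operating-mode table."""
    impl_style: str  # "default" or "parallel"
    mmv_in: int      # samples consumed in parallel
    mmv_out: int     # samples produced in parallel


def swg_mode(channels, k_h, k_w, simd, parallel_window, m=1):
    """Map the three parallelism attributes to a table row (illustrative only)."""
    if not 1 <= simd <= channels:
        raise ValueError("SIMD must be between 1 and C")
    k = k_h * k_w  # K = number of kernel elements
    if parallel_window:
        # The table lists parallel_window=True only together with SIMD == C.
        if simd != channels:
            raise ValueError("parallel_window=True is listed only for SIMD == C")
        return SwgMode("parallel", m, m * k)
    # The table lists M > 1 only in combination with parallel_window=True.
    if m != 1:
        raise ValueError("M > 1 is listed only for parallel_window=True")
    return SwgMode("default", 1, 1)


if __name__ == "__main__":
    print(swg_mode(channels=64, k_h=3, k_w=3, simd=64, parallel_window=True))
    print(swg_mode(channels=64, k_h=3, k_w=3, simd=16, parallel_window=False))
```

In this reading, the "parallel" rows are exactly those where MMVout exceeds MMVin, which matches the "out width > in width" condition given earlier.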

Despite the RTL implementation, the layer is provided as an HLSCustomOp (!). Use the InferConvInpGen(use_rtl_variant=True) transformation to try it (defaults to False for now).

Tests

New:

test_fpgadataflow_slidingwindow_rtl (unit test)

Integrated into existing (rtlsim-based) tests:

test_convert_to_hls_conv_layer
test_convert_to_hls_1d_conv_layer

External/manual tests:

End-to-end flow and HW execution for "bnn_pynq" (cnv), MobileNetV1 and VGG10
Mass rtlsim & synthesis for testing resource/cycle estimates, utilization, latency, and timing (collected here: insert link)

Benchmark

Exemplary SWG resource consumption (combined) for MobileNetV1 benchmark (finn-examples):

Memory mode	LUT (HLS)	BRAM36 (HLS)	LUT (RTL)	BRAM36 (RTL)
Distributed	59639	0	36687	0
Block	4507	70.5	3100	49.5

The same for a fully-unrolled VGG10 (1D):

Memory mode	LUT (HLS)	FF (HLS)	LUT (RTL)	FF (RTL)
Distributed	8600	11514	448	3612
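To make the two tables easier to compare, the relative savings of RTL over HLS can be derived directly from the numbers above. This is a small sketch using only the figures reported in this PR; the helper name `reduction_pct` and the dictionary layout are illustrative.

```python
def reduction_pct(hls, rtl):
    """Relative saving of the RTL variant versus HLS, in percent."""
    return 100.0 * (hls - rtl) / hls


# (HLS, RTL) pairs copied from the benchmark tables above.
results = {
    "MobileNetV1, distributed, LUT": (59639, 36687),
    "MobileNetV1, block, LUT": (4507, 3100),
    "MobileNetV1, block, BRAM36": (70.5, 49.5),
    "VGG10 (1D), distributed, LUT": (8600, 448),
    "VGG10 (1D), distributed, FF": (11514, 3612),
}

if __name__ == "__main__":
    for name, (hls, rtl) in results.items():
        print(f"{name}: {reduction_pct(hls, rtl):.1f}% fewer resources with RTL")
```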

Depends on

[x] https://github.com/Xilinx/finn/pull/635

This is still a work in progress. Required TODOs before merge:

[x] Additional testing
[ ] Update cycle & resource estimation
[ ] Ensure proper auto folding
[ ] Cleanup
[ ] Documentation

Jun 13 '22 20:06 fpjentzsch

Hi @fpjentzsch, could you change the target branch for the merge to target dev, please?

Jul 26 '22 12:07 auphelia

Hi @preusser, @maltanar, @auphelia, as discussed, I moved the development of the dynamic FM sizing and parallel implementation mode to different branches for now so we can merge this part first.

Sep 09 '22 19:09 fpjentzsch

Hi @preusser, @maltanar, thanks for your comments and tips! I incorporated them in the above commit and marked those that should not require further discussion as resolved.

Sep 16 '22 13:09 fpjentzsch

This looks nice! Do you have numbers comparing resource utilization/latency between the C++ and RTL implementations?

Sep 19 '22 07:09 mgehre-amd

Hi @mgehre-amd, thanks, I plan to (re-)generate some experiments for a comprehensive HLS/RTL comparison once I find some time. I'll get back to you and link the results here.

Sep 20 '22 12:09 fpjentzsch
