RTL ConvolutionInputGenerator

Open fpjentzsch opened this issue 2 years ago • 1 comments

Overview

Adds an RTL Convolution Input Generator / Sliding Window Generator (SWG) that aims to replace all 16 "ConvolutionInputGenerator_*" function variants in finn-hlslib, extend functionality, and improve resource & latency efficiency.
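For readers unfamiliar with what a sliding window generator emits, the following is a minimal functional reference in plain Python. It is a sketch of the general im2col idea only, not the RTL design in this PR: the function name `sliding_windows`, the nested-list layout, and the (k_h, k_w, channel) element ordering are assumptions made for illustration, and padding and dilation are left out.

```python
def sliding_windows(fmap, k_h, k_w, stride=1):
    """Return one flattened window per output position, in row-major order.

    fmap is a list of rows; each row is a list of pixels; each pixel is a
    list of C channel values. Each window holds k_h * k_w * C values with
    channels innermost (assumed ordering).
    """
    height, width = len(fmap), len(fmap[0])
    windows = []
    for y in range(0, height - k_h + 1, stride):
        for x in range(0, width - k_w + 1, stride):
            window = []
            for dy in range(k_h):
                for dx in range(k_w):
                    # Append all channels of the pixel at (y + dy, x + dx).
                    window.extend(fmap[y + dy][x + dx])
            windows.append(window)
    return windows


# 3x3 single-channel feature map holding the values 0..8.
fmap = [[[3 * row + col] for col in range(3)] for row in range(3)]
windows = sliding_windows(fmap, k_h=2, k_w=2)
```

With a 2x2 kernel and stride 1, the 3x3 input yields four windows of four elements each, which is the stream a downstream matrix-vector unit would consume.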

Depending on the folding configuration, HDL code for one of two implementation styles is generated via (System-)Verilog templates.

  • Out width <= in width: Use an addressable cyclic buffer ("default").
  • Out width > in width: Use a combination of shift registers and cyclic line buffers ("parallel").

Parallelism is controlled via three attributes: SIMD, parallel_window, and M. This results in the following operating modes:

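| SIMD | parallel_window | M | MMVin | MMVout | Impl. Style | Notes |
|------|-----------------|---|-------|--------|-------------|-------|
| C | True | M | M | M*K | parallel | Future PR alongside MVAU/VVAU extension |
| C | True | 1 | 1 | K | parallel | |
| C | False | 1 | 1 | 1 | default | |
| <C | False | 1 | 1 | 1 | default | |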

(where C = #channels, K = k_h*k_w = #kernel elements, and MMV = #samples in parallel)
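The operating modes can be restated as a small lookup function. This is a sketch derived only from the mode table and the width rule in this description; the names `SwgMode` and `swg_mode` are hypothetical, and the assumption that stream width scales with SIMD times MMV is mine rather than something stated in the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwgMode:
    mmv_in: int      # samples consumed in parallel
    mmv_out: int     # samples produced in parallel
    impl_style: str  # "default" or "parallel"


def swg_mode(channels, k_h, k_w, simd, parallel_window, m=1):
    """Derive MMVin, MMVout and the implementation style for one configuration."""
    k = k_h * k_w  # K = number of kernel elements
    if not 1 <= simd <= channels:
        raise ValueError("SIMD must be between 1 and C")
    if m < 1:
        raise ValueError("M must be at least 1")
    if parallel_window:
        if simd != channels:
            raise ValueError("parallel_window requires SIMD == C")
        mmv_in, mmv_out = m, m * k
    else:
        if m != 1:
            raise ValueError("M > 1 is only listed together with parallel_window")
        mmv_in, mmv_out = 1, 1
    # Assumption: width is proportional to SIMD * MMV, so comparing MMV values
    # is equivalent to comparing out width with in width.
    impl_style = "parallel" if mmv_out > mmv_in else "default"
    return SwgMode(mmv_in, mmv_out, impl_style)


full_window = swg_mode(channels=64, k_h=3, k_w=3, simd=64, parallel_window=True)
folded = swg_mode(channels=64, k_h=3, k_w=3, simd=16, parallel_window=False)
```

Note that under this width rule a 1x1 kernel with parallel_window=True degenerates to the "default" style, since output and input widths are then equal.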

Although the implementation is RTL, the layer is exposed as an HLSCustomOp (!). To try it, apply the InferConvInpGen(use_rtl_variant=True) transformation; use_rtl_variant defaults to False for now.
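A minimal sketch of opting in from a build script is shown below. Only `InferConvInpGen(use_rtl_variant=True)` comes from this PR; the helper name `apply_rtl_swg` is hypothetical, and the import path is an assumption based on where FINN kept its HLS conversion transformations at the time, so it may differ in other versions.

```python
def apply_rtl_swg(model):
    """Return a copy of `model` with the RTL variant of the SWG inferred.

    `model` is expected to be a FINN/QONNX ModelWrapper whose convolutions
    have already been lowered to Im2Col + MatMul.
    """
    # Assumed module path (not stated in this PR). Imported lazily so that
    # this helper can be defined without FINN being installed.
    from finn.transformation.fpgadataflow.convert_to_hls_layers import (
        InferConvInpGen,
    )

    # use_rtl_variant defaults to False, so the RTL SWG must be requested
    # explicitly.
    return model.transform(InferConvInpGen(use_rtl_variant=True))
```

Keeping the flag in one helper makes it easy to switch back to the HLS variant while the RTL path is still being validated.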

Tests

New:

  • test_fpgadataflow_slidingwindow_rtl (unit test)

Integrated into existing (rtlsim-based) tests:

  • test_convert_to_hls_conv_layer
  • test_convert_to_hls_1d_conv_layer

External/manual tests:

  • End-to-end flow and HW execution for "bnn_pynq" (cnv), MobileNetV1 and VGG10
  • Mass rtlsim & synthesis for testing resource/cycle estimates, utilization, latency, and timing (collected here: insert link)

Benchmark

Example SWG resource consumption (combined) for the MobileNetV1 benchmark (finn-examples):

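| Memory mode | LUT (HLS) | BRAM36 (HLS) | LUT (RTL) | BRAM36 (RTL) |
|-------------|-----------|--------------|-----------|--------------|
| Distributed | 59639 | 0 | 36687 | 0 |
| Block | 4507 | 70.5 | 3100 | 49.5 |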

The same for a fully-unrolled VGG10 (1D):

| Memory mode | LUT (HLS) | FF (HLS) | LUT (RTL) | FF (RTL) |
|-------------|-----------|----------|-----------|----------|
| Distributed | 8600 | 11514 | 448 | 3612 |
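To put the two benchmark tables in relative terms, the short script below computes the percentage reduction of RTL versus HLS for each reported resource. The input numbers are copied from the tables; the helper name `reduction_pct` and the dictionary layout are illustrative only.

```python
def reduction_pct(hls, rtl):
    """Relative saving of RTL over HLS in percent, rounded to one decimal."""
    return round(100.0 * (hls - rtl) / hls, 1)


# (HLS, RTL) pairs taken from the benchmark tables.
results = {
    "MobileNetV1, distributed, LUT": (59639, 36687),
    "MobileNetV1, block, LUT": (4507, 3100),
    "MobileNetV1, block, BRAM36": (70.5, 49.5),
    "VGG10 (1D), distributed, LUT": (8600, 448),
    "VGG10 (1D), distributed, FF": (11514, 3612),
}

savings = {name: reduction_pct(hls, rtl) for name, (hls, rtl) in results.items()}
for name, pct in savings.items():
    print(f"{name}: {pct}% fewer with RTL")
```

For these figures the RTL variant needs roughly 30% to 40% fewer resources on MobileNetV1, and about 95% fewer LUTs and 69% fewer FFs on the fully-unrolled VGG10.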

Depends on

  • [x] https://github.com/Xilinx/finn/pull/635

This is still a work in progress. Required TODOs before merge:

  • [x] Additional testing
  • [ ] Update cycle & resource estimation
  • [ ] Ensure proper auto folding
  • [ ] Cleanup
  • [ ] Documentation

fpjentzsch, Jun 13 '22 20:06

Hi @fpjentzsch, could you please change the target branch of this PR to dev?

auphelia, Jul 26 '22 12:07

Hi @preusser, @maltanar, @auphelia, as discussed, I moved the development of the dynamic FM sizing and parallel implementation mode to different branches for now so we can merge this part first.

fpjentzsch, Sep 09 '22 19:09

Hi @preusser, @maltanar, thanks for your comments and tips! I incorporated them in the above commit and marked those that should not require further discussion as resolved.

fpjentzsch, Sep 16 '22 13:09

This looks nice! Do you have numbers comparing resource utilization/latency between the C++ and RTL implementations?

mgehre-amd, Sep 19 '22 07:09

Hi @mgehre-amd, thanks, I plan to (re-)generate some experiments for a comprehensive HLS/RTL comparison once I find some time. I'll get back to you and link the results here.

fpjentzsch, Sep 20 '22 12:09