RTL ConvolutionInputGenerator
Overview
Adds an RTL Convolution Input Generator / Sliding Window Generator (SWG) that aims to replace all 16 "ConvolutionInputGenerator_*" function variants in finn-hlslib, extend functionality, and improve resource & latency efficiency.
Depending on the folding configuration, HDL code for one of two implementation styles is generated via (System-)Verilog templates.
- Out width <= in width: Use an addressable cyclic buffer ("default").
- Out width > in width: Use a combination of shift registers and cyclic line buffers ("parallel").
Parallelism is controlled via three attributes: `SIMD`, `parallel_window`, and `M`. This results in the following operating modes:
SIMD | parallel_window | M | MMVin | MMVout | Impl. Style | Notes |
---|---|---|---|---|---|---|
C | True | M | M | M*K | parallel | Future PR alongside MVAU/VVAU extension |
C | True | 1 | 1 | K | parallel | |
C | False | 1 | 1 | 1 | default | |
<C | False | 1 | 1 | 1 | default | |
(where C = #channels, K = k_h*k_w = #kernel elements, and MMV = #samples in parallel)
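The mapping in the table can be sketched as a small helper. This is only an illustration derived from the table and the width rule above; the function name `swg_operating_mode` and its argument names are hypothetical and not part of FINN. Widths are counted in elements per cycle (`SIMD * MMV`), which assumes the same bit width on input and output.

```python
def swg_operating_mode(channels, k_h, k_w, simd, parallel_window, m=1):
    """Return MMVin, MMVout and implementation style for one SWG configuration.

    Hypothetical helper that mirrors the operating-mode table; not a FINN API.
    """
    k = k_h * k_w  # K = number of kernel elements

    if parallel_window:
        # The table lists parallel_window=True only together with SIMD == C.
        if simd != channels:
            raise ValueError("parallel_window=True is only listed with SIMD == C")
        mmv_in, mmv_out = m, m * k
    else:
        # The table lists parallel_window=False only with M == 1 and SIMD <= C.
        if m != 1:
            raise ValueError("M > 1 is only listed with parallel_window=True")
        if not 1 <= simd <= channels:
            raise ValueError("SIMD must be between 1 and C")
        mmv_in = mmv_out = 1

    # Width rule: out width <= in width -> "default", otherwise "parallel".
    # For K == 1 the widths are equal, so this rule yields "default".
    in_elems = simd * mmv_in
    out_elems = simd * mmv_out
    style = "default" if out_elems <= in_elems else "parallel"

    return {"mmv_in": mmv_in, "mmv_out": mmv_out, "impl_style": style}


if __name__ == "__main__":
    # 3x3 kernel, 64 channels, fully parallel window
    print(swg_operating_mode(64, 3, 3, simd=64, parallel_window=True))
    # Same layer, folded to SIMD=32 without a parallel window
    print(swg_operating_mode(64, 3, 3, simd=32, parallel_window=False))
```

The explicit `ValueError` branches cover combinations that the table does not list; whether the actual implementation rejects them is not stated here.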
Despite the RTL implementation, the layer is provided as an HLSCustomOp (!). Use the `InferConvInpGen(use_rtl_variant=True)` transformation to try it (defaults to `False` for now).
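A minimal sketch of how the flag might be passed. The wrapper `apply_conv_inp_gen` is a hypothetical helper, and the import path in the trailing comment is an assumption about where `InferConvInpGen` lives in the FINN tree; only the `use_rtl_variant` argument is taken from this PR.

```python
def apply_conv_inp_gen(model, transformation_cls, use_rtl_variant=True):
    """Apply an InferConvInpGen-style transformation to a model wrapper.

    `model` is expected to expose `.transform(...)` (as FINN's model wrapper
    does) and `transformation_cls` to accept `use_rtl_variant`.
    """
    return model.transform(transformation_cls(use_rtl_variant=use_rtl_variant))


# Intended use inside a FINN environment (module path is an assumption):
#   from finn.transformation.fpgadataflow.convert_to_hls_layers import InferConvInpGen
#   model = apply_conv_inp_gen(model, InferConvInpGen)               # RTL SWG
#   model = apply_conv_inp_gen(model, InferConvInpGen, False)        # HLS SWG
```

Taking the transformation class as an argument keeps the sketch importable outside a FINN installation; in practice one would call `model.transform(InferConvInpGen(use_rtl_variant=True))` directly.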
Tests
New:
- test_fpgadataflow_slidingwindow_rtl (unit test)
Integrated into existing (rtlsim-based) tests:
- test_convert_to_hls_conv_layer
- test_convert_to_hls_1d_conv_layer
External/manual tests:
- End-to-end flow and HW execution for "bnn_pynq" (cnv), MobileNetV1 and VGG10
- Mass rtlsim & synthesis for testing resource/cycle estimates, utilization, latency, and timing (collected here: insert link)
Benchmark
Exemplary SWG resource consumption (combined) for MobileNetV1 benchmark (finn-examples):
Memory mode | LUT (HLS) | BRAM36 (HLS) | LUT (RTL) | BRAM36 (RTL) |
---|---|---|---|---|
Distributed | 59639 | 0 | 36687 | 0 |
Block | 4507 | 70.5 | 3100 | 49.5 |
The same for a fully-unrolled VGG10 (1D):
Memory mode | LUT (HLS) | FF (HLS) | LUT (RTL) | FF (RTL) |
---|---|---|---|---|
Distributed | 8600 | 11514 | 448 | 3612 |
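For a quick read of the two tables, the relative savings of RTL over HLS can be computed directly from the reported numbers. The snippet below is only a convenience calculation over the values listed above; the helper name `reduction_pct` is made up for this illustration.

```python
def reduction_pct(hls, rtl):
    """Relative reduction of the RTL value versus the HLS value, in percent."""
    return 100.0 * (hls - rtl) / hls


# (benchmark, memory mode, resource) -> (HLS, RTL), copied from the tables above
results = {
    ("MobileNetV1", "Distributed", "LUT"): (59639, 36687),
    ("MobileNetV1", "Block", "LUT"): (4507, 3100),
    ("MobileNetV1", "Block", "BRAM36"): (70.5, 49.5),
    ("VGG10 (1D)", "Distributed", "LUT"): (8600, 448),
    ("VGG10 (1D)", "Distributed", "FF"): (11514, 3612),
}

if __name__ == "__main__":
    for (bench, mode, res), (hls, rtl) in results.items():
        print(f"{bench:12s} {mode:12s} {res:7s} {reduction_pct(hls, rtl):5.1f} % lower")
```

This works out to roughly 38.5 % and 31.2 % fewer LUTs and 29.8 % fewer BRAM36 for MobileNetV1, and 94.8 % fewer LUTs and 68.6 % fewer FFs for the fully-unrolled VGG10.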
Depends on
- [x] https://github.com/Xilinx/finn/pull/635
This is still a work-in-progress, required TODOs before merge:
- [x] Additional testing
- [ ] Update cycle & resource estimation
- [ ] Ensure proper auto folding
- [ ] Cleanup
- [ ] Documentation
Hi @fpjentzsch, could you change the target branch of this merge to `dev`, please?
Hi @preusser, @maltanar, @auphelia, as discussed, I moved the development of the dynamic FM sizing and parallel implementation mode to different branches for now so we can merge this part first.
Hi @preusser, @maltanar, thanks for your comments and tips! I incorporated them in the above commit and marked those that should not require further discussion as resolved.
This looks nice! Do you have numbers comparing resource utilization/latency between the C++ and RTL implementations?
Hi @mgehre-amd, thanks, I plan to (re-)generate some experiments for a comprehensive HLS/RTL comparison once I find some time. I'll get back to you and link the results here.