New method for automatically setting folding factors
Background: https://github.com/Xilinx/finn/issues/297
This transformation sets the folding factors for a network and guarantees that constraints are satisfied, both at the node level and between nodes. It has been tested against the CNV networks and MobileNet-v1, and the resulting folding factors look good. The algorithm maps out every "legal" folding configuration for each node and iteratively updates the slowest nodes to increase overall network throughput, hence the name `SetFoldingExhaustive`.
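To make the iterative idea concrete, here is a minimal sketch (not the actual `SetFoldingExhaustive` code; the node names, cycle counts, and config shapes below are made up for illustration) of enumerating each node's legal folding configurations and repeatedly advancing the current bottleneck:

```python
# Minimal sketch of the bottleneck-driven search described above. Every node
# keeps a list of its legal folding configs, sorted from slowest/cheapest to
# fastest, and the slowest node is advanced until the target cycle count is
# met or no node can be sped up further.

def set_folding_exhaustive(legal_configs, target_cycles):
    """legal_configs: dict node_name -> list of (cycles_per_frame, config),
    sorted by descending cycles_per_frame (index 0 = slowest / cheapest)."""
    chosen = {name: 0 for name in legal_configs}  # start every node at its cheapest config

    def cycles(name):
        return legal_configs[name][chosen[name]][0]

    while True:
        bottleneck = max(legal_configs, key=cycles)
        if cycles(bottleneck) <= target_cycles:
            break  # throughput target met
        if chosen[bottleneck] + 1 >= len(legal_configs[bottleneck]):
            break  # slowest node is already at its fastest legal config
        chosen[bottleneck] += 1  # speed up the bottleneck by one step

    return {name: legal_configs[name][chosen[name]][1] for name in chosen}


# Toy example: two layers with (cycles_per_frame, {"PE": ..., "SIMD": ...}) options.
configs = {
    "StreamingFCLayer_Batch_0": [(8192, {"PE": 1, "SIMD": 1}),
                                 (4096, {"PE": 2, "SIMD": 1}),
                                 (2048, {"PE": 2, "SIMD": 2})],
    "StreamingFCLayer_Batch_1": [(4096, {"PE": 1, "SIMD": 1}),
                                 (1024, {"PE": 4, "SIMD": 1})],
}
print(set_folding_exhaustive(configs, target_cycles=2048))
```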
Caveats:

- Included a small dictionary with `boards` to briefly show how resource limits can be taken into account during optimization (see the sketch after this list). Will most likely need something like https://github.com/Xilinx/finn-experimental/blob/main/src/finn/util/platforms.py in order to have a more rigid structure around the boards. If a board is not passed, the target cycle count alone will serve as the target for the optimization.
- Nodes that do not have `lut_estimation()` - for these nodes, I made the assumption that they have insignificant resource usage and that they would be updated during optimization if necessary. It would be possible to model their resource usage post-synthesis for different ranges of folding configs and then use that relationship to estimate resources, but maybe there is a reason for them not to have a `lut_estimation()` method?
- As with `AllocateResources`, the algorithm makes use of a downscaling factor. Rather than 0.7, it is set to 0.85. It may need to be set lower if the point above does not hold and those nodes have significant resource usage.
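As a rough illustration of the board-dictionary idea from the first caveat (the board names and LUT counts below are examples only and do not reproduce the dictionary in this PR):

```python
# Illustrative sketch of how a small boards dictionary plus a downscaling
# factor can bound the optimization; see finn-experimental's platforms.py for
# a more rigid structure around board definitions.

BOARDS = {
    "Pynq-Z1": {"LUT": 53200},
    "ZCU104":  {"LUT": 230400},
    "U250":    {"LUT": 1728000},
}

LUT_DOWNSCALE = 0.85  # leave headroom, analogous to the 0.7 used by AllocateResources

def lut_budget(board_name):
    """Return the usable LUT budget for a board, or None if no board is given,
    in which case only the target cycle count constrains the optimization."""
    if board_name is None:
        return None
    return int(BOARDS[board_name]["LUT"] * LUT_DOWNSCALE)

print(lut_budget("Pynq-Z1"))  # 45220
```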
For future work in this space, I think it would be interesting to start taking more resources into account (BRAM, DSPs, URAM), as these resources are not being set automatically in most cases. I also believe this setup would be useful here, since the data structure containing the possible folding factors for a node also holds the estimated resource use for each folding configuration. The algorithm could thus be extended to make use of these resource estimates as well.
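Purely as an illustration of how that data structure might look if the estimates were extended to BRAM, DSPs, and URAM (the field names below are hypothetical, not the ones used in the PR):

```python
# Hypothetical per-node folding data structure with resource estimates beyond
# LUTs, so the optimizer can reject candidates that exceed a board budget.

folding_options = {
    "StreamingFCLayer_Batch_0": [
        {"PE": 1, "SIMD": 2, "cycles": 16384,
         "res": {"LUT": 1200, "BRAM_18K": 4, "DSP": 0, "URAM": 0}},
        {"PE": 4, "SIMD": 8, "cycles": 1024,
         "res": {"LUT": 9100, "BRAM_18K": 16, "DSP": 0, "URAM": 0}},
    ],
}

def total_resource(chosen_configs, res_type):
    """Sum one resource type over the chosen config of every node."""
    return sum(cfg["res"][res_type] for cfg in chosen_configs.values())

chosen = {name: opts[-1] for name, opts in folding_options.items()}
print(total_resource(chosen, "BRAM_18K"))  # 16
```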
The very high-level steps of the algorithm are shown here. They are taken from our BSc thesis, where the work only applied to CNV networks, so they might differ slightly now that support for MobileNet-v1 has been included in the code.
Thanks Neil! I want to make sure we give this a thorough review as it's a critical part of FINN, so I've requested a few extra reviews on this from colleagues.
Is there a reason why a new platforms dictionary was added instead of using the one in finn-experimental? @maltanar Should we be proactive and plan to merge the finn-experimental platforms definition into FINN with this PR, instead of adding yet another platform definition dictionary? At minimum @neilkimn should consider reworking the PR to use the platform definitions in finn-experimental (since FINN docker now provides finn-experimental).
Finally, PR #346 is related to this one, as it fixes resource estimates for external-weights FC layers and increments the finn-experimental version. I think that PR should be merged first.
Hi @neilkimn, thank you a lot for your pull request. This feature seems to be something that many people are indeed interested in. As an example, I also implemented a variation of this during my master's thesis, but that code base is not as polished as yours. So far I have only taken a top-level look at your implementation, but a few questions and comments already came up:
- It might make more sense to use `target_fps` instead of `target_cycles_per_frame`, since the `clk_ns` is also given. This would make it easier for users to understand what they are influencing, and it is also what the `DataflowBuildConfig` uses (see the sketch after this list).
- During testing I noticed that the transformation will fail with a cryptic `TypeError` if the nodes in the model graph don't have names. As such, it would be necessary that the `SetFoldingExhaustive` implementation either executes `GiveUniqueNodeNames` before starting or confirms on its own that all nodes have names.
- In some cases finn-hls also has boundary conditions on the minimum value of the `SIMD` parameter. These stem from the fact that `vivado_hls` enforces that `ARRAY_PARTITION COMPLETE` variables are at most 1024 elements large, see this forum post: https://forums.xilinx.com/t5/High-Level-Synthesis-HLS/ARRAY-PARTITION-COMPLETE-has-exceeded-the-threshold-1024/td-p/862145 The only case I found where the folding is impacted by this in practice is in the `Matrix_Vector_Activate_Batch` function, here: https://github.com/Xilinx/finn-hlslib/blob/b37337c571b98f40423020bc79f97e189f2661d5/mvau.hpp#L108 This means that the following relation should be obeyed for the `StreamingFCLayer_Batch` nodes: `SIMD > mw / 1024`. Otherwise the synthesis is bound to fail. This is particularly important when setting the folding factors from scratch (see the sketch after this list).
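A small sketch of the first and third suggestions, assuming the cycle budget is resolved from `target_fps` and `clk_ns` the same way `DataflowBuildConfig` does (the helper names below are hypothetical):

```python
# Assumed resolution of the cycle budget, mirroring how DataflowBuildConfig
# derives it from target_fps and the synthesis clock period in nanoseconds.
def resolve_target_cycles(target_fps, clk_ns):
    clk_hz = 1e9 / clk_ns            # clock frequency in Hz
    return int(clk_hz / target_fps)  # cycles available per frame

# Smallest SIMD satisfying the relation stated above (SIMD > mw / 1024), which
# keeps the ARRAY_PARTITION COMPLETE array in Matrix_Vector_Activate_Batch
# under the vivado_hls 1024-element threshold; mw is the layer's matrix width.
def min_simd(mw):
    return mw // 1024 + 1

# (The second point would amount to running GiveUniqueNodeNames on the model
# before applying the folding transformation.)

print(resolve_target_cycles(target_fps=3000, clk_ns=10.0))  # 33333 cycles/frame
print(min_simd(mw=4096))                                    # SIMD must be at least 5
```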
It would be great if you could take a closer look at these comments.
Again, thank you a lot for your PR. It is highly appreciated!
Hi all,
Thanks for the thorough feedback and insightful comments. @quetric, I largely agree with your points on using the information supplied by the platforms in finn-experimental and have thus changed the instantiation of the class so that it does not make use of any board definitions for now. Until more has been decided on incorporating what's available in finn-experimental, I believe this will suffice (it does, however, move the 'issue' of utilizing the board definitions further upstream, and the user can, of course, go through the build steps without specifying a board).
@HenniOVP, regarding your first point, I absolutely agree. I did, however, follow the 'convention' of `step_target_fps_parallelization()` and assumed that the target cycles would be resolved in the same manner. If this has been or is going to be changed, I am happy to make that change. For your second and third points, thanks a lot - I've added changes addressing those.