New method for automatically setting folding factors
Background: https://github.com/Xilinx/finn/issues/297
This transformation sets the folding factors for a network and guarantees that constraints are satisfied, both at the node level and between nodes. It has been tested against the CNV networks and MobileNet-v1, and the resulting folding factors look good. The algorithm maps out every "legal" folding configuration for each node and iteratively updates the slowest nodes to increase overall network throughput, hence the name `SetFoldingExhaustive`.
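To make the iterative idea concrete, here is a minimal sketch (not the actual `SetFoldingExhaustive` code; the node names, cycle counts, and config shapes below are made up for illustration) of enumerating each node's legal folding configurations and repeatedly advancing the current bottleneck:

```python
# Minimal sketch of the bottleneck-driven search described above. Every node
# keeps a list of its legal folding configs, sorted from slowest/cheapest to
# fastest, and the slowest node is advanced until the target cycle count is
# met or no node can be sped up further.

def set_folding_exhaustive(legal_configs, target_cycles):
    """legal_configs: dict node_name -> list of (cycles_per_frame, config),
    sorted by descending cycles_per_frame (index 0 = slowest / cheapest)."""
    chosen = {name: 0 for name in legal_configs}  # start every node at its cheapest config

    def cycles(name):
        return legal_configs[name][chosen[name]][0]

    while True:
        bottleneck = max(legal_configs, key=cycles)
        if cycles(bottleneck) <= target_cycles:
            break  # throughput target met
        if chosen[bottleneck] + 1 >= len(legal_configs[bottleneck]):
            break  # slowest node is already at its fastest legal config
        chosen[bottleneck] += 1  # speed up the bottleneck by one step

    return {name: legal_configs[name][chosen[name]][1] for name in chosen}


# Toy example: two layers with (cycles_per_frame, {"PE": ..., "SIMD": ...}) options.
configs = {
    "StreamingFCLayer_Batch_0": [(8192, {"PE": 1, "SIMD": 1}),
                                 (4096, {"PE": 2, "SIMD": 1}),
                                 (2048, {"PE": 2, "SIMD": 2})],
    "StreamingFCLayer_Batch_1": [(4096, {"PE": 1, "SIMD": 1}),
                                 (1024, {"PE": 4, "SIMD": 1})],
}
print(set_folding_exhaustive(configs, target_cycles=2048))
```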
Caveats:

- Included a small dictionary with `boards` to briefly show how resource limits can be taken into account during optimization (see the sketch after this list). Will most likely need something like https://github.com/Xilinx/finn-experimental/blob/main/src/finn/util/platforms.py in order to have a more rigid structure around the boards. If a board is not passed, the target cycle count alone will serve as the target for the optimization.
- Nodes that do not have `lut_estimation()` - for these nodes, I made the assumption that they have insignificant resource usage and that they would be updated during optimization if necessary. It would be possible to model their resource usage post-synthesis for different ranges of folding configs and then use that relationship to estimate resources, but maybe there is a reason for them not to have a `lut_estimation()` method?
- As with `AllocateResources`, the algorithm makes use of a downscaling factor. Rather than 0.7, it is set to 0.85. It may need to be set lower if the point above does not hold and those nodes have significant resource usage.
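As a rough illustration of the board-dictionary idea from the first caveat (the board names and LUT counts below are examples only and do not reproduce the dictionary in this PR):

```python
# Illustrative sketch of how a small boards dictionary plus a downscaling
# factor can bound the optimization; see finn-experimental's platforms.py for
# a more rigid structure around board definitions.

BOARDS = {
    "Pynq-Z1": {"LUT": 53200},
    "ZCU104":  {"LUT": 230400},
    "U250":    {"LUT": 1728000},
}

LUT_DOWNSCALE = 0.85  # leave headroom, analogous to the 0.7 used by AllocateResources

def lut_budget(board_name):
    """Return the usable LUT budget for a board, or None if no board is given,
    in which case only the target cycle count constrains the optimization."""
    if board_name is None:
        return None
    return int(BOARDS[board_name]["LUT"] * LUT_DOWNSCALE)

print(lut_budget("Pynq-Z1"))  # 45220
```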
For future work in this space, I think it would be interesting to start taking more resources into account (BRAM, DSPs, URAM), as these resources are not being set automatically in most cases. I also believe this setup would be useful here, since the data structure containing the possible folding factors for a node also holds the estimated resource use for each folding configuration. The algorithm could thus be extended to make use of these resource estimates as well.
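Purely as an illustration of how that data structure might look if the estimates were extended to BRAM, DSPs, and URAM (the field names below are hypothetical, not the ones used in the PR):

```python
# Hypothetical per-node folding data structure with resource estimates beyond
# LUTs, so the optimizer can reject candidates that exceed a board budget.

folding_options = {
    "StreamingFCLayer_Batch_0": [
        {"PE": 1, "SIMD": 2, "cycles": 16384,
         "res": {"LUT": 1200, "BRAM_18K": 4, "DSP": 0, "URAM": 0}},
        {"PE": 4, "SIMD": 8, "cycles": 1024,
         "res": {"LUT": 9100, "BRAM_18K": 16, "DSP": 0, "URAM": 0}},
    ],
}

def total_resource(chosen_configs, res_type):
    """Sum one resource type over the chosen config of every node."""
    return sum(cfg["res"][res_type] for cfg in chosen_configs.values())

chosen = {name: opts[-1] for name, opts in folding_options.items()}
print(total_resource(chosen, "BRAM_18K"))  # 16
```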
The very high-level steps of the algorithm are shown here. They are taken from our BSc thesis, where the work only applied to CNV networks, so they might differ slightly now that support for MobileNet-v1 has been included in the code.
Thanks Neil! I want to make sure we give this a thorough review as it's a critical part of FINN, so I've requested a few extra reviews on this from colleagues.
Is there a reason why a new platforms dictionary was added instead of using the one in finn-experimental? @maltanar Should we be proactive and plan to merge the finn-experimental platforms definition into FINN with this PR, instead of adding yet another platform definition dictionary? At minimum @neilkimn should consider reworking the PR to use the platform definitions in finn-experimental (since FINN docker now provides finn-experimental).
Finally, PR #346 is related to this one, as it fixes resource estimates for external-weights FC layers and increments the finn-experimental version. I think that PR should be merged first.
Hi @neilkimn, thank you a lot for your pull request. This feature seems to be something that many people are indeed interested in. As an example, I also implemented a variation of this during my master's thesis, but that code base is not as polished as yours. So far I have only taken a top-level look at your implementation, but a few questions and comments already came up:
- It might make more sense to use `target_fps` instead of `target_cycles_per_frame`, since the `clk_ns` is also given. This would make it easier for users to understand what they are influencing, and it is also what the `DataflowBuildConfig` uses (see the sketch after this list).
- During testing I noticed that the transformation will fail with a cryptic `TypeError` if the nodes in the model graph don't have names. As such, it would be necessary that the `SetFoldingExhaustive` implementation either executes `GiveUniqueNodeNames` before starting or confirms on its own that all nodes have names.
- In some cases finn-hls also has boundary conditions on the minimum value of the `SIMD` parameter. These stem from the fact that `vivado_hls` enforces that `ARRAY_PARTITION COMPLETE` variables are at most 1024 elements large, see this forum post: https://forums.xilinx.com/t5/High-Level-Synthesis-HLS/ARRAY-PARTITION-COMPLETE-has-exceeded-the-threshold-1024/td-p/862145 The only case I found where the folding is impacted by this in practice is in the `Matrix_Vector_Activate_Batch` function, here: https://github.com/Xilinx/finn-hlslib/blob/b37337c571b98f40423020bc79f97e189f2661d5/mvau.hpp#L108 This means that the following relation should be obeyed for the `StreamingFCLayer_Batch` nodes: `SIMD > mw / 1024`. Otherwise the synthesis is bound to fail. This is particularly important when setting the folding factors from scratch (see the sketch after this list).
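A small sketch of the first and third suggestions, assuming the cycle budget is resolved from `target_fps` and `clk_ns` the same way `DataflowBuildConfig` does (the helper names below are hypothetical):

```python
# Assumed resolution of the cycle budget, mirroring how DataflowBuildConfig
# derives it from target_fps and the synthesis clock period in nanoseconds.
def resolve_target_cycles(target_fps, clk_ns):
    clk_hz = 1e9 / clk_ns            # clock frequency in Hz
    return int(clk_hz / target_fps)  # cycles available per frame

# Smallest SIMD satisfying the relation stated above (SIMD > mw / 1024), which
# keeps the ARRAY_PARTITION COMPLETE array in Matrix_Vector_Activate_Batch
# under the vivado_hls 1024-element threshold; mw is the layer's matrix width.
def min_simd(mw):
    return mw // 1024 + 1

# (The second point would amount to running GiveUniqueNodeNames on the model
# before applying the folding transformation.)

print(resolve_target_cycles(target_fps=3000, clk_ns=10.0))  # 33333 cycles/frame
print(min_simd(mw=4096))                                    # SIMD must be at least 5
```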
It would be great if you could take a closer look at these comments.
Again, thank you a lot for your PR. It is highly appreciated!
Hi all,
Thanks for the thorough feedback and insightful comments. @quetric, I largely agree with your points on using the information supplied by the platforms in finn-experimental and have thus changed the instantiation of the class so that it does not make use of any board definitions for now. Until more has been decided on incorporating what's available in finn-experimental, I believe this will suffice (it does, however, move the 'issue' of utilizing the board definitions further upstream, and the user can, of course, go through the build steps without specifying a board).
@HenniOVP, regarding your first point, I absolutely agree. I did, however, follow the 'convention' of `step_target_fps_parallelization()` and assumed that the target cycles would be resolved in the same manner. If this has been or is going to be changed, I am happy to make that change. For your second and third points, thanks a lot - I've added changes addressing those.