Large error increase upon synthesis and deployment.
Hi all,
I have been working on designing a regression network in Brevitas and deploying it using FINN. My code can be found at the following repo: https://github.com/kf7lsu/pytorchFACILE. The model is defined in quant_modelV2.py, training occurs in quant_train.ipynb, pre-synthesis transformations occur in quant_cleanup.ipynb, and synthesis occurs in quant_syn.ipynb.
The data fed into the network is quantized to an unsigned 4-bit integer using functions in processing_for_train.py and proc_for_infer.py, for PyTorch and NumPy formats respectively. They simply shift the values, divide by an increment, and round to an integer.
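For illustration, this is roughly what that preprocessing does (a minimal NumPy sketch with placeholder names, not the exact code from proc_for_infer.py):

```python
import numpy as np

def quantize_to_uint(x, x_min, step, bits=4):
    """Hypothetical sketch of the shift/scale/round quantization described above:
    shift by the minimum, divide by the step size, round, and clip to the
    unsigned range."""
    q = np.round((x - x_min) / step)
    return np.clip(q, 0, 2 ** bits - 1).astype(np.uint8)

# example: map raw values into the unsigned 4-bit range [0, 15]
raw = np.array([0.3, 1.7, 4.2])
print(quantize_to_uint(raw, x_min=0.0, step=0.3))
```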
After I perform the pre-synthesis transformations in quant_cleanup.ipynb, I check the accuracy of the model using the provided execute_onnx function, measuring the Mean Squared Error (MSE) on the validation dataset used during training. In the most recent test, the MSE was 545.8 after the transformations.
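Roughly, that accuracy check looks like this (a minimal sketch; the file names, import paths, and tensor handling are assumptions rather than the exact notebook code):

```python
import numpy as np
from finn.core.modelwrapper import ModelWrapper
from finn.core.onnx_exec import execute_onnx

# run the cleaned-up ONNX model on the validation set and compute the MSE
model = ModelWrapper("quant_model_cleanup.onnx")
iname = model.graph.input[0].name
oname = model.graph.output[0].name

x_val = np.load("x_val.npy")   # quantized validation inputs
y_val = np.load("y_val.npy")   # regression targets

preds = []
for x in x_val:
    inp = x.reshape(model.get_tensor_shape(iname)).astype(np.float32)
    out = execute_onnx(model, {iname: inp})
    preds.append(out[oname].flatten())
preds = np.concatenate(preds)

print("post-cleanup MSE:", np.mean((preds - y_val.flatten()) ** 2))
```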
From there, I synthesize the model using the build_dataflow tool. I am targeting an Ultra96-V2 board with a PYNQ shell.
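The build step is along these lines (a sketch only; the parameter values are assumptions, not copied from quant_syn.ipynb):

```python
from finn.builder.build_dataflow import build_dataflow_cfg
from finn.builder.build_dataflow_config import (
    DataflowBuildConfig,
    DataflowOutputType,
    ShellFlowType,
)

# build the cleaned-up model for the Ultra96 with the Zynq/PYNQ shell flow
cfg = DataflowBuildConfig(
    output_dir="output_final",
    target_fps=100000,            # placeholder throughput target
    synth_clk_period_ns=10.0,
    board="Ultra96",
    shell_flow_type=ShellFlowType.VIVADO_ZYNQ,
    generate_outputs=[
        DataflowOutputType.BITFILE,
        DataflowOutputType.PYNQ_DRIVER,
        DataflowOutputType.DEPLOYMENT_PACKAGE,
    ],
)
build_dataflow_cfg("quant_model_cleanup.onnx", cfg)
```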
The output_final directory is committed and pushed to the GitHub repository and then pulled down onto the PYNQ board. From there I run the throughput tests and measure the MSE on the FPGA. In the most recent test, the on-FPGA MSE on the same dataset as the post-cleanup test was 11,980.7.
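The on-board check boils down to something like this (file names are assumptions; the accelerator output is whatever the generated PYNQ driver writes out):

```python
import numpy as np

# compare the accelerator output against the regression targets
y_val = np.load("y_val.npy").flatten()
y_fpga = np.load("output.npy").flatten()   # output saved by the generated driver

mse = np.mean((y_fpga - y_val) ** 2)
mae = np.mean(np.abs(y_fpga - y_val))
print("on-FPGA MSE:", mse, "MAE:", mae)
```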
Update: I was encountering an error when training at large batch sizes and inferring at low batch sizes. After fixing that issue and increasing the quantization bitwidth to 6 bits, I was getting better results: an MSE of ~100 post-training and post-transformations (MAE of ~10).
That said, I am still seeing a notable increase in error after synthesis and deployment, with an MSE of ~2700 (MAE of ~50).
Update: I created plots of expected vs. predicted results to determine how much of a performance hit occurs during synthesis. The results are taken from the network after the cleanup transformations are performed and after deployment on a PYNQ FPGA. Here's the link to my findings: link
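The plots are simple expected-vs-predicted scatters, roughly like this (array and file names are placeholders):

```python
import matplotlib.pyplot as plt
import numpy as np

y_true = np.load("y_val.npy").flatten()
y_onnx = np.load("preds_post_cleanup.npy").flatten()
y_fpga = np.load("preds_fpga.npy").flatten()

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
for ax, preds, title in zip(axes, [y_onnx, y_fpga], ["post-cleanup ONNX", "on-FPGA"]):
    ax.scatter(y_true, preds, s=2)
    # ideal predictions lie on the diagonal
    ax.plot([y_true.min(), y_true.max()], [y_true.min(), y_true.max()], "r--")
    ax.set_xlabel("expected")
    ax.set_ylabel("predicted")
    ax.set_title(title)
plt.tight_layout()
plt.savefig("expected_vs_predicted.png")
```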
Hi @kf7lsu, thanks for taking the time to report this in detail -- we've run relatively few regression NNs through FINN and I suspect you've found a few of the bugs around that. @fpjentzsch will have a closer look.
Hi @kf7lsu , I found 2 separate problems that cause the erroneous behavior:
- The driver is initialized with the wrong input shapes. With the current folding config, it should be `"ishape_folded" : (1, 14, 1)` and `"ishape_packed" : (1, 14, 1)` instead of `"ishape_folded" : (1, 1, 14)` and `"ishape_packed" : (1, 1, 11)`. The packed shape is the same as the folded shape in this case because only the last dimension is packed (see the shape sketch after this list).
- After streamlining, there remains a Mul node at the very end of the graph, as it cannot be absorbed into any subsequent FC or thresholding layer. Currently, there is no convert_to_hls transformation to handle such lonesome Mul nodes, so it is simply omitted during synthesis of the dataflow partition. Note that synthesis would fail entirely if this happened in the middle of the graph, but in this case the node is still left inside the surrounding "dataflow_parent.onnx" (which you can inspect under "intermediate_models/"). This is also the reason why Python-, cpp-, and rtl-based simulation yield the same results: they operate on the complete ONNX graph.
This also means that the simulation output is a floating-point value, so the `batch_out.astype("int8")` cast is actually increasing the MSE. When I leave this cast out and apply the final Mul node in software (*0.9649 in this case), the post-transformation and hardware implementations reach the same MSE of ~108. Interestingly, the MSE is only ~99 if I don't apply the final scaling.
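In code, the workaround looks roughly like this (array and file names are assumptions):

```python
import numpy as np

# keep the raw floating-point output, apply the leftover Mul scale in
# software, and skip the int8 cast before computing the error
batch_out = np.load("output.npy").astype(np.float32)
final_mul_scale = 0.9649            # value of the leftover Mul node in this graph
preds = batch_out.flatten() * final_mul_scale

y_val = np.load("y_val.npy").flatten()
print("MSE:", np.mean((preds - y_val) ** 2))
```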
Thank you for the quick response. Interestingly, I was modeling that config off of the automatically generated driver. Is there a different location where you were able to find the correct folding configuration, or is that something you could determine just by looking at the network? After making that modification, I am now getting correct MSE numbers. Thank you.
I see, then you ran into a bug in the driver generation that we recently fixed in #301. In the current dev branch the driver should correctly infer the shapes.
In any case, you could manually inspect the `"folded_shape"` attribute of the first StreamingFIFO node in the dataflow partition. If your graph starts with a different node without such an attribute (e.g. an FCLayer), you can easily calculate the shape from the folding attributes by calling `get_folded_input_shape()` on the node.
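For example, something along these lines should work (the file name is an assumption; `getCustomOp` wraps the node so the folding attributes become accessible):

```python
from finn.core.modelwrapper import ModelWrapper
from finn.custom_op.registry import getCustomOp

model = ModelWrapper("intermediate_models/dataflow_partition.onnx")
first_node = model.graph.node[0]

if first_node.op_type == "StreamingFIFO":
    # FIFOs store the folded shape directly as a node attribute
    print(getCustomOp(first_node).get_nodeattr("folded_shape"))
else:
    # e.g. an FCLayer: derive it from the folding attributes
    print(getCustomOp(first_node).get_folded_input_shape())
```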
I am closing this issue due to inactivity. Please feel free to reopen or create a new issue, if the problem persists!