
Adapting software-trained parameters to the hardware

Open 28DaaLong opened this issue 1 year ago • 5 comments

Hello Prof. Zhang! First, thank you very much for open-sourcing this work. I recently read your FracBNN paper, trained the network, and deployed it on HLS, but I still have the following questions and would appreciate your answers:

(1) I first deployed the network on HLS with the hardware parameters you provided and reached a good accuracy (95.1%). I then retrained the network in software with all default parameters (epoch=260), obtained new network parameters, and tried to deploy them on HLS. I concatenated the convolution weights along the input-channel axis, zero-padded the tail, and packed them into uint64 form, then loaded the weights together with the remaining parameters onto HLS. The network now returns the identical 10-class result for every input image: -20.5625 -6.83594 -7.55469 -5.4375 -15.1016 3.5625 -25.4453 -9.15625 -23.2344 -18.3984. I would like to know where the problem lies, and whether the training hyperparameters must be modified before the model can be deployed on HLS. In a later experiment I found that replacing only the BN parameters in layer3_2 and the fully connected layer parameters with the newly trained ones, while keeping everything else at the provided values, reproduces the problem.

(2) Finally, I would like to know the input format of the input.bin and label.bin files used in the hardware test; I would appreciate it if you could provide them.

Looking forward to your reply.
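The packing step described in (1) can be sketched as follows. This is a hypothetical illustration of bit-packing binary conv weights along the input-channel axis with zero padding, not the repo's actual packing code; the function name, layout, and bit order are assumptions.

```python
import numpy as np

def pack_binary_weights(w):
    """w: int array of shape (out_ch, in_ch, kh, kw) with values -1/+1.
    Returns a uint64 array of shape (out_ch, ceil(in_ch/64), kh, kw)."""
    bits = (w > 0).astype(np.uint64)           # map +1 -> 1, -1 -> 0
    out_ch, in_ch, kh, kw = bits.shape
    words = (in_ch + 63) // 64                 # uint64 words per position
    padded = np.zeros((out_ch, words * 64, kh, kw), dtype=np.uint64)
    padded[:, :in_ch] = bits                   # zero-pad the tail channels
    packed = np.zeros((out_ch, words, kh, kw), dtype=np.uint64)
    for b in range(64):                        # bit b comes from channel word*64 + b
        packed |= padded[:, b::64] << np.uint64(b)
    return packed
```

If the accelerator expects the opposite bit order (channel 0 in the most significant bit), the logits would be garbage in exactly the way described above, so the bit-endianness convention is worth double-checking against the provided pre-packed weights.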

28DaaLong avatar Jan 06 '24 11:01 28DaaLong

Hi,

Thanks for reading our FracBNN paper and trying out the FPGA deployment! In the interest of a potentially broader audience, I will respond in English. Some details are missing for me to fully answer the questions, but I'll give my best attempt as follows:

Why am I getting the same classification results for all input images?

The first thing to check is the mean squared error between the HLS C and PyTorch model outputs for the same inputs. If the error is small, the input images may indeed be from the same class. If the error is large, then something is likely wrong with the weight packing after model retraining.
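The suggested sanity check can be sketched in a few lines: compare the 10-class logits from the HLS C testbench against the PyTorch model on the same image and look at the mean squared error. The helper name is hypothetical, and the sample values below are just the logits reported in the question.

```python
import numpy as np

def logits_mse(hls_logits, torch_logits):
    """Mean squared error between two logit vectors for the same input."""
    a = np.asarray(hls_logits, dtype=np.float64)
    b = np.asarray(torch_logits, dtype=np.float64)
    return float(np.mean((a - b) ** 2))

# The constant logits reported in the question, vs. a (placeholder)
# PyTorch reference; in practice, dump both for the same test image.
hls = [-20.5625, -6.83594, -7.55469, -5.4375, -15.1016,
       3.5625, -25.4453, -9.15625, -23.2344, -18.3984]
ref = hls  # replace with the PyTorch logits for the same image
print(logits_mse(hls, ref))  # a large value points at weight packing
```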

Is it required to modify training hyperparameters in order to deploy the model on FPGA?

No. Training hyperparameters only affect model accuracy and are independent of FPGA deployment. Whatever accuracy you get from PyTorch training should be replicable on FPGA, assuming weight loading etc. is correct.

Replacing the BN and FC layer weights in layer3_2 with freshly trained in-house weights reproduces the issue from the first question.

Not sure if I follow. Does the "same" issue mean that the output logits are exactly the same in both cases, or that the classification results are the same? If the latter, please refer to my answer to the first question.

What is the format of input.bin and label.bin?

label.bin contains the output labels. Since each label is a single digit used only for evaluating the model, you can use whatever format you are comfortable with. input.bin contains the channel-packed input images. In FracBNN we binarize the input RGB images using thermometer encoding, so each input image to the FPGA accelerator has 96 channels of 1-bit pixels. We pack those channels into int64 as well. You can refer to the paper for how thermometer encoding works, or call this function with the resolution set to 8 to view the output shape.
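A hedged sketch of thermometer encoding as described above: each 8-bit pixel is compared against a ladder of thresholds, producing a unary (thermometer) code. Assuming a threshold step of 8 (so 256/8 = 32 bits per channel), an RGB image becomes 3 × 32 = 96 binary channels, matching the shape mentioned in the answer. The exact thresholds and function signature in the repo may differ; this is an illustration, not the repo's helper.

```python
import numpy as np

def thermometer_encode(img_u8, step=8):
    """img_u8: uint8 array (H, W, 3). Returns a {0,1} uint8 array of
    shape (H, W, 3 * (256 // step))."""
    levels = 256 // step                       # 32 bits per channel for step=8
    thresholds = np.arange(levels) * step      # 0, 8, 16, ..., 248
    # bit k of a channel is 1 iff the pixel value >= thresholds[k]
    bits = (img_u8[..., :, None] >= thresholds).astype(np.uint8)
    h, w, c, l = bits.shape
    return bits.reshape(h, w, c * l)

img = np.zeros((4, 4, 3), dtype=np.uint8)
enc = thermometer_encode(img)
print(enc.shape)  # (4, 4, 96)
```

These 96 binary channels would then be bit-packed into int64 words for input.bin in the same spirit as the weight packing.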

Best, Yichi

ychzhang avatar Jan 06 '24 23:01 ychzhang

Thank you for your reply! Yes, for input images from different classes, the hardware network outputs numerically identical 10-class results. To rule out other factors in the test setup, I used the testbench file you provided (which is fine as-is and classifies correctly with the original parameters) and kept your provided weights (already packed as uint64, unmodified). Replacing only the BN and FC layer weights in layer3_2 with our newly trained weights triggers the problem, and the resulting 10-class output is identical to the one obtained when all parameters are replaced. This seems to indicate that the newly substituted BN and FC layers have no discriminative power over the input data and collapse the computation to the same values. The problem therefore seems to point to the software side, but I did not modify the software code (only the hyperparameters), and the exported floating-point parameters were uniformly kept to 8 significant digits. This result puzzles me, and I do not know where else the problem could be. The figure below shows part of the testbench output; I hope you can offer some guidance.

[screenshot: partial testbench output]

Looking forward to your reply.

28DaaLong avatar Jan 07 '24 03:01 28DaaLong

In your screenshot, I guess "software" means the C testbench. The software outputs look good. My guess is that the hardware weight packing has an issue when you plug in your own weights?

ychzhang avatar Jan 20 '24 05:01 ychzhang


Thank you for your reply. Yes, after inspection, there was indeed a problem in the BN parameter packing process; after fixing it, the problem was solved. I would also like to raise a new question here and hope to get your answer: I trained a model with good accuracy on the software side, but after importing the parameters into hardware the accuracy dropped by about 10% (the parameter packing and import procedure have been verified to be correct). Where could this problem come from, and is it related to how the model is trained in software? Looking forward to your reply!
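One common source of such a hardware accuracy gap is fixed-point rounding of the floating-point BN/FC parameters inside the HLS design. This hedged sketch emulates an ap_fixed&lt;W, I&gt;-style quantization in Python, so the accuracy drop can be reproduced (or ruled out) on the software side before blaming the training recipe. The word widths here are assumptions, not the values used in the FracBNN accelerator.

```python
import numpy as np

def to_fixed(x, total_bits=16, int_bits=8):
    """Round x to signed fixed-point with `total_bits` bits total,
    `int_bits` of them (including sign) before the binary point."""
    frac_bits = total_bits - int_bits
    scale = 2.0 ** frac_bits
    lo = -(2 ** (total_bits - 1))              # saturate instead of wrapping
    hi = 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(np.asarray(x) * scale), lo, hi)
    return q / scale

# Quantize some example BN scale factors and inspect the rounding error;
# running the whole PyTorch model with quantized parameters would show
# whether the ~10% accuracy drop reappears.
bn_scale = np.array([1.2345678, -0.0078125, 3.9999999])
print(np.max(np.abs(to_fixed(bn_scale) - bn_scale)))  # worst-case rounding error
```

If evaluating the PyTorch model with parameters passed through `to_fixed` reproduces the drop, widening the fractional bits (or folding BN into the preceding layer at higher precision) would be the place to look.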

28DaaLong avatar Apr 28 '24 01:04 28DaaLong
