NNCF inference with BN folding?
Will BN folding cause accuracy loss during the inference phase? We want to fold BN into the weights to improve inference speed, because the BN layer costs extra memory and instruction cycles.
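For reference, this is the arithmetic BN folding performs. Below is a minimal PyTorch sketch of folding a BatchNorm layer into the preceding convolution (illustrative only, not the NNCF or OpenVINO implementation):

```python
import torch

def fold_bn_into_conv(conv: torch.nn.Conv2d, bn: torch.nn.BatchNorm2d) -> torch.nn.Conv2d:
    """Fold BatchNorm statistics into a copy of the preceding conv layer.

    y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta
      = (gamma / sqrt(var + eps)) * W * x + (gamma / sqrt(var + eps)) * (b - mean) + beta
    """
    fused = torch.nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                            conv.stride, conv.padding, conv.dilation, conv.groups, bias=True)
    with torch.no_grad():
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)   # one factor per output channel
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
        fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused
```

In floating point this transformation is exact (up to rounding), which is why folding by itself does not change accuracy.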
Hi @xiaoyaopeng,
OpenVINO folds BN for you automatically, for both floating-point and quantized models, and there is no accuracy degradation if you use NNCF.
Thanks.
One more thing: OpenVINO will fold BN into the weights? If we use per-tensor quantization, what will OpenVINO do to the weights? Will the scale parameter be folded too? Note that the dimension of the BN parameters is the same as the channel count of the weights, but we only have one scale parameter per layer.
This can actually be a problem, because OpenVINO mostly fuses BN into the FakeQuantize parameters. To be honest, the HW we have supports per-channel quantization of weights as the most accurate scheme.
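To illustrate why per-channel quantization of weights plays well with BN folding, here is a small sketch (symmetric quantization assumed; the function name is illustrative). Per-channel quantization keeps one scale per output channel, matching the dimension of the folded BN factors, while per-tensor quantization collapses everything into a single scale:

```python
import torch

def weight_scales(weight: torch.Tensor, per_channel: bool, num_bits: int = 8) -> torch.Tensor:
    """Symmetric quantization scales for a conv weight of shape [out_ch, in_ch, kh, kw]."""
    qmax = 2 ** (num_bits - 1) - 1
    if per_channel:
        # one scale per output channel -- same shape as the folded BN gamma factor
        max_abs = weight.abs().amax(dim=(1, 2, 3))
    else:
        # a single scale for the whole tensor; channels whose folded BN factor
        # is small lose resolution relative to the channel with the largest range
        max_abs = weight.abs().amax()
    return max_abs / qmax

w = torch.randn(16, 3, 3, 3)
print(weight_scales(w, per_channel=True).shape)   # torch.Size([16])
print(weight_scales(w, per_channel=False).shape)  # torch.Size([])
```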
In the classification example, compress_ctrl can export an ONNX model via the _export_to_onnx() function in the PTExporter class. Will the quantization parameters be exported to the ONNX model? Can OpenVINO parse these parameters? OpenVINO has model optimizations such as BN folding, but I can't find anything in the OpenVINO docs describing whether OpenVINO can work with an NNCF-compressed model.
NNCF can export quantization parameters either to standard ONNX with QuantizeLinear and DequantizeLinear operations, or to custom ONNX with the FakeQuantize op from the OpenVINO domain. Both are recognizable by OpenVINO. As for BN folding, that is the responsibility of the Model Optimizer component within OpenVINO.
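For context, a minimal sketch of driving the export from the public NNCF API (the config values and model are illustrative, import paths differ slightly between NNCF versions, and range initialization via a dataloader is omitted for brevity; in the classification sample this export goes through PTExporter under the hood):

```python
import torch
from torchvision.models import resnet18
from nncf import NNCFConfig
from nncf.torch import create_compressed_model

# Quantization config; values here are illustrative defaults.
nncf_config = NNCFConfig.from_dict({
    "input_info": {"sample_size": [1, 3, 224, 224]},
    "compression": {"algorithm": "quantization"},
})

model = resnet18()
compression_ctrl, compressed_model = create_compressed_model(model, nncf_config)

# Export the model with its quantization parameters embedded in the ONNX graph,
# which OpenVINO's Model Optimizer can then consume (including BN folding).
compression_ctrl.export_model("resnet18_int8.onnx")
```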