yolo2_light
INT8 implementation reference
Hi, @AlexeyAB
I'd like to know more about how the INT8 version is implemented. Is it based on one or more papers? Could you give related links for reference?
Thanks
@trustin77 Hi,
I have not seen step-by-step instructions on how to do this. I used these documents:
- How Float-32 is converted to INT-8 in TensorRT: http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf
- How to use `CUDNN_DATA_INT8x4` in cuDNN: https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#cudnnConvolutionForward
- How to convert `CUDNN_TENSOR_NCHW` & INT8 to `CUDNN_TENSOR_NCHW_VECT_C` & INT8x4: https://devtalk.nvidia.com/default/topic/1028139/cudnn/how-to-reduce-time-spent-in-transforming-tensors-using-cudnnv6-0-for-api-cudnntransformtensor-/post/5264978/#5264978
- About optimal input_calibration: https://github.com/AlexeyAB/yolo2_light/issues/24#issuecomment-435361415
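The TensorRT slides above describe symmetric linear quantization: pick a calibration threshold per layer, map `[-threshold, +threshold]` onto `[-127, +127]`, and saturate values outside that range. A minimal sketch of that scheme (the `threshold` value here is an arbitrary example, not one taken from a real calibration run):

```python
import numpy as np

def quantize_int8(x, threshold):
    """Symmetric linear quantization: map [-threshold, threshold] to
    [-127, 127], saturating values beyond the calibration threshold."""
    scale = 127.0 / threshold
    q = np.clip(np.round(x * scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the int8 representation."""
    return q.astype(np.float32) / scale

# Example: activations, with an assumed calibration threshold of 4.0;
# the 8.0 entry lies beyond the threshold and saturates to 127.
acts = np.array([0.5, -2.0, 3.9, 8.0], dtype=np.float32)
q, scale = quantize_int8(acts, threshold=4.0)
recovered = dequantize(q, scale)
```

Choosing the threshold per layer (e.g. by minimizing KL divergence between the float and quantized activation distributions, as TensorRT does) is what the `input_calibration` values approximate.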
Also about quantization:
- Yolo v2 INT8 - too high a reduction of accuracy: http://cs231n.stanford.edu/reports/2017/pdfs/808.pdf
- Optimal quantization is INT 4-bit: https://arxiv.org/abs/1510.00149
- XNOR 1-bit quantization - "This motivates us to avoid binarization at the first and last layer of a CNN": https://arxiv.org/abs/1603.05279
- MobileNet quantization: https://arxiv.org/abs/1712.05877
- Quantization of old models: https://arxiv.org/abs/1512.06473
- About XNOR: https://arxiv.org/abs/1807.03010
- Also about XNOR: https://arxiv.org/abs/1803.05849
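For the XNOR-Net paper linked above, the core idea is to approximate real-valued weights `W` as `alpha * B`, where `B = sign(W)` and `alpha` is the mean absolute value of the weights. A minimal sketch of that weight-binarization step (simplified to a single flat filter; the paper computes `alpha` per output filter):

```python
import numpy as np

def binarize_weights(w):
    """XNOR-Net-style binarization: approximate W ~ alpha * B,
    where B = sign(W) in {-1, +1} and alpha = mean(|W|)."""
    alpha = np.mean(np.abs(w))        # scalar scaling factor
    b = np.where(w >= 0, 1.0, -1.0)   # binary weights
    return alpha, b

# Example filter weights (illustrative values only)
w = np.array([0.4, -0.2, 0.1, -0.5], dtype=np.float32)
alpha, b = binarize_weights(w)
approx = alpha * b  # binary approximation of w
```

With both weights and inputs binarized, the convolution's dot products reduce to XNOR plus popcount operations, which is where the speedup comes from; as the quoted sentence notes, the first and last layers are usually kept in higher precision to limit the accuracy loss.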