
INT8 inference on Tensor Cores

daniel89710 opened this issue 6 years ago • 8 comments

I tried quantized YOLOv3 on a Volta GPU, but it didn't seem to run on Tensor Cores. The cuDNN Developer Guide (section 2.8.2) recommends using "CUDNN_DATA_INT8x32" for Tensor Core operations. https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html

I think cuDNN supports only "CUDNN_DATA_INT8x32" for running INT8 on Tensor Cores. Is this correct?

daniel89710 avatar Oct 19 '18 13:10 daniel89710

I think cuDNN supports only "CUDNN_DATA_INT8x32" for running INT8 on Tensor Cores. Is this correct?

I think yes: CUDNN_DATA_INT8x32 should be used, and it is supported only on Xavier GPUs (CC 7.2); Tesla V100 (CC 7.0) and GeForce RTX 2080 Ti / 2070 (CC 7.5) are not supported.


https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#tensor-ops-speedup-tips

  • It seems Tensor Cores for INT8 are supported only on CC 7.2: Jetson Xavier and DRIVE PX Xavier/Pegasus with the Xavier SoC https://en.wikipedia.org/wiki/CUDA#GPUs_supported As I understand it, they are not supported on Tesla V100 (CC 7.0) or GeForce RTX 2080 Ti / 2070 (CC 7.5)

  • CUDNN_DATA_INT8x32 is required here if channels % 32 == 0 (see the sketch after this list):

    • https://github.com/AlexeyAB/yolo2_light/blob/fb565fa43d76a8d1f3f741dbfdfefdb4cdfbdf4b/src/additionally.c#L1690-L1695
    • https://github.com/AlexeyAB/yolo2_light/blob/fb565fa43d76a8d1f3f741dbfdfefdb4cdfbdf4b/src/yolov2_forward_network_gpu.cu#L604
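A minimal sketch of how that data-type selection could look with the cuDNN API (the variable name input_channels is illustrative here, not the actual yolo2_light code):

// Pick the cuDNN INT8 data type from the layer's input channel count.
// CUDNN_DATA_INT8x32 needs channels % 32 == 0 (Tensor Cores, CC 7.2),
// CUDNN_DATA_INT8x4 needs channels % 4 == 0 (DP4A, Pascal and higher);
// both use the CUDNN_TENSOR_NCHW_VECT_C layout.
cudnnDataType_t data_type;
cudnnTensorFormat_t format;
if (input_channels % 32 == 0) {
    data_type = CUDNN_DATA_INT8x32;
    format = CUDNN_TENSOR_NCHW_VECT_C;
} else if (input_channels % 4 == 0) {
    data_type = CUDNN_DATA_INT8x4;
    format = CUDNN_TENSOR_NCHW_VECT_C;
} else {
    data_type = CUDNN_DATA_FLOAT;   // fall back to FP32 for this layer
    format = CUDNN_TENSOR_NCHW;
}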

AlexeyAB avatar Oct 19 '18 14:10 AlexeyAB

Thank you for your reply. I tried it on a Tesla V100, so I will try on Xavier. Should I only change CUDNN_DATA_INT8x4 to CUDNN_DATA_INT8x32?

daniel89710 avatar Oct 20 '18 13:10 daniel89710

@daniel89710 I added a fix; update your code from GitHub and un-comment these 2 lines:

  • https://github.com/AlexeyAB/yolo2_light/blob/1f270bf1c1c04dde3fbc779dd0ce90f85490301a/src/yolov2_forward_network_gpu.cu#L604
  • https://github.com/AlexeyAB/yolo2_light/blob/1f270bf1c1c04dde3fbc779dd0ce90f85490301a/src/additionally.c#L1787

So for layers with a channel count that is a multiple of 32, Tensor Cores will be used (Xavier); for layers with a channel count that is a multiple of 4, DP4A will be used (Pascal and higher).

Also check that you are using cuDNN >= 7.2.
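For example, the cuDNN version can be checked both at build time and at run time with something like this minimal standalone sketch (not part of yolo2_light):

#include <stdio.h>
#include <cudnn.h>

// Fail the build if the headers are older than cuDNN 7.2.
#if CUDNN_MAJOR < 7 || (CUDNN_MAJOR == 7 && CUDNN_MINOR < 2)
#error "cuDNN >= 7.2 is required for the INT8x32 / Tensor Core path"
#endif

int main(void) {
    // Header version vs. the version of the library loaded at run time.
    printf("built against cuDNN %d.%d.%d, running %zu\n",
           CUDNN_MAJOR, CUDNN_MINOR, CUDNN_PATCHLEVEL, cudnnGetVersion());
    return 0;
}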

AlexeyAB avatar Oct 20 '18 22:10 AlexeyAB

@AlexeyAB Thank you for quick update. I will check.

daniel89710 avatar Oct 21 '18 11:10 daniel89710

Hi, @AlexeyAB

I tried your code on Xavier after updating it and un-commenting the 2 lines. However, I got a failure at checkCUDNN(transform_status) in yolov2_forward_network_gpu.cu:626. Error: 3 - CUDNN_STATUS_BAD_PARAM.

Do you have any ideas?

daniel89710 avatar Oct 23 '18 01:10 daniel89710

@daniel89710 Hi,

This is very similar to a bug in cuDNN.

I tried cuDNN 7.3.1 for CUDA 10 + CUDA 10 + MSVS 2015 on Windows 7 x64, and I can't even create a descriptor for CUDNN_DATA_INT8x32. It always returns desc_status == CUDNN_STATUS_BAD_PARAM:

cudnnTensorDescriptor_t desc;
cudnnCreateTensorDescriptor(&desc);

// n = 128, c = 128 (a multiple of 32), h = 3, w = 3; still returns CUDNN_STATUS_BAD_PARAM
cudnnStatus_t desc_status =
    cudnnSetTensor4dDescriptor(desc, CUDNN_TENSOR_NCHW_VECT_C, CUDNN_DATA_INT8x32,
                               128, 128, 3, 3);
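A small debugging sketch that could be added right after the call above to print the exact status text:

// Report the cuDNN status as a human-readable string.
if (desc_status != CUDNN_STATUS_SUCCESS) {
    fprintf(stderr, "cudnnSetTensor4dDescriptor failed: %s\n",
            cudnnGetErrorString(desc_status));
}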

AlexeyAB avatar Oct 23 '18 23:10 AlexeyAB

@daniel89710

Also, what acceleration do you get by using the -quantized flag with the default CUDNN_DATA_INT8x4 (without CUDNN_DATA_INT8x32) on Xavier / Volta / Turing, compared to FP32 calculation (without the -quantized flag)?

The documentation says that we should use INT8x32 to accelerate inference, but it does not say that this is a mandatory condition for using Tensor Cores. It seems both INT8x4 and INT8x32 will use Tensor Cores, but INT8x32 will be faster: https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#tensor-ops-speedup-tips

CUDNN_DATA_INT8x32 to accelerate your convolution computation

AlexeyAB avatar Oct 25 '18 17:10 AlexeyAB

@AlexeyAB Sorry for the late reply. I tried the -quantized flag on Xavier and got a slight slowdown compared with no -quantized flag. I think we still cannot use Tensor Cores when using the -quantized flag. I checked with nvprof whether this ran on Tensor Cores, but it seems to be executed only on CUDA cores, without Tensor Cores.

daniel89710 avatar Oct 30 '18 12:10 daniel89710