
INT8 inference on Tensor Cores

daniel89710 opened this issue 6 years ago • 8 comments

I tried quantized YOLOv3 on a Volta GPU, but it didn't seem to run on Tensor Cores. The cuDNN Developer Guide (section 2.8.2) recommends using "CUDNN_DATA_INT8x32" for Tensor Core operations. https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html

I think cuDNN supports only "CUDNN_DATA_INT8x32" for running INT8 on Tensor Cores. Is this correct?

daniel89710 avatar Oct 19 '18 13:10 daniel89710

I think cuDNN supports only "CUDNN_DATA_INT8x32" for running INT8 on Tensor Cores. Is this correct?

I think yes: CUDNN_DATA_INT8x32 should be used, and it is supported only on Xavier GPUs (CC 7.2); Tesla V100 (CC 7.0) and GeForce RTX 2080 Ti / 2070 (CC 7.5) are not supported.


https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#tensor-ops-speedup-tips

  • It seems Tensor Cores for INT8 are supported only on CC 7.2: Jetson Xavier and DRIVE PX Xavier/Pegasus with the Xavier SoC https://en.wikipedia.org/wiki/CUDA#GPUs_supported As I understand it, they are not supported on Tesla V100 (CC 7.0) or GeForce RTX 2080 Ti / 2070 (CC 7.5)

  • CUDNN_DATA_INT8x32 is required here if channels % 32 == 0 (see the sketch after this list):

    • https://github.com/AlexeyAB/yolo2_light/blob/fb565fa43d76a8d1f3f741dbfdfefdb4cdfbdf4b/src/additionally.c#L1690-L1695
    • https://github.com/AlexeyAB/yolo2_light/blob/fb565fa43d76a8d1f3f741dbfdfefdb4cdfbdf4b/src/yolov2_forward_network_gpu.cu#L604
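A minimal sketch of how that data-type selection could look with the cuDNN API (the variable name input_channels is illustrative here, not the actual yolo2_light code):

// Pick the cuDNN INT8 data type from the layer's input channel count.
// CUDNN_DATA_INT8x32 needs channels % 32 == 0 (Tensor Cores, CC 7.2),
// CUDNN_DATA_INT8x4 needs channels % 4 == 0 (DP4A, Pascal and higher);
// both use the CUDNN_TENSOR_NCHW_VECT_C layout.
cudnnDataType_t data_type;
cudnnTensorFormat_t format;
if (input_channels % 32 == 0) {
    data_type = CUDNN_DATA_INT8x32;
    format = CUDNN_TENSOR_NCHW_VECT_C;
} else if (input_channels % 4 == 0) {
    data_type = CUDNN_DATA_INT8x4;
    format = CUDNN_TENSOR_NCHW_VECT_C;
} else {
    data_type = CUDNN_DATA_FLOAT;   // fall back to FP32 for this layer
    format = CUDNN_TENSOR_NCHW;
}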

AlexeyAB avatar Oct 19 '18 14:10 AlexeyAB

Thank you for your reply. I tried it on a Tesla V100, so I will try on Xavier. Should I only change CUDNN_DATA_INT8x4 to CUDNN_DATA_INT8x32?

daniel89710 avatar Oct 20 '18 13:10 daniel89710

@daniel89710 I added a fix; update your code from GitHub and un-comment these 2 lines:

  • https://github.com/AlexeyAB/yolo2_light/blob/1f270bf1c1c04dde3fbc779dd0ce90f85490301a/src/yolov2_forward_network_gpu.cu#L604
  • https://github.com/AlexeyAB/yolo2_light/blob/1f270bf1c1c04dde3fbc779dd0ce90f85490301a/src/additionally.c#L1787

So for layers with a channel count that is a multiple of 32, Tensor Cores will be used (Xavier); for layers with a channel count that is a multiple of 4, DP4A will be used (Pascal and higher).

Also check that you are using cuDNN >= 7.2.
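For example, the cuDNN version can be checked both at build time and at run time with something like this minimal standalone sketch (not part of yolo2_light):

#include <stdio.h>
#include <cudnn.h>

// Fail the build if the headers are older than cuDNN 7.2.
#if CUDNN_MAJOR < 7 || (CUDNN_MAJOR == 7 && CUDNN_MINOR < 2)
#error "cuDNN >= 7.2 is required for the INT8x32 / Tensor Core path"
#endif

int main(void) {
    // Header version vs. the version of the library loaded at run time.
    printf("built against cuDNN %d.%d.%d, running %zu\n",
           CUDNN_MAJOR, CUDNN_MINOR, CUDNN_PATCHLEVEL, cudnnGetVersion());
    return 0;
}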

AlexeyAB avatar Oct 20 '18 22:10 AlexeyAB

@AlexeyAB Thank you for quick update. I will check.

daniel89710 avatar Oct 21 '18 11:10 daniel89710

Hi, @AlexeyAB

I tried your code on Xavier after updating it and un-commenting the 2 lines. However, I got a failure at checkCUDNN(transform_status) in yolov2_forward_network_gpu.cu:626. Error: 3 - CUDNN_STATUS_BAD_PARAM.

Do you have any ideas?

daniel89710 avatar Oct 23 '18 01:10 daniel89710

@daniel89710 Hi,

This is very similar to a bug in cuDNN.

I tried cuDNN 7.3.1 for CUDA 10 + CUDA 10 + MSVS 2015 on Windows 7 x64, and I can't even create a descriptor for CUDNN_DATA_INT8x32. It always returns desc_status == CUDNN_STATUS_BAD_PARAM:

cudnnTensorDescriptor_t desc;
cudnnCreateTensorDescriptor(&desc);

// n = 128, c = 128 (a multiple of 32), h = 3, w = 3; still returns CUDNN_STATUS_BAD_PARAM
cudnnStatus_t desc_status =
    cudnnSetTensor4dDescriptor(desc, CUDNN_TENSOR_NCHW_VECT_C, CUDNN_DATA_INT8x32,
                               128, 128, 3, 3);
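A small debugging sketch that could be added right after the call above to print the exact status text:

// Report the cuDNN status as a human-readable string.
if (desc_status != CUDNN_STATUS_SUCCESS) {
    fprintf(stderr, "cudnnSetTensor4dDescriptor failed: %s\n",
            cudnnGetErrorString(desc_status));
}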

AlexeyAB avatar Oct 23 '18 23:10 AlexeyAB

@daniel89710

Also, what acceleration do you get by using the -quantized flag with the default CUDNN_DATA_INT8x4 (without CUDNN_DATA_INT8x32) on Xavier / Volta / Turing, compared to FP32 calculation (without the -quantized flag)?

The documentation says that we should use INT8x32 to accelerate inference, but it does not say that this is a mandatory condition for using Tensor Cores. It seems both INT8x4 and INT8x32 will use Tensor Cores, but INT8x32 will be faster: https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#tensor-ops-speedup-tips

CUDNN_DATA_INT8x32 to accelerate your convolution computation

AlexeyAB avatar Oct 25 '18 17:10 AlexeyAB

@AlexeyAB Sorry for the late reply. I tried the -quantized flag on Xavier and got a slight slowdown compared with no -quantized flag. I think we still cannot use Tensor Cores when using the -quantized flag. I checked with nvprof whether this ran on Tensor Cores, but it seems to be executed only on CUDA cores, without Tensor Cores.

daniel89710 avatar Oct 30 '18 12:10 daniel89710