yolo2_light
Int8-inference on Tensor Core
I tried quantized YOLOv3 on a Volta GPU, but it didn't seem to run on Tensor Cores. The cuDNN documentation, section 2.8.2, recommends using "CUDNN_DATA_INT8x32" for Tensor Core operations. https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html
I think cuDNN supports only "CUDNN_DATA_INT8x32" for running INT8 on Tensor Cores. Is this correct?
I think yes, CUDNN_DATA_INT8x32 should be used,
and it is supported only on the Xavier GPU (CC 7.2); Tesla V100 (CC 7.0) and GeForce RTX 2080 Ti / 2070 (CC 7.5) are not supported.
https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#tensor-ops-speedup-tips
It seems Tensor Cores for INT8 are supported only on CC 7.2: Jetson Xavier, DRIVE PX Xavier/Pegasus with the Xavier SoC. https://en.wikipedia.org/wiki/CUDA#GPUs_supported As I understand, they are not supported on Tesla V100 (CC 7.0) or GeForce RTX 2080 Ti / 2070 (CC 7.5).
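To check which compute capability the current device reports, here is a minimal sketch using the standard CUDA runtime API (not part of yolo2_light):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    // Per the note above, INT8 Tensor Cores require CC 7.2 (prop.major == 7, prop.minor == 2).
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}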
CUDNN_DATA_INT8x32 is required to be used here if channels % 32 == 0:
- https://github.com/AlexeyAB/yolo2_light/blob/fb565fa43d76a8d1f3f741dbfdfefdb4cdfbdf4b/src/additionally.c#L1690-L1695
- https://github.com/AlexeyAB/yolo2_light/blob/fb565fa43d76a8d1f3f741dbfdfefdb4cdfbdf4b/src/yolov2_forward_network_gpu.cu#L604
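For illustration only, a hypothetical helper (not the repo's actual code) that captures this per-layer selection rule:

#include <cudnn.h>

// Hypothetical helper: choose the cuDNN INT8 data type for a layer by its channel count.
static cudnnDataType_t pick_int8_type(int channels)
{
    if (channels % 32 == 0) return CUDNN_DATA_INT8x32;  // Tensor Core path (CC 7.2, cuDNN >= 7.2)
    if (channels % 4 == 0)  return CUDNN_DATA_INT8x4;   // DP4A path (Pascal and newer)
    return CUDNN_DATA_INT8;                             // plain INT8 fallback
}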
Thank you for your reply. I tried it on a Tesla V100, so I will try on Xavier. Should I change only CUDNN_DATA_INT8x4 to CUDNN_DATA_INT8x32?
@daniel89710 I added a fix; update your code from GitHub and un-comment these 2 lines:
- https://github.com/AlexeyAB/yolo2_light/blob/1f270bf1c1c04dde3fbc779dd0ce90f85490301a/src/yolov2_forward_network_gpu.cu#L604
- https://github.com/AlexeyAB/yolo2_light/blob/1f270bf1c1c04dde3fbc779dd0ce90f85490301a/src/additionally.c#L1787
So for layers whose number of channels is a multiple of 32, Tensor Cores will be used (Xavier); for layers whose channels are a multiple of 4, DP4A will be used (Pascal and higher).
Also check that you are using cuDNN >= 7.2.
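The cuDNN version can be verified at run time with a small check like this (standard cuDNN API, nothing repo-specific):

#include <stdio.h>
#include <cudnn.h>

int main(void)
{
    // CUDNN_VERSION is the compile-time version, cudnnGetVersion() the loaded library's version;
    // both should be at least 7200 (i.e. cuDNN 7.2) for the INT8x32 path.
    printf("compiled against cuDNN %d, running with cuDNN %zu\n", CUDNN_VERSION, cudnnGetVersion());
    return 0;
}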
@AlexeyAB Thank you for quick update. I will check.
Hi, @AlexeyAB
I tried your code on Xavier after updating it and un-commenting the 2 lines. However, I got a failure at checkCUDNN(transform_status) in yolov2_forward_network_gpu.cu:L626. Error: 3 - CUDNN_STATUS_BAD_PARAM.
Do you have any ideas?
@daniel89710 Hi,
This is very similar to a bug in cuDNN.
I tried cuDNN 7.3.1 for CUDA 10 + CUDA10 + MSVS2015 on Windows 7 x64.
And I can't even create any descriptor for CUDNN_DATA_INT8x32. It always returns (desc_status == CUDNN_STATUS_BAD_PARAM).
#include <cudnn.h>

cudnnTensorDescriptor_t desc;
cudnnCreateTensorDescriptor(&desc);
// Try to create a 128x128x3x3 tensor descriptor in NCHW_VECT_C layout with INT8x32 data.
cudnnStatus_t desc_status =
    cudnnSetTensor4dDescriptor(desc, CUDNN_TENSOR_NCHW_VECT_C, CUDNN_DATA_INT8x32, 128, 128, 3, 3);
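To see a readable name for a failing status, cudnnGetErrorString can be used; a small sketch (the helper name here is made up, only cudnnGetErrorString itself is cuDNN API):

#include <stdio.h>
#include <cudnn.h>

// Print a readable name for a cuDNN status, e.g. "CUDNN_STATUS_BAD_PARAM".
static void print_cudnn_status(const char *what, cudnnStatus_t s)
{
    printf("%s: %s\n", what, cudnnGetErrorString(s));
}

Calling print_cudnn_status("cudnnSetTensor4dDescriptor", desc_status) after the snippet above would print the status name instead of just the numeric code 3.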
@daniel89710
Also, what acceleration do you get by using the -quantized flag with the default CUDNN_DATA_INT8x4 (without CUDNN_DATA_INT8x32) on Xavier / Volta / Turing, compared to FP32 calculation (without the -quantized flag)?
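One way to measure such a comparison is to time the forward pass with CUDA events; a rough sketch, where run_inference() is just a hypothetical placeholder for whatever inference call is being timed:

#include <cuda_runtime.h>

// Hypothetical placeholder for the inference call being measured.
void run_inference(void);

float time_inference_ms(int iters)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) run_inference();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iters;  // average ms per run; compare with and without -quantized
}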
It is only written that we should use INT8x32 to accelerate inference; it is not written that this is a mandatory condition for using Tensor Cores. It seems both INT8x4 and INT8x32 will use Tensor Cores, but INT8x32 will be faster: https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#tensor-ops-speedup-tips
"CUDNN_DATA_INT8x32 to accelerate your convolution computation"
@AlexeyAB Sorry for the late reply. I tried the -quantized flag on Xavier and got a slight slowdown compared with no -quantized flag. I think we still cannot use Tensor Cores when using the -quantized flag. I checked with nvprof whether this ran on Tensor Cores or not, but it seems to be executed only on CUDA cores, without Tensor Cores.