YOLOv9-QAT TensorRT Q/DQ: Improved Speed and Zero Loss Accuracy

Open levipereira opened this issue 3 months ago • 6 comments

@WongKinYiu

I have developed the initial version of YOLOv9-QAT using the Q/DQ method, tailored specifically for YOLOv9 models intended for execution solely on TensorRT.
This implementation currently supports only the Inference Models (Converted and Gelan models).

The source code is available in the yolov9-qat branch.

Challenges

Quantizing all layers can, in some cases, decrease accuracy and increase latency, primarily due to the complexity of the last layer. To mitigate this, use the --no-last-layer flag of qat.py quantize to exclude the last layer from quantization.

In this version, the scaling of the Quantize/Dequantize (Q/DQ) nodes is not yet optimized, which could lead to unnecessary data formats being generated. Restricting the Q/DQ scales in models/quantize.py so that they match the data format is essential to reduce latency. Contributions from the community are welcome, as their knowledge is essential for the correct implementation of this functionality.
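To make the scale-matching idea concrete, here is a minimal sketch of symmetric per-tensor INT8 Q/DQ arithmetic in plain Python. It is not the code in models/quantize.py: the helper names (`compute_scale`, `quantize`, `dequantize`, `shared_scale`), the "reuse the largest scale" rule, and the sample activations are all assumptions made for illustration. It only shows that two branches quantized with one common scale end up on the same INT8 grid.

```python
# Illustrative only: symmetric per-tensor INT8 Q/DQ, not the repo's implementation.

def compute_scale(values, qmax=127):
    """Scale that maps the largest magnitude onto qmax."""
    amax = max(abs(v) for v in values)
    return amax / qmax if amax > 0 else 1.0

def quantize(values, scale, qmin=-128, qmax=127):
    """Q: float -> INT8 grid (round, then clamp)."""
    return [max(qmin, min(qmax, round(v / scale))) for v in values]

def dequantize(q_values, scale):
    """DQ: INT8 grid -> float."""
    return [q * scale for q in q_values]

def shared_scale(*tensors):
    """Hypothetical rule: branches feeding the same op reuse the largest scale."""
    return max(compute_scale(t) for t in tensors)

branch_a = [0.5, -1.0, 2.0]   # made-up activations
branch_b = [0.25, 4.0, -3.0]  # made-up activations

scale_a = compute_scale(branch_a)  # 2.0 / 127
scale_b = compute_scale(branch_b)  # 4.0 / 127
common = shared_scale(branch_a, branch_b)

# With one common scale, both branches share the same INT8 grid.
qa = quantize(branch_a, common)
qb = quantize(branch_b, common)

# Round-trip error stays within half a quantization step when nothing is clamped.
roundtrip_a = dequantize(qa, common)
max_err = max(abs(x - y) for x, y in zip(branch_a, roundtrip_a))
```

The trade-off in this sketch is that the branch with the smaller range (`branch_a`) is quantized more coarsely than it would be with its own scale.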

Files Added / Modified

qat.py - Main

usage: qat.py [-h] {quantize,sensitive,eval} ...
positional arguments:
  {quantize,sensitive,eval}
    quantize            PTQ/QAT finetune ...
    sensitive           Sensitive layer analysis
    eval                Do evaluate
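
The usage text above has the shape that argparse prints for sub-commands. As a rough sketch of how such a command line can be laid out (this is not the actual qat.py; only the sub-command names, their help strings, and the --no-last-layer flag come from this post, and the help text for the flag is an assumption):

```python
import argparse

def build_parser():
    """Sketch of a qat.py-style CLI with three sub-commands."""
    parser = argparse.ArgumentParser(prog="qat.py")
    sub = parser.add_subparsers(dest="cmd", required=True)

    quantize = sub.add_parser("quantize", help="PTQ/QAT finetune ...")
    # Flag named in the post; the help string here is illustrative.
    quantize.add_argument("--no-last-layer", action="store_true",
                          help="exclude the last layer from quantization")

    sub.add_parser("sensitive", help="Sensitive layer analysis")
    sub.add_parser("eval", help="Do evaluate")
    return parser

# argparse exposes --no-last-layer as args.no_last_layer.
args = build_parser().parse_args(["quantize", "--no-last-layer"])
eval_args = build_parser().parse_args(["eval"])
```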

models/quantize.py - Quantize Module

models/quantize_rules.py - Quantize Rules

export.py - Changed to automatically detect QAT models and export them when using the flag --include onnx / onnx_end2end

Accuracy Report

QAT - YOLOV9-C - ALL LAYERS
Eval Model | AP       | AP50     | Precision  | Recall
-------------------------------------------------------
Origin     | 0.5297   | 0.699    | 0.7432     | 0.634
PTQ        | 0.5295   | 0.6978   | 0.7455     | 0.6306
QAT - Best | 0.5291   | 0.6978   | 0.7449     | 0.632

QAT - YOLOV9-C - NO QAT LAST LAYER
Eval Model | AP       | AP50     | Precision  | Recall
-------------------------------------------------------
Origin     | 0.5297   | 0.699    | 0.7432     | 0.634
PTQ        | 0.529    | 0.698    | 0.7459     | 0.6297
QAT - Best | 0.5299   | 0.6984   | 0.7469     | 0.6305

QAT - YOLOV9-E - ALL LAYERS
Eval Model | AP       | AP50     | Precision  | Recall
-------------------------------------------------------
Origin     | 0.5576   | 0.7246   | 0.7547     | 0.6649
PTQ        | 0.5565   | 0.7241   | 0.7499     | 0.6649
QAT - Best | 0.5566   | 0.7232   | 0.7538     | 0.6637


QAT - YOLOV9-E - NO QAT LAST LAYER
Eval Model | AP       | AP50     | Precision  | Recall
-------------------------------------------------------
Origin     | 0.5576   | 0.7246   | 0.7547     | 0.6649
PTQ        | 0.5569   | 0.7242   | 0.7497     | 0.6646
QAT - Best | 0.5569   | 0.7239   | 0.7486     | 0.6657
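
For a quick read of the four tables above, the AP change of the best QAT checkpoint relative to the original model can be computed directly. The snippet below is only arithmetic on the AP values copied from those tables; the dictionary keys and the helper name are made up for this example.

```python
# AP values copied from the tables above: (Origin, post-training quantized, QAT best).
results = {
    "yolov9-c all layers":    (0.5297, 0.5295, 0.5291),
    "yolov9-c no last layer": (0.5297, 0.5290, 0.5299),
    "yolov9-e all layers":    (0.5576, 0.5565, 0.5566),
    "yolov9-e no last layer": (0.5576, 0.5569, 0.5569),
}

def relative_change_pct(baseline, value):
    """Signed change of `value` versus `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline

changes = {
    name: relative_change_pct(origin, qat_best)
    for name, (origin, _ptq, qat_best) in results.items()
}

for name, pct in changes.items():
    print(f"{name}: {pct:+.2f}% AP vs. Origin")

worst = min(changes.values())  # most negative change across the four runs
```

On these numbers the largest relative AP drop is below 0.2%, and the YOLOv9-C run without the last layer quantized ends up slightly above the original.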

Results using TensorRT engine models on Triton Server. Tool: https://github.com/levipereira/triton-client-yolo

========================= EVALUATION SUMMARY - YOLOV9-C ========================
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.528
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.701
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.577
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.361
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.582
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.689
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.392
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.652
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.701
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.538
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.759
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.848
================================================================================
mAP@0.5:0.95: 0.528
mAP@0.5:      0.701
mAP@0.75:     0.577
================================================================================


========================= EVALUATION SUMMARY - YOLOV9-C-QAT ========================
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.528
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.699
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.576
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.359
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.581
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.692
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.392
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.651
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.699
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.534
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.758
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.845
================================================================================
mAP@0.5:0.95: 0.528
mAP@0.5:      0.699
mAP@0.75:     0.576
================================================================================
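
To compare two such summaries line by line, the COCO-style rows can be parsed into a dictionary and subtracted. This is a small sketch, not part of triton-client-yolo: the regular expression, the function name, and the two-line excerpts (copied from the summaries above) are assumptions for illustration.

```python
import re

# Matches rows such as:
# Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.528
ROW = re.compile(
    r"Average (?:Precision|Recall)\s+\((AP|AR)\) @\[ IoU=([\d.:]+)\s*"
    r"\| area=\s*(\w+) \| maxDets=\s*(\d+) \] = ([\d.]+)"
)

def parse_summary(text):
    """Map (metric, IoU range, area, maxDets) -> value for every row found."""
    metrics = {}
    for kind, iou, area, max_dets, value in ROW.findall(text):
        metrics[(kind, iou, area, int(max_dets))] = float(value)
    return metrics

fp16_text = """
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.528
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.701
"""
qat_text = """
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.528
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.699
"""

fp16 = parse_summary(fp16_text)
qat = parse_summary(qat_text)

# Per-metric difference (QAT minus FP16) for rows present in both summaries.
delta = {key: qat[key] - fp16[key] for key in fp16 if key in qat}
```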

Latency Report

Device Properties:
- Selected Device: NVIDIA GeForce RTX 4090
  - Compute Capability: 8.9
  - SMs: 128.0
  - Compute Clock Rate: 2.58 GHz
  - Device Global Memory: 24207 MiB
  - Shared Memory per SM: 100 KiB
  - Memory Bus Width: 384 bits
  - Memory Clock Rate: 10.501 GHz

Table Info:

"Average time" refers to the sum of the layer latencies when the layers are profiled separately.
"Throughput" is measured in inferences per second (IPS).

Origin

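Model    | Precision Type | Batch Size | Layers | Weights (MB) | Activations (MB) | Throughput (IPS) | Total Throughput (IPS) | Average time (ms)
-----------------------------------------------------------------------------------------------------------------------------------------------
yolov9-c | FP16           | 1          | 271    | 48.2         | 611.7            | 792              | 792                    | 2.1
yolov9-c | FP16           | 8          | 273    | 48.2         | 4809.1           | 151              | 1209                   | 7.3
yolov9-e | FP16           | 8          | 477    | 109.3        | 13461.3          | 57               | 457                    | 18.8
yolov9-e | FP16           | 1          | 487    | 109.3        | 1706.5           | 353              | 353                    | 4.3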

Last Layer not Quantized

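Model        | Precision Type | Batch Size | Layers | Weights (MB) | Activations (MB) | Throughput (IPS) | Total Throughput (IPS) | Average time (ms)
---------------------------------------------------------------------------------------------------------------------------------------------------
yolov9-c-qat | FP16 INT8      | 1          | 288    | 29.4         | 534.7            | 951              | 951                    | 1.9
yolov9-c-qat | FP16 INT8      | 8          | 287    | 29.4         | 4190.2           | 181              | 1447                   | 6.4
yolov9-e-qat | FP16 INT8      | 1          | 526    | 63.1         | 1757.0           | 405              | 405                    | 4.1
yolov9-e-qat | FP16 INT8      | 8          | 526    | 63.1         | 13407.7          | 60               | 482                    | 18.2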

All Layers Quantized

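Model        | Precision Type | Batch Size | Layers | Weights (MB) | Activations (MB) | Throughput (IPS) | Total Throughput (IPS) | Average time (ms)
---------------------------------------------------------------------------------------------------------------------------------------------------
yolov9-c-qat | FP16 INT8      | 1          | 295    | 24.2         | 540.1            | 957              | 957                    | 1.9
yolov9-c-qat | FP16 INT8      | 8          | 293    | 24.2         | 4216.7           | 193              | 1547                   | 6.1
yolov9-e-qat | FP16 INT8      | 1          | 532    | 57.8         | 1779.5           | 396              | 396                    | 4.1
yolov9-e-qat | FP16 INT8      | 8          | 532    | 57.8         | 13431.8          | 62               | 493                    | 17.8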
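
As a reading aid for the three latency tables, the speedup of each quantized engine over the FP16 baseline can be derived from the Total Throughput (IPS) column. The snippet below is plain arithmetic on values copied from the tables above; the dictionary layout and the function name are made up for this example.

```python
# Total Throughput (IPS) copied from the tables above, keyed by (model, batch size).
fp16 = {("yolov9-c", 1): 792, ("yolov9-c", 8): 1209,
        ("yolov9-e", 1): 353, ("yolov9-e", 8): 457}
qat_no_last = {("yolov9-c", 1): 951, ("yolov9-c", 8): 1447,
               ("yolov9-e", 1): 405, ("yolov9-e", 8): 482}
qat_all = {("yolov9-c", 1): 957, ("yolov9-c", 8): 1547,
           ("yolov9-e", 1): 396, ("yolov9-e", 8): 493}

def speedup(baseline, candidate):
    """Throughput ratio candidate / baseline for each (model, batch size)."""
    return {key: candidate[key] / baseline[key] for key in baseline}

no_last_vs_fp16 = speedup(fp16, qat_no_last)
all_vs_fp16 = speedup(fp16, qat_all)

for key in fp16:
    print(f"{key[0]} batch {key[1]}: "
          f"no-last-layer x{no_last_vs_fp16[key]:.2f}, "
          f"all-layers x{all_vs_fp16[key]:.2f}")
```

On these figures the speedups range from roughly 1.05x (yolov9-e, batch 8, last layer not quantized) to roughly 1.28x (yolov9-c, batch 8, all layers quantized).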

Mar 15 '24 21:03 levipereira

Added:

Two repositories to test YOLOv9 QAT Models

Triton-Server: Deploy Models on TensorRT format. https://github.com/levipereira/triton-server-yolo
Triton Client: Allows users to evaluate coco dataset or inference their own images/videos. https://github.com/levipereira/triton-client-yolo

Mar 18 '24 01:03 levipereira

Thanks for sharing. It would be better if there were an ONNX export for independent deployment, not just Triton.

Mar 18 '24 02:03 ou525

@levipereira It would be interesting to see how the performance on Triton compares with YOLOv7-QAT, since the paper does not discuss it and neither does #143.

Mar 19 '24 20:03 trivedisarthak

@levipereira Thank you for your contribution. I have a question: do I have to train the model in order to get a quantized model?

Mar 25 '24 07:03 demuxin

@demuxin Yes.

Mar 26 '24 14:03 levipereira

@trivedisarthak check OP

Mar 26 '24 14:03 levipereira

The Original Implementation in #327

Apr 06 '24 14:04 levipereira
