YOLOv9-QAT TensorRT Q/DQ: Improved Speed and Zero Loss Accuracy

Open levipereira opened this issue 3 months ago • 6 comments

@WongKinYiu

I have developed the initial version of YOLOv9-QAT using the Q/DQ method, tailored specifically for YOLOv9 models intended for execution solely on TensorRT.
This implementation currently supports only the Inference Models (Converted and Gelan models).
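
For readers new to the Q/DQ method mentioned above, here is a minimal pure-Python sketch of symmetric INT8 quantize/dequantize. It is illustrative only and is not taken from models/quantize.py; the function names and the amax-based scale are assumptions made for the example.

```python
# Illustrative sketch only: symmetric per-tensor INT8 Q/DQ.
# Function names are hypothetical, not from the yolov9-qat branch.

def quantize(x, scale):
    """Quantize: float -> integer clamped to the INT8 range [-128, 127]."""
    q = round(x / scale)
    return max(-128, min(127, q))

def dequantize(q, scale):
    """Dequantize: INT8 integer -> approximate float."""
    return q * scale

def fake_quantize(values, amax):
    """Q followed by DQ, using scale = amax / 127 (assumed calibration range)."""
    scale = amax / 127.0
    return [dequantize(quantize(v, scale), scale) for v in values]

x = [-1.0, -0.5, 0.0, 0.26, 1.0, 3.0]
y = fake_quantize(x, amax=1.0)
for a, b in zip(x, y):
    print(f"{a:+.3f} -> {b:+.5f}")
```

Values inside the calibrated range come back with a small rounding error, while values beyond amax (3.0 here) are clipped. That clipping and rounding is the error QAT fine-tuning tries to absorb.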

The source code is available in the yolov9-qat branch.

Challenges

Quantizing all layers can, in some cases, decrease accuracy and increase latency, primarily due to the complexity of the last layer. To mitigate this, use the qat.py quantize --no-last-layer flag to exclude the last layer from quantization.

In this version, the scaling of the Quantize/Dequantize (Q/DQ) nodes is not yet optimized, which can cause unnecessary data formats to be generated. Restricting the Q/DQ scales in models/quantize.py so that the data formats match is essential to reduce latency. Contributions from the community are welcome, as their knowledge is essential for the correct implementation of this functionality.
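
One common way to restrict Q/DQ scales is to force all branches that feed the same operation (for example a concat) to share one calibration range, so they end up with the same scale. The sketch below shows that idea in plain Python. Whether this is the exact restriction needed in models/quantize.py is an assumption; the branch names, values, and helper functions are hypothetical.

```python
# Hypothetical sketch: give every input branch of one op the same Q/DQ scale.
# Nothing here is taken from models/quantize.py or models/quantize_rules.py.

def unify_amax(branch_amax):
    """Return a copy where every branch uses the largest calibrated amax."""
    shared = max(branch_amax.values())
    return {name: shared for name in branch_amax}

def scales(branch_amax, qmax=127):
    """Symmetric INT8 scale for each branch: amax / 127."""
    return {name: amax / qmax for name, amax in branch_amax.items()}

# Assumed calibration results for three branches feeding one concat.
calibrated = {"branch_a": 2.5, "branch_b": 4.0, "branch_c": 3.1}

before = scales(calibrated)              # three different scales
after = scales(unify_amax(calibrated))   # one shared scale

print("distinct scales before:", len(set(before.values())))
print("distinct scales after: ", len(set(after.values())))
```

Using the largest amax avoids clipping any branch, at the cost of slightly coarser resolution on the branches with a smaller range.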

Files Added / Modified

qat.py - Main

usage: qat.py [-h] {quantize,sensitive,eval} ...
positional arguments:
  {quantize,sensitive,eval}
    quantize            PTQ/QAT finetune ...
    sensitive           Sensitive layer analysis
    eval                Do evaluate

models/quantize.py - Quantize Module

models/quantize_rules.py - Quantize Rules

export.py - Changed to automatically detect QAT models and export them when using the flag --include onnx / onnx_end2end
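
The subcommand layout shown in the qat.py usage text can be sketched with argparse as follows. This is a stand-in for illustration, not the actual qat.py source: only the three subcommand names, their help strings, and the --no-last-layer flag come from this post, and the real script takes further arguments that are not listed here.

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the usage text; not the real qat.py."""
    parser = argparse.ArgumentParser(prog="qat.py")
    sub = parser.add_subparsers(dest="cmd", required=True)

    quant = sub.add_parser("quantize", help="PTQ/QAT finetune ...")
    # Flag named in the post: skip quantization of the last layer.
    quant.add_argument("--no-last-layer", action="store_true",
                       help="exclude the last layer from quantization")

    sub.add_parser("sensitive", help="Sensitive layer analysis")
    sub.add_parser("eval", help="Do evaluate")
    return parser

args = build_parser().parse_args(["quantize", "--no-last-layer"])
print(args.cmd, args.no_last_layer)
```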

Accuracy Report

QAT YOLOV9-C - ALL LAYERS 
Eval Model | AP       | AP50     | Precision  | Recall
-------------------------------------------------------
Origin     | 0.5297   | 0.699    | 0.7432     | 0.634
PTQ        | 0.5295   | 0.6978   | 0.7455     | 0.6306
QAT- Best  | 0.5291   | 0.6978   | 0.7449     | 0.632

QAT - YOLOV9-C  - NO QAT LAST LAYER 
Eval Model | AP       | AP50     | Precision  | Recall  
-------------------------------------------------------
Origin     | 0.5297   | 0.699    | 0.7432     | 0.634   
PTQ        | 0.529    | 0.698    | 0.7459     | 0.6297  
QAT- Best  | 0.5299   | 0.6984   | 0.7469     | 0.6305  

QAT - YOLOV9-E ALL-LAYERS
Eval Model | AP       | AP50     | Precision  | Recall
-------------------------------------------------------
Origin     | 0.5576   | 0.7246   | 0.7547     | 0.6649
PTQ        | 0.5565   | 0.7241   | 0.7499     | 0.6649
QAT- Best  | 0.5566   | 0.7232   | 0.7538     | 0.6637


QAT - YOLOV9-E  - NO QAT  LAST LAYER
Eval Model | AP       | AP50     | Precision  | Recall  
-------------------------------------------------------
Origin     | 0.5576   | 0.7246   | 0.7547     | 0.6649  
PTQ        | 0.5569   | 0.7242   | 0.7497     | 0.6646  
QAT- Best  | 0.5569   | 0.7239   | 0.7486     | 0.6657  
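
To make the "zero loss" claim easier to check, the AP change from Origin to QAT-Best in the four tables above can be computed directly. The numbers are copied from the tables; the dictionary layout is just for this example.

```python
# AP values (Origin, QAT-Best) copied from the accuracy tables above.
ap = {
    "YOLOv9-C, all layers":    (0.5297, 0.5291),
    "YOLOv9-C, no last layer": (0.5297, 0.5299),
    "YOLOv9-E, all layers":    (0.5576, 0.5566),
    "YOLOv9-E, no last layer": (0.5576, 0.5569),
}

# Absolute AP change per configuration (negative means a drop).
deltas = {name: qat - origin for name, (origin, qat) in ap.items()}

for name, delta in deltas.items():
    origin = ap[name][0]
    print(f"{name}: {delta:+.4f} AP ({100 * delta / origin:+.2f}% relative)")

worst = min(deltas.values())
print(f"largest drop: {worst:+.4f} AP")
```

In these tables the largest drop is 0.0010 AP (YOLOv9-E, all layers), and YOLOv9-C without the last layer quantized is marginally above the original.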



Results using TensorRT engine models on Triton Server. Tool: https://github.com/levipereira/triton-client-yolo

========================= EVALUATION SUMMARY - YOLOV9-C ========================
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.528
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.701
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.577
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.361
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.582
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.689
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.392
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.652
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.701
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.538
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.759
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.848
================================================================================
mAP@0.5:0.95: 0.528
mAP@0.5:      0.701
mAP@0.75:     0.577
================================================================================


========================= EVALUATION SUMMARY - YOLOV9-C-QAT ========================
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.528
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.699
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.576
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.359
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.581
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.692
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.392
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.651
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.699
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.534
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.758
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.845
================================================================================
mAP@0.5:0.95: 0.528
mAP@0.5:      0.699
mAP@0.75:     0.576
================================================================================

Latency Report

  • Device Properties:
    • Selected Device: NVIDIA GeForce RTX 4090
      • Compute Capability: 8.9
      • SMs: 128
      • Compute Clock Rate: 2.58 GHz
      • Device Global Memory: 24207 MiB
      • Shared Memory per SM: 100 KiB
      • Memory Bus Width: 384 bits
      • Memory Clock Rate: 10.501 GHz

Table Info:

  • "Average time": the sum of the layer latencies when profiling layers separately.
  • "Throughput": measured in inferences per second (IPS).
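
The two throughput columns in the tables below appear to be related by the batch size: "Total Throughput" is roughly "Throughput" multiplied by the batch size. This reading is an inference from the reported numbers, not something stated by the profiling tool. A quick check on the batch-8 rows:

```python
# (model, batch size, Throughput, Total Throughput) copied from the latency tables.
rows = [
    ("yolov9-c FP16",             8, 151, 1209),
    ("yolov9-e FP16",             8,  57,  457),
    ("yolov9-c-qat no last layer", 8, 181, 1447),
    ("yolov9-e-qat no last layer", 8,  60,  482),
    ("yolov9-c-qat all layers",    8, 193, 1547),
    ("yolov9-e-qat all layers",    8,  62,  493),
]

# Relative gap between batch * Throughput and the reported Total Throughput.
gaps = {}
for model, batch, thr, total in rows:
    estimate = batch * thr
    gaps[model] = abs(estimate - total) / total
    print(f"{model}: {batch} x {thr} = {estimate} vs reported {total}")
```

The small differences are consistent with the per-batch figure having been rounded before it was reported.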

Origin

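Model    | Precision Type | Batch Size | Layers | Weights (MB) | Activations (MB) | Throughput (IPS) | Total Throughput (IPS) | Average time (ms)
-------------------------------------------------------------------------------------------------------------------------------------------------
yolov9-c | FP16           | 1          | 271    | 48.2         | 611.7            | 792              | 792                    | 2.1
yolov9-c | FP16           | 8          | 273    | 48.2         | 4809.1           | 151              | 1209                   | 7.3
yolov9-e | FP16           | 8          | 477    | 109.3        | 13461.3          | 57               | 457                    | 18.8
yolov9-e | FP16           | 1          | 487    | 109.3        | 1706.5           | 353              | 353                    | 4.3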

Last Layer not Quantized

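Model        | Precision Type | Batch Size | Layers | Weights (MB) | Activations (MB) | Throughput (IPS) | Total Throughput (IPS) | Average time (ms)
-----------------------------------------------------------------------------------------------------------------------------------------------------
yolov9-c-qat | FP16 INT8      | 1          | 288    | 29.4         | 534.7            | 951              | 951                    | 1.9
yolov9-c-qat | FP16 INT8      | 8          | 287    | 29.4         | 4190.2           | 181              | 1447                   | 6.4
yolov9-e-qat | FP16 INT8      | 1          | 526    | 63.1         | 1757.0           | 405              | 405                    | 4.1
yolov9-e-qat | FP16 INT8      | 8          | 526    | 63.1         | 13407.7          | 60               | 482                    | 18.2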

All Layers Quantized

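Model        | Precision Type | Batch Size | Layers | Weights (MB) | Activations (MB) | Throughput (IPS) | Total Throughput (IPS) | Average time (ms)
-----------------------------------------------------------------------------------------------------------------------------------------------------
yolov9-c-qat | FP16 INT8      | 1          | 295    | 24.2         | 540.1            | 957              | 957                    | 1.9
yolov9-c-qat | FP16 INT8      | 8          | 293    | 24.2         | 4216.7           | 193              | 1547                   | 6.1
yolov9-e-qat | FP16 INT8      | 1          | 532    | 57.8         | 1779.5           | 396              | 396                    | 4.1
yolov9-e-qat | FP16 INT8      | 8          | 532    | 57.8         | 13431.8          | 62               | 493                    | 17.8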
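
To summarize the latency tables, the speedup of each QAT engine over the FP16 baseline can be computed from the "Total Throughput" column. The values are copied from the three tables above; the dictionary structure and the helper name are only for this example.

```python
# Total Throughput (IPS) keyed by (model, batch size), copied from the tables above.
fp16 = {("yolov9-c", 1): 792, ("yolov9-c", 8): 1209,
        ("yolov9-e", 1): 353, ("yolov9-e", 8): 457}
qat_no_last = {("yolov9-c", 1): 951, ("yolov9-c", 8): 1447,
               ("yolov9-e", 1): 405, ("yolov9-e", 8): 482}
qat_all = {("yolov9-c", 1): 957, ("yolov9-c", 8): 1547,
           ("yolov9-e", 1): 396, ("yolov9-e", 8): 493}

def speedup(qat, baseline):
    """Ratio of QAT throughput to FP16 throughput for each (model, batch)."""
    return {key: qat[key] / baseline[key] for key in baseline}

no_last_speedup = speedup(qat_no_last, fp16)
all_speedup = speedup(qat_all, fp16)

for key in fp16:
    model, batch = key
    print(f"{model} batch {batch}: "
          f"no-last-layer x{no_last_speedup[key]:.2f}, "
          f"all-layers x{all_speedup[key]:.2f}")
```

On these numbers the gain is larger for yolov9-c than for yolov9-e in every configuration.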

levipereira • Mar 15 '24 21:03

Added:

Two repositories to test YOLOv9 QAT Models

levipereira • Mar 18 '24 01:03

Thanks for sharing. It would be better if there were an ONNX export for independent deployment, not just Triton.

ou525 • Mar 18 '24 02:03

@levipereira It would be interesting to see how the performance on Triton compares with YOLOv7-QAT, since the paper does not talk about it and neither does #143.

trivedisarthak • Mar 19 '24 20:03

@levipereira Thank you for your contribution. I have a question: do I have to train the model in order to get a quantized model?

demuxin • Mar 25 '24 07:03

@demuxin Yes.

levipereira • Mar 26 '24 14:03

@trivedisarthak check OP

levipereira • Mar 26 '24 14:03

The original implementation is in #327.

levipereira • Apr 06 '24 14:04