FastDeploy
YOLO CUDA preprocessing util and YOLOv5 CUDA preprocessing
PR types
Performance optimization
PR changes
Others - preprocessing
Describe
- Add a YOLO CUDA preprocessing util
- YOLOv5: integrate CUDA preprocessing
- CMake changes to support compiling CUDA source files
Latency is end to end (preprocessing, inference, and postprocessing) and is reported in milliseconds. Tested on a P40 GPU with TensorRT 8.4.
Model | Latency, CPU preprocessing (ms) | Latency, CUDA preprocessing (ms) | Latency reduction |
---|---|---|---|
yolov5s | 41 | 28 | 31.7% $\downarrow$ |
yolov5lite | 40 | 22 | 45% $\downarrow$ |
yolov6s | 25 | 11 | 56% $\downarrow$ |
yolov7 | 47 | 32 | 31.9% $\downarrow$ |
yolov7_e2e | 27 | 16 | 40.7% $\downarrow$ |
The CUDA preprocessing for YOLO resizes images with a warp-affine transform, which produces slightly different pixel values from cv::resize(), so the mAP differs slightly. The mAP (IoU=0.50:0.95 | area=all) results below were measured on coco_val_2017 (5000 images) with TensorRT models.
Model | mAP (CPU preprocessing) | mAP (CUDA preprocessing) |
---|---|---|
yolov5s | 0.372 | 0.368 |
yolov6s | 0.424 | 0.418 |
yolov7 | 0.514 | 0.498 |
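
To illustrate why a warp-affine resize can differ from cv::resize(), here is a minimal CPU reference sketch of a letterbox-style warp affine with bilinear sampling. It is not the CUDA kernel from this PR. The function name `WarpAffineLetterbox`, the single-channel layout, the pixel-center sampling convention, and the padding value 114 are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference, not FastDeploy's API. Resizes a single-channel
// image into a dst_w x dst_h canvas with one uniform scale, centers it, and
// fills the uncovered border with `pad` (114 is an assumed default).
std::vector<uint8_t> WarpAffineLetterbox(const std::vector<uint8_t>& src,
                                         int src_w, int src_h,
                                         int dst_w, int dst_h,
                                         uint8_t pad = 114) {
  // Forward transform: dst = scale * src + (tx, ty).
  const float scale = std::min(dst_w / static_cast<float>(src_w),
                               dst_h / static_cast<float>(src_h));
  const float tx = (dst_w - scale * src_w) * 0.5f;
  const float ty = (dst_h - scale * src_h) * 0.5f;

  // Source lookup that returns the padding value outside the image.
  auto fetch = [&](int x, int y) -> float {
    if (x < 0 || x >= src_w || y < 0 || y >= src_h) return pad;
    return src[static_cast<std::size_t>(y) * src_w + x];
  };

  std::vector<uint8_t> dst(static_cast<std::size_t>(dst_w) * dst_h, pad);
  for (int y = 0; y < dst_h; ++y) {
    for (int x = 0; x < dst_w; ++x) {
      // Inverse transform: map the destination pixel center back to source.
      const float sx = (x + 0.5f - tx) / scale - 0.5f;
      const float sy = (y + 0.5f - ty) / scale - 0.5f;
      // Entirely outside the source image: keep the padding value.
      if (sx <= -1.f || sx >= src_w || sy <= -1.f || sy >= src_h) continue;

      const int x0 = static_cast<int>(std::floor(sx));
      const int y0 = static_cast<int>(std::floor(sy));
      const float fx = sx - x0;
      const float fy = sy - y0;

      // Bilinear blend of the four neighbouring source pixels.
      const float top = (1.f - fx) * fetch(x0, y0) + fx * fetch(x0 + 1, y0);
      const float bot = (1.f - fx) * fetch(x0, y0 + 1) + fx * fetch(x0 + 1, y0 + 1);
      const float v = (1.f - fy) * top + fy * bot;
      dst[static_cast<std::size_t>(y) * dst_w + x] =
          static_cast<uint8_t>(v + 0.5f);
    }
  }
  return dst;
}
```

In this sketch, border pixels can blend with the padding value and the sampling positions follow the affine matrix rather than cv::resize()'s own coordinate mapping, which is one plausible source of the small mAP differences in the table above.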