FastDeploy
[Backend] cuda normalize and permute, cuda concat, optimized ppcls, ppdet & ppseg
PR types
Backend
Description
- Add a CUDA kernel that fuses color conversion, normalization, and permutation.
- Add GPU support to Concat.
- Apply this optimization to PPClas, PPDet, and PPSeg.
- Tested on PP-LCNetV2, PPYOLOE, and UNet.
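For readers unfamiliar with the fused step, the following is a minimal CPU reference sketch of what such a kernel computes per element. It is not the kernel from this PR. The function name `fused_preprocess`, the 1/255 scaling, the BGR input order, and the HWC to CHW permutation are assumptions made for illustration.

```python
from typing import List, Sequence

Pixel = Sequence[int]        # one interleaved pixel, e.g. (B, G, R)
Image = List[List[Pixel]]    # HWC layout: image[y][x][c]


def fused_preprocess(
    image_bgr: Image,
    mean: Sequence[float],
    std: Sequence[float],
    swap_rb: bool = True,
) -> List[List[List[float]]]:
    """Color convert + normalize + permute (HWC -> CHW) in one pass.

    Hypothetical reference: mean/std are given in output channel order
    and apply to values scaled to [0, 1] (an assumption, not from the PR).
    """
    height = len(image_bgr)
    width = len(image_bgr[0])
    channels = len(mean)
    # Output is planar CHW: out[c][y][x].
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for y in range(height):
        for x in range(width):
            # In a CUDA version, each (y, x) pair could map to one thread.
            for c in range(channels):
                # Reversing the channel index turns BGR into RGB.
                src_c = channels - 1 - c if swap_rb else c
                value = image_bgr[y][x][src_c] / 255.0
                out[c][y][x] = (value - mean[c]) / std[c]
    return out
```

Doing all three steps in a single pass reads each input pixel once and writes each output element once, which avoids the intermediate buffers that separate convert, normalize, and permute stages would need.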
PPClas end-to-end test on T4 with TensorRT 8.4. Latency in ms.
| Version | Model | FP32 | FP16 | INT8 |
|---|---|---|---|---|
| 0.3.0 | PP-LCNetV2 | 2.23 | 1.87 | 1.78 |
| This PR | PP-LCNetV2 | 1.54 | 1.03 | 0.89 |
PPClas end-to-end latency drops by about 31% (FP32), 45% (FP16), and 50% (INT8).
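The improvement per precision can be checked directly from the table above. The sketch below is only a worked calculation on the reported latencies. The helper name `latency_reduction_pct` is hypothetical.

```python
def latency_reduction_pct(before_ms: float, after_ms: float) -> float:
    """Percentage by which latency dropped relative to the old value."""
    return (before_ms - after_ms) / before_ms * 100.0


# (before, after) end-to-end latencies in ms, copied from the PPClas table.
pplcnet = {
    "FP32": (2.23, 1.54),
    "FP16": (1.87, 1.03),
    "INT8": (1.78, 0.89),
}

reductions = {
    precision: round(latency_reduction_pct(before, after), 1)
    for precision, (before, after) in pplcnet.items()
}
```

The FP32 reduction is the smallest of the three. Inference is slowest at FP32, so the preprocessing savings make up a smaller share of the total latency.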
PPDet end-to-end test on T4 with TensorRT 8.4. Latency in ms.
| Version | Model | FP32 |
|---|---|---|
| Before | PPYOLOE | 35.05 |
| After | PPYOLOE | 32.97 |
Inference alone takes 32.6 ms, so preprocessing drops from 2.45 ms to 0.37 ms, which is about 6.6x faster.
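The preprocessing figures follow from subtracting the inference time from the end-to-end latency. The sketch below reproduces that arithmetic, assuming end-to-end time is simply preprocessing plus inference with everything else negligible. The helper names are hypothetical.

```python
def preprocess_ms(end_to_end_ms: float, inference_ms: float) -> float:
    """Time left over once inference is subtracted (assumed to be preprocessing)."""
    return end_to_end_ms - inference_ms


def speedup(before_ms: float, after_ms: float) -> float:
    """How many times faster the new latency is."""
    return before_ms / after_ms


INFERENCE_MS = 32.6  # PPYOLOE inference time reported above

before = preprocess_ms(35.05, INFERENCE_MS)  # roughly 2.45 ms
after = preprocess_ms(32.97, INFERENCE_MS)   # roughly 0.37 ms
ratio = speedup(before, after)               # roughly 6.6x
```

Because inference dominates the total, a 6.6x faster preprocessing stage shows up as only about a 6% reduction in end-to-end latency for this model.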
PPSeg UNet preprocessing test on T4 with the Paddle Inference GPU backend. Latency in ms.
| Input size | Preprocess latency before | Preprocess latency after |
|---|---|---|
| 2048x1024 | 33.921 | 1.405 |
| 320x160 | 0.228 | 0.194 |
Preprocessing is 1.18x to 24.14x faster, depending on input size.