FastDeploy

[Backend] cuda normalize and permute, cuda concat, optimized ppcls, ppdet & ppseg

Open wang-xinyu opened this issue 1 year ago • 3 comments

PR types

Backend

Describe

  • CUDA kernel for color conversion + Normalize + Permute
  • GPU support for Concat
  • Applied these optimizations to PPCls, PPDet and PPSeg
  • Tested on PP-LCNetv2, PPYOLOE and UNet
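To make the first two bullets concrete, here is a minimal CPU reference sketch in plain Python of what a fused color convert + normalize + permute step computes, followed by a batch concat. It is not the kernel from this PR: the function names, the `(value / 255 - mean) / std` normalization and the 3-channel BGR to RGB swap are assumptions for illustration. The point is that all three steps touch each pixel once, which is why they fuse naturally into one GPU thread per pixel.

```python
def fused_preprocess(image_hwc, mean, std, swap_rb=True):
    """Convert one HWC uint8 image (nested lists) to a flat CHW float list.

    Hypothetical reference, assuming a 3-channel image. Per pixel, in one pass:
      1. optional BGR -> RGB channel swap (color convert),
      2. (value / 255 - mean[c]) / std[c] (normalize),
      3. write to the channel-major CHW position (permute).
    """
    height = len(image_hwc)
    width = len(image_hwc[0])
    channels = len(mean)
    plane = height * width
    out = [0.0] * (channels * plane)
    for y in range(height):
        for x in range(width):  # on a GPU, each (y, x) would map to one thread
            pixel = image_hwc[y][x]
            for c in range(channels):
                # Output channel c reads the mirrored input channel when swapping.
                src_c = channels - 1 - c if swap_rb else c
                value = pixel[src_c] / 255.0
                out[c * plane + y * width + x] = (value - mean[c]) / std[c]
    return out


def concat_batch(tensors):
    """Concatenate flat CHW tensors of equal size into one flat NCHW buffer."""
    batch = []
    for tensor in tensors:
        batch.extend(tensor)
    return batch


# A 1x2 BGR image: one pure-blue pixel and one pure-red pixel.
image = [[[255, 0, 0], [0, 0, 255]]]
chw = fused_preprocess(image, mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
batch = concat_batch([chw, chw])
```

In a GPU version the two outer loops disappear and the intermediate images between the three steps are never written to memory, which is where the savings on large inputs would come from.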

wang-xinyu, Nov 09 '22 10:11

PPCls end-to-end test on T4 with TensorRT 8.4. Latency in ms.

Version   Model        FP32   FP16   INT8
0.3.0     PP-LCNetv2   2.23   1.87   1.78
This PR   PP-LCNetv2   1.54   1.03   0.89

PPCls end-to-end latency is reduced by about 31% (FP32), 45% (FP16) and 50% (INT8).
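For reference, the relative improvement per precision can be recomputed from the table with a few lines of Python. This is only a sketch of the arithmetic; the helper name `latency_reduction_pct` is made up for illustration.

```python
def latency_reduction_pct(before_ms, after_ms):
    """Percentage by which latency dropped relative to the baseline."""
    return (before_ms - after_ms) / before_ms * 100.0


# (0.3.0 latency, this PR latency) in ms, taken from the table above.
results = {
    "FP32": (2.23, 1.54),
    "FP16": (1.87, 1.03),
    "INT8": (1.78, 0.89),
}

for precision, (before, after) in results.items():
    pct = latency_reduction_pct(before, after)
    print(f"{precision}: {pct:.1f}% lower end-to-end latency")
```

The gain grows as inference gets cheaper (FP16, INT8), because preprocessing then makes up a larger share of the end-to-end time.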

wang-xinyu, Nov 10 '22 08:11

PPDet end-to-end test on T4 with TensorRT 8.4. Latency in ms.

Version   Model     FP32
Before    PPYOLOE   35.05
After     PPYOLOE   32.97

Inference alone takes 32.6 ms, so preprocessing drops from 2.45 ms to 0.37 ms, which is about 6.6x faster.
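The preprocessing figures are derived rather than measured directly, so the arithmetic is worth spelling out. The sketch below assumes that end-to-end latency is preprocessing plus inference and that the 32.6 ms inference time is the same before and after; the helper name is hypothetical.

```python
def preprocess_ms(end_to_end_ms, inference_ms):
    """Preprocessing time, assuming end-to-end = preprocessing + inference."""
    return end_to_end_ms - inference_ms


INFERENCE_MS = 32.6  # assumed unchanged by this PR

before = preprocess_ms(35.05, INFERENCE_MS)
after = preprocess_ms(32.97, INFERENCE_MS)
speedup = before / after

print(f"preprocessing: {before:.2f} ms -> {after:.2f} ms ({speedup:.1f}x faster)")
```

End to end, the same change is only about a 6% improvement (35.05 ms to 32.97 ms), since inference dominates for this model.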

wang-xinyu, Nov 10 '22 08:11

PPSeg UNet preprocessing test on T4 with the Paddle Inference (PPInfer) GPU backend. Latency in ms.

Input size   Preprocess before   Preprocess after
2048x1024    33.921              1.405
320x160      0.228               0.194

Preprocessing is 1.18x to 24.14x faster, depending on input size.
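A short sketch that tabulates the speedup next to the pixel count of each input shows how the gain scales with image size. The row values come from the table above; the variable and function names are illustrative only.

```python
def pixel_count(size):
    """Parse a 'WIDTHxHEIGHT' string and return the number of pixels."""
    width, height = (int(part) for part in size.split("x"))
    return width * height


# (input size, preprocess latency before in ms, after in ms)
rows = [
    ("2048x1024", 33.921, 1.405),
    ("320x160", 0.228, 0.194),
]

speedups = {size: before / after for size, before, after in rows}

for size, before, after in rows:
    print(f"{size}: {pixel_count(size)} pixels, {speedups[size]:.2f}x faster")
```

The large input has about 41 times as many pixels as the small one, so fixed per-call overhead matters less and the per-pixel savings dominate.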

wang-xinyu, Nov 10 '22 10:11