FastDeploy [Backend] cuda normalize and permute, cuda concat, optimized ppcls, ppdet & ppseg

PR types

Backend

Describe

CUDA kernel for color conversion + normalize + permute.
Concat now supports GPU.
Applied these optimizations to PPClas, PPDet and PPSeg.
Tested on PPLCNetV2, PPYOLOE and UNet.

Nov 09 '22 10:11 wang-xinyu
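For readers who want to see what a combined color convert + normalize + permute step computes, here is a minimal CPU reference in C++. It is a sketch under stated assumptions, not the kernel from this PR: the function name `FusedNormalizePermute` is hypothetical, the input is assumed to be 8-bit interleaved BGR (the OpenCV convention), and normalization is assumed to be `(x / 255 - mean) / std` with per-channel values in RGB order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference for a combined "color convert + normalize + permute" step.
// Hypothetical helper for illustration; not part of FastDeploy.
// Input:  8-bit interleaved BGR image in HWC layout (height * width * 3 bytes).
// Output: float planar RGB image in CHW layout (3 * height * width floats).
// mean and stddev are per-channel values in RGB order, applied after
// scaling each pixel value to [0, 1] (an assumed convention).
std::vector<float> FusedNormalizePermute(const std::vector<std::uint8_t>& bgr_hwc,
                                         int height, int width,
                                         const float mean[3],
                                         const float stddev[3]) {
  const std::size_t plane = static_cast<std::size_t>(height) * width;
  assert(bgr_hwc.size() == plane * 3);
  std::vector<float> rgb_chw(3 * plane);
  for (std::size_t pixel = 0; pixel < plane; ++pixel) {
    for (int c = 0; c < 3; ++c) {
      // Output channel c (RGB order) reads input channel 2 - c (BGR order).
      const float value = bgr_hwc[pixel * 3 + (2 - c)] / 255.0f;
      // Write into plane c of the CHW output: this is the permute step.
      rgb_chw[c * plane + pixel] = (value - mean[c]) / stddev[c];
    }
  }
  return rgb_chw;
}
```

Doing all three steps in one pass means each pixel is read once and written once. A GPU version would typically assign one thread per pixel in place of the outer loop, but that mapping is an assumption here, not a description of the PR's kernel.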

PPClas end-to-end test on T4 with TRT 8.4. Latency in ms.

Version	Model	FP32	FP16	INT8
0.3.0	PP-LCNetv2	2.23	1.87	1.78
This PR	PP-LCNetv2	1.54	1.03	0.89

PPClas end-to-end latency drops by ~31% (FP32), ~45% (FP16) and ~50% (INT8).

Nov 10 '22 08:11 wang-xinyu

PPDet end-to-end test on T4 with TRT 8.4. Latency in ms.

Version	Model	FP32
Before	PPYOLOE	35.05
After	PPYOLOE	32.97

Inference alone takes 32.6 ms, so preprocessing drops from 2.45 ms (35.05 - 32.6) to 0.37 ms (32.97 - 32.6), about 6.6x faster.

Nov 10 '22 08:11 wang-xinyu
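The preprocessing figures in the PPDet comment are derived, not measured directly. A small sketch of that arithmetic follows, assuming (as the comment does) that everything outside the 32.6 ms inference time is preprocessing; the helper names are hypothetical.

```cpp
// Latency left over after subtracting the inference-only time.
// Assumes the remainder is all preprocessing, as the comment above does.
double PreprocessLatencyMs(double end_to_end_ms, double inference_ms) {
  return end_to_end_ms - inference_ms;
}

// Ratio of the old latency to the new latency.
double Speedup(double before_ms, double after_ms) {
  return before_ms / after_ms;
}
```

With the table values, `PreprocessLatencyMs(35.05, 32.6)` gives about 2.45 ms, `PreprocessLatencyMs(32.97, 32.6)` gives about 0.37 ms, and `Speedup` of the two gives about 6.6.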

PPSeg UNet PPInfer GPU backend, T4.

Latency in ms.

Input size	Preprocess latency before	Preprocess latency after
2048x1024	33.921	1.405
320x160	0.228	0.194

Preprocessing is 1.18x to 24.14x faster, depending on input size.

Nov 10 '22 10:11 wang-xinyu

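Also, if the gpu_id set for preprocessing does not match the runtime's gpu_id, we still need to handle what happens once ORT, Paddle, and TensorRT receive the input.

@jiangjiajun @heliqi Should we add a device id check and cudaMemcpyPeerAsync() in the three backends? Or should we Assert when the device ids differ? If we copy to a different device without telling the user, it may hurt performance.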

Nov 14 '22 02:11 wang-xinyu

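> Also, if the gpu_id set for preprocessing does not match the runtime's gpu_id, we still need to handle what happens once ORT, Paddle, and TensorRT receive the input.
>
> @jiangjiajun @heliqi Should we add a device id check and cudaMemcpyPeerAsync() in the three backends? Or should we Assert when the device ids differ? If we copy to a different device without telling the user, it may hurt performance.

I suggest asserting for now, or at least emitting a warning.

1. For normal use of a simple interface like Model, we do not recommend that users put preprocessing and the runtime on different cards.
2. This needs handling in the Model (e.g. yolov5): if the user sets RuntimeOption's device_id but does not configure preprocessing, preprocessing should default to the same device as the runtime. Some users only know about the runtime setting and do not realize preprocessing also has to be configured, so unless preprocessing is configured explicitly it follows the runtime.
3. If the user configures both preprocessing and the runtime, there are two options, Assert and Warning:
   a. The user has to copy the FDTensor to the same card themselves (this PR can Assert for now; each backend handles this differently, so support can come in the next PR).
   b. If our backend supports it, emit a warning.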

Nov 14 '22 02:11 heliqi
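The device-selection rules proposed in the comment above can be sketched as a small model-layer helper. This is an illustration only: `DeviceConfig`, `MismatchPolicy`, and `ResolvePreprocessDevice` are hypothetical names, not FastDeploy APIs, and an exception stands in for the Assert option. It needs C++17 for `std::optional`.

```cpp
#include <iostream>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for illustration; these are not FastDeploy types.
struct DeviceConfig {
  int runtime_device_id = 0;                // device chosen for inference
  std::optional<int> preprocess_device_id;  // empty: user did not configure it
};

enum class MismatchPolicy { kFail, kWarn };

// Decide which device preprocessing should use.
// - Not configured: follow the runtime device.
// - Configured and different from the runtime: fail or warn, per policy.
int ResolvePreprocessDevice(const DeviceConfig& config, MismatchPolicy policy) {
  if (!config.preprocess_device_id.has_value()) {
    return config.runtime_device_id;
  }
  const int preprocess_id = *config.preprocess_device_id;
  if (preprocess_id != config.runtime_device_id) {
    const std::string message =
        "preprocessing device " + std::to_string(preprocess_id) +
        " differs from runtime device " +
        std::to_string(config.runtime_device_id);
    if (policy == MismatchPolicy::kFail) {
      throw std::invalid_argument(message);
    }
    // Warn only; the caller or backend must then handle the cross-device copy.
    std::cerr << "Warning: " << message << "\n";
  }
  return preprocess_id;
}
```

Keeping this decision in one function at the model layer leaves the preprocessor itself free of runtime knowledge; the actual cross-device copy for the warning case is left out of the sketch because it depends on the backend.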

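> This needs handling in the Model (e.g. yolov5): if the user sets RuntimeOption's device_id but does not configure preprocessing, preprocessing should default to the same device as the runtime. Some users only know about the runtime setting and do not realize preprocessing also has to be configured, so unless preprocessing is configured explicitly it follows the runtime.

@heliqi This logic is hard to implement, because the device id is set through PaddleClasPreprocessor::UseGpu(int gpu_id=0) and the Preprocessor does not use the GPU by default. The device can only be set through the gpu id that is passed in. Alternatively, we could drop the default value for gpu id and make users specify it explicitly.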

Nov 14 '22 02:11 wang-xinyu

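> > This needs handling in the Model (e.g. yolov5): if the user sets RuntimeOption's device_id but does not configure preprocessing, preprocessing should default to the same device as the runtime. Some users only know about the runtime setting and do not realize preprocessing also has to be configured, so unless preprocessing is configured explicitly it follows the runtime.
>
> @heliqi This logic is hard to implement, because the device id is set through PaddleClasPreprocessor::UseGpu(int gpu_id=0) and the Preprocessor does not use the GPU by default. The device can only be set through the gpu id that is passed in. Alternatively, we could drop the default value for gpu id and make users specify it explicitly.

This logic should not live in the preprocessor. The preprocessor should only handle setting and processing; the related decision belongs at our Model layer. For example, put it in the PaddleClasModel constructor, not in PaddleClasPreprocessor.

Clicked the wrong button and accidentally closed it..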

Nov 14 '22 03:11 heliqi
