
Dynamic Batch Inference

Open UygarUsta99 opened this issue 2 years ago • 18 comments

I have exported my YOLOv7 model with dynamic axes as a TensorRT engine. It runs fine with a single batch, but I want to be able to run inference on multiple images. How can I achieve this? How should I feed my images to the model.predict method? I have tried model.predict(np.array([im.copy(), im.copy()])) but I get errors. Any help would be much appreciated.

UygarUsta99 avatar Nov 10 '22 08:11 UygarUsta99

Hi @UygarUsta99, we are working on batch inference this week; I think it will be supported next week. Once the feature is ready, we will comment here to let you know. :)

jiangjiajun avatar Nov 10 '22 09:11 jiangjiajun

Temporary solution:

im_list = [im, im, im]
for image in im_list:
    model.predict(image)

or

import numpy as np

im_numpy = np.array([im, im])
for i in range(len(im_numpy)):
    model.predict(im_numpy[i])

heliqi avatar Nov 10 '22 09:11 heliqi

@UygarUsta99 Hi, YOLOv5 batch_predict is now supported.

demo:

im_list = [im, im, im]
results = model.batch_predict(im_list)
result1 = results[0]
result2 = results[1]
result3 = results[2]

wjj19950828 avatar Nov 15 '22 09:11 wjj19950828

Thanks for the update! Great work!

UygarUsta99 avatar Nov 15 '22 14:11 UygarUsta99

@UygarUsta99 Hi, YOLOv7 batch_predict is now supported. Demo:

model = fd.vision.detection.YOLOv7(model_file, runtime_option=runtime_option)
results = model.batch_predict([im1, im2])
result1 = results[0]
result2 = results[1]

wjj19950828 avatar Nov 18 '22 03:11 wjj19950828

I installed fastdeploy-gpu-python==0.7.0 and still get the error "'YOLOv7' object has no attribute 'batch_predict'".

miknyko avatar Nov 18 '22 04:11 miknyko

YOLOv5 batch prediction works, but YOLOv7 does not. I have also installed the latest pip wheel. Also, when can we use YOLOv5 with TensorRT (including batch prediction and dynamic batch prediction)?

UygarUsta99 avatar Nov 18 '22 08:11 UygarUsta99

@miknyko Install the develop (nightly build) version:

pip install fastdeploy-gpu-python==0.0.0 -f https://www.paddlepaddle.org.cn/whl/fastdeploy_nightly_build.html

wjj19950828 avatar Nov 21 '22 06:11 wjj19950828

@UygarUsta99 YOLOv5 batch predict with TensorRT:

import fastdeploy as fd
import cv2

# build option
option = fd.RuntimeOption()
option.use_gpu()
option.use_trt_backend()
# min / opt / max shapes so the TRT engine accepts a dynamic batch size (1 to 8 here)
option.set_trt_input_shape("images", [1, 3, 640, 640],
                           [2, 3, 640, 640], [8, 3, 640, 640])

model = fd.vision.detection.YOLOv5("yolov5s.onnx", runtime_option=option)

im1 = cv2.imread("test1.jpg")
im2 = cv2.imread("test2.jpg")
results = model.batch_predict([im1, im2])
result1 = results[0]
result2 = results[1]

YOLOv7 works the same way as YOLOv5.
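Since the call pattern is identical, a minimal YOLOv7 sketch only swaps the model class (the tensor name "images", the 640x640 shapes, and the file paths are carried over from the example above as assumptions; adjust the min/opt/max batch range to your workload):

import fastdeploy as fd
import cv2

option = fd.RuntimeOption()
option.use_gpu()
option.use_trt_backend()
# Same dynamic-batch range as the YOLOv5 example above
option.set_trt_input_shape("images", [1, 3, 640, 640],
                           [2, 3, 640, 640], [8, 3, 640, 640])

# Only the model class changes compared to YOLOv5
model = fd.vision.detection.YOLOv7("yolov7.onnx", runtime_option=option)
results = model.batch_predict([cv2.imread("test1.jpg"), cv2.imread("test2.jpg")])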

wjj19950828 avatar Nov 21 '22 06:11 wjj19950828


Hello, I tried YOLOv7 batch prediction versus single-image prediction, and the average prediction time per image is basically the same; batch prediction does not seem to improve speed. In your tests, how much speedup does batch prediction give?

TWK2022 avatar Nov 21 '22 10:11 TWK2022

@TWK2022 Please send me your complete test script and the images.

wjj19950828 avatar Nov 21 '22 13:11 wjj19950828


Windows, 4 GB GPU, CUDA 11.6. Model: your official yolov7.onnx. Image: your official kite picture.

[Code]

import cv2
import time
import argparse
import fastdeploy

parser = argparse.ArgumentParser()
parser.add_argument('--path_model', default='yolov7.onnx', type=str)
parser.add_argument('--path_image', default='image/001.jpg', type=str)
parser.add_argument('--device', default='gpu', type=str)
parser.add_argument('--inference', default='trt', type=str)
args = parser.parse_args()
args.n = int(input('Number of test rounds: '))
args.batch = int(input('Batch size per round: '))

runtime_option = fastdeploy.RuntimeOption()
if args.device in ['gpu', 'cuda']:
    runtime_option.use_gpu()
else:
    runtime_option.use_cpu()
if args.inference in ['trt', 'tensorrt']:
    runtime_option.use_trt_backend()
    runtime_option.set_trt_input_shape("images", [args.batch, 3, 640, 640])
else:
    runtime_option.use_ort_backend()

print('| Using {} | Loading model... |'.format(args.device))
model = fastdeploy.vision.detection.YOLOv7(args.path_model, runtime_option=runtime_option)  # this step is slow with TRT
print('| Model loaded! |')

image = cv2.imread(args.path_image)
image_list = [image for _ in range(args.batch)]
start_time = time.time()
for i in range(args.n):
    pred = model.batch_predict(image_list)
end_time = time.time()
print('| Rounds: {} | Batch: {} | Average time per image: {:.3f} |'.format(
    args.n, args.batch, (end_time - start_time) / args.n / args.batch))
input('>>> Press Enter to exit (you can check GPU memory usage now) <<<')

[Test results] I tested multiple times; occasionally batch prediction is slightly faster.

(four screenshots of the test output)

TWK2022 avatar Nov 22 '22 03:11 TWK2022

@TWK2022 My test results are as follows:

Environment

CPU: Intel(R) Xeon(R) Gold 6271C @ 2.60GHz, GPU: T4, CUDA: 11.7, cuDNN: 8.4, Python: 3.7, TensorRT: 8.4.3.1

Results

Input size      Runtime (ms)  Average Runtime (ms)  End2End (ms)  Average End2End (ms)
1x3x640x640     27.03         27.03                 31.41         31.41
4x3x640x640     104.98        26.24                 127.05        31.76
8x3x640x640     215.35        26.92                 261.61        32.70
16x3x640x640    451.92        28.24                 541.85        33.86
32x3x640x640    904.88        28.28                 1086.27       33.95

Looking at the Average Runtime column, the TRT backend's total latency grows linearly with batch size for this model, so the per-image time stays flat. Whether batching helps depends on the model size and the input image size; this is a limitation of TRT.

I also tested the smaller models YOLOv5s and MobileNetV1. As the results below show, larger batch sizes are still a clear win for small models.

YOLOv5s

Input size      Runtime (ms)  Average Runtime (ms)  End2End (ms)  Average End2End (ms)
1x3x320x320     2.95          2.95                  4.20          4.20
4x3x320x320     6.95          1.74                  12.26         3.07
8x3x320x320     13.28         1.66                  24.57         3.07
16x3x320x320    26.52         1.66                  50.76         3.17
32x3x320x320    52.14         1.63                  122.50        3.83

MobileNetV1

Input size      Runtime (ms)  Average Runtime (ms)  End2End (ms)  Average End2End (ms)
1x3x224x224     0.49          0.49                  1.06          1.06
4x3x224x224     0.81          0.20                  2.91          0.73
8x3x224x224     1.34          0.17                  5.55          0.69
16x3x224x224    2.48          0.15                  10.80         0.67
32x3x224x224    5.80          0.18                  25.12         0.78

So with TRT, per-image latency at a larger batch size is not necessarily better than at batch size 1, but on the serving side it does let you process the same number of images with fewer requests.

wjj19950828 avatar Nov 24 '22 06:11 wjj19950828

It mainly comes down to the GPU card; watch GPU utilization during the test. If utilization is already high at batch=1, a larger batch will not bring a speedup because the GPU cannot keep up. If utilization is fairly low, switching to a larger batch gives a noticeable performance gain.
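One rough way to check this (a sketch that shells out to the standard nvidia-smi CLI, assumed to be on the PATH) is to poll utilization from Python while the benchmark loop runs:

import subprocess

def gpu_utilization(gpu_id=0):
    """Return the current GPU utilization in percent via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"])
    return int(out.decode().strip().splitlines()[gpu_id])

# Call this (e.g. from another thread) while batch_predict is running:
print("GPU utilization: {}%".format(gpu_utilization()))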

heliqi avatar Nov 24 '22 07:11 heliqi


Thank you very much!!

TWK2022 avatar Nov 25 '22 03:11 TWK2022

How can I set a confidence threshold for YOLOv5's model_ocr.BatchPredict(bp_plate, &res_ocr)? Also, batch processing performance seems fine with YOLOv5, but I have not tested it thoroughly.

UygarUsta99 avatar Nov 30 '22 08:11 UygarUsta99

You can filter the results yourself. If you mean how to set a confidence threshold while visualizing the results, refer to the API definition; there is a parameter score_threshold.

Python

https://github.com/PaddlePaddle/FastDeploy/blob/d13a55b4e06ce72b80dec7ed2dd98b12ff2f67b7/python/fastdeploy/vision/visualize/init.py#L21-L26

C++

https://github.com/PaddlePaddle/FastDeploy/blob/d13a55b4e06ce72b80dec7ed2dd98b12ff2f67b7/fastdeploy/vision/visualize/visualize.h#L55-L58
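For example, manual filtering in Python could look like the sketch below (the 0.5 threshold is arbitrary, and `model`/`im` are assumed to come from the earlier examples):

import fastdeploy as fd

result = model.predict(im)  # a DetectionResult

# Keep only detections whose score clears the threshold
keep = [i for i, score in enumerate(result.scores) if score >= 0.5]
boxes = [result.boxes[i] for i in keep]
label_ids = [result.label_ids[i] for i in keep]

# Or filter only at visualization time via score_threshold
vis_im = fd.vision.vis_detection(im, result, score_threshold=0.5)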

jiangjiajun avatar Nov 30 '22 11:11 jiangjiajun


Thanks!

UygarUsta99 avatar Nov 30 '22 13:11 UygarUsta99

Hi, I want dynamic batch inference for a ResNet model, but ResNet does not seem to support it, so I tried to change the code following the YOLOv5 example. Can you give a guide on how to enable dynamic batch inference for any model?

clveryang avatar Jun 13 '23 07:06 clveryang
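For generic models such as ResNet, one possible direction (a sketch under assumptions, not an official FastDeploy recipe: the ONNX input tensor is assumed to be named "input", and the normalization constants are the usual ImageNet values) is to bypass the vision wrapper and run a pre-batched NCHW tensor through fd.Runtime directly:

import fastdeploy as fd
import numpy as np
import cv2

option = fd.RuntimeOption()
option.set_model_path("resnet50.onnx", model_format=fd.ModelFormat.onnx)
option.use_gpu()
option.use_trt_backend()
# Dynamic batch range 1..16 for the (assumed) input tensor "input"
option.set_trt_input_shape("input", [1, 3, 224, 224],
                           [8, 3, 224, 224], [16, 3, 224, 224])
runtime = fd.Runtime(option)

def preprocess(path):
    # Resize, BGR->RGB, scale to [0,1], ImageNet-normalize, HWC->CHW
    im = cv2.resize(cv2.imread(path), (224, 224))[:, :, ::-1] / 255.0
    im = (im - [0.485, 0.456, 0.406]) / [0.229, 0.224, 0.225]
    return im.transpose(2, 0, 1).astype("float32")

batch = np.stack([preprocess("a.jpg"), preprocess("b.jpg")])  # (2, 3, 224, 224)
outputs = runtime.infer({"input": batch})
print(outputs[0].shape)  # (2, num_classes)

The trade-off of going through fd.Runtime is that preprocessing and postprocessing have to be done by hand instead of being handled by the model wrapper.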