PaddleDetection infer.py: single-image inference takes longer with GPU than without?

Search before asking

[X] I have searched the issues and found no similar bug report.

Bug Component

No response

Describe the Bug

On a V100 machine, I ran inference with PaddleDetection/deploy/python/infer.py and found that running with --device GPU is actually slower than running without it.

Test procedure:

Without GPU:

python3 /app/PaddleDetection/deploy/python/infer.py --model_dir=/opt/ml/model/layout --image_file=test_data//mask_0e765753e645a104c0bbea1f4e739317.jpeg

Output:

grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
-----------  Running Arguments -----------
action_file: None
batch_size: 1
camera_id: -1
collect_trt_shape_info: False
combine_method: nms
cpu_threads: 1
device: cpu
enable_mkldnn: False
enable_mkldnn_bfloat16: False
image_dir: None
image_file: test_data//mask_0e765753e645a104c0bbea1f4e739317.jpeg
match_metric: ios
match_threshold: 0.6
model_dir: /opt/ml/model/layout
output_dir: output
overlap_ratio: [0.25, 0.25]
random_pad: False
reid_batch_size: 50
reid_model_dir: None
run_benchmark: False
run_mode: paddle
save_images: True
save_mot_txt_per_img: False
save_mot_txts: False
save_results: False
scaled: False
slice_infer: False
slice_size: [640, 640]
threshold: 0.5
tracker_config: None
trt_calib_mode: False
trt_max_shape: 1280
trt_min_shape: 1
trt_opt_shape: 640
tuned_trt_shape_file: shape_range_info.pbtxt
use_coco_category: False
use_dark: True
use_fd_format: False
use_gpu: False
video_file: None
window_size: 50
------------------------------------------
-----------  Model Configuration -----------
Model Arch: GFL
Transform Order: 
--transform op: Resize
--transform op: NormalizeImage
--transform op: Permute
--transform op: PadStride
--------------------------------------------
loaded detector cost 0.5786027908325195s
class_id:3, confidence:0.6529, left_top:[31.77,336.25],right_bottom:[732.56,1074.07]
class_id:4, confidence:0.6105, left_top:[520.25,751.66],right_bottom:[748.16,896.60]
save result to: output/mask_0e765753e645a104c0bbea1f4e739317.jpeg
Test iter 0
predict  cost 0.35836076736450195s
------------------ Inference Time Info ----------------------
total_time(ms): 335.2, img_num: 1
average latency time(ms): 335.20, QPS: 2.983294
preprocess_time(ms): 59.50, inference_time(ms): 275.70, postprocess_time(ms): 0.00

With GPU:

python3 /app/PaddleDetection/deploy/python/infer.py --model_dir=/opt/ml/model/layout --image_file=test_data//mask_0e765753e645a104c0bbea1f4e739317.jpeg --device=GPU

Output:

grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
-----------  Running Arguments -----------
action_file: None
batch_size: 1
camera_id: -1
collect_trt_shape_info: False
combine_method: nms
cpu_threads: 1
device: GPU
enable_mkldnn: False
enable_mkldnn_bfloat16: False
image_dir: None
image_file: test_data//mask_0e765753e645a104c0bbea1f4e739317.jpeg
match_metric: ios
match_threshold: 0.6
model_dir: /opt/ml/model/layout
output_dir: output
overlap_ratio: [0.25, 0.25]
random_pad: False
reid_batch_size: 50
reid_model_dir: None
run_benchmark: False
run_mode: paddle
save_images: True
save_mot_txt_per_img: False
save_mot_txts: False
save_results: False
scaled: False
slice_infer: False
slice_size: [640, 640]
threshold: 0.5
tracker_config: None
trt_calib_mode: False
trt_max_shape: 1280
trt_min_shape: 1
trt_opt_shape: 640
tuned_trt_shape_file: shape_range_info.pbtxt
use_coco_category: False
use_dark: True
use_fd_format: False
use_gpu: False
video_file: None
window_size: 50
------------------------------------------
-----------  Model Configuration -----------
Model Arch: GFL
Transform Order: 
--transform op: Resize
--transform op: NormalizeImage
--transform op: Permute
--transform op: PadStride
--------------------------------------------

loaded detector cost 2.731602668762207s
class_id:3, confidence:0.6529, left_top:[31.77,336.25],right_bottom:[732.56,1074.07]
class_id:4, confidence:0.6105, left_top:[520.25,751.66],right_bottom:[748.16,896.60]
save result to: output/mask_0e765753e645a104c0bbea1f4e739317.jpeg
Test iter 0
predict  cost 0.9343917369842529s
------------------ Inference Time Info ----------------------
total_time(ms): 916.1, img_num: 1
average latency time(ms): 916.10, QPS: 1.091584
preprocess_time(ms): 56.60, inference_time(ms): 859.50, postprocess_time(ms): 0.00

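Only one run of each is shown above, but I actually ran each configuration 10 times, and the CPU was consistently faster than the GPU.

I also added timing output in infer.py around the Detector loading code and around the inference call detector.predict_image(); these correspond to the following two log lines: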

loaded detector cost 2.731602668762207s
predict  cost 0.9343917369842529s

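Comparing the logs above, both loading and inference are slower on GPU than on CPU? Is this because, for single-image inference, the bottleneck is loading the model and data onto the GPU, which makes it slower overall?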
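One plausible reading of these numbers: each run above is a fresh process that performs a single iteration ("Test iter 0"), so any one-time setup cost paid on the first GPU call would be included in every measurement, including all 10 repetitions. The sketch below illustrates that effect with a simulated predictor. `LazyPredictor` and its sleep durations are hypothetical stand-ins, not PaddleDetection code, and the sketch does not prove that this is what happens in infer.py.

```python
import time


class LazyPredictor:
    """Hypothetical stand-in for a predictor with a one-time setup cost."""

    def __init__(self, init_cost_s=0.2, step_cost_s=0.005):
        self.init_cost_s = init_cost_s
        self.step_cost_s = step_cost_s
        self._ready = False

    def predict(self):
        if not self._ready:
            # Simulates one-time work done lazily on the first call.
            time.sleep(self.init_cost_s)
            self._ready = True
        # Simulates the steady-state cost of one inference.
        time.sleep(self.step_cost_s)


def time_calls(predictor, n_calls):
    """Return the wall-clock latency of each call, in seconds."""
    latencies = []
    for _ in range(n_calls):
        start = time.perf_counter()
        predictor.predict()
        latencies.append(time.perf_counter() - start)
    return latencies


latencies = time_calls(LazyPredictor(), n_calls=5)
first_call, later_calls = latencies[0], latencies[1:]
print(f"first call: {first_call * 1000:.1f} ms")
print(f"slowest later call: {max(later_calls) * 1000:.1f} ms")
```

If the real detector behaves similarly, timing only the first call in a process measures mostly setup rather than steady-state inference speed.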

Environment

os: Ubuntu 22.04 PaddleDetection: release/2.7 Paddle Python library: 2.6.0

Bug description confirmation

[X] I confirm that the bug replication steps, code change instructions, and environment information have been provided, and the problem can be reproduced.

Are you willing to submit a PR?

[x] I'd like to help by submitting a PR!

Aug 12 '24 09:08 JianyuZhan

Hello, which model is this?

Aug 13 '24 02:08 cuicheng01

"Hello, which model is this?"

Hello, I'm using a model that I trained and exported by following this document; it is based on picodet_lcnet_x1_0_layout.

Aug 13 '24 04:08 JianyuZhan

I'd suggest running more loop iterations in the test.
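A minimal sketch of what looping more iterations could look like: a small helper that discards a few warm-up calls and then reports latency statistics over many timed calls within the same process. The `benchmark` helper and its default counts are assumptions for illustration, not part of PaddleDetection.

```python
import statistics
import time


def benchmark(fn, warmup=10, repeats=100):
    """Time fn() after discarding warm-up calls; return latency stats in ms."""
    for _ in range(warmup):
        fn()  # warm-up calls run but are intentionally not timed
    samples_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "max_ms": max(samples_ms),
        "n": len(samples_ms),
    }


# Self-contained demo with a placeholder workload.
calls = []
stats = benchmark(lambda: calls.append(sum(range(1000))), warmup=3, repeats=20)
print(stats)

# Hypothetical use inside infer.py, assuming `detector` and `img_list` exist there:
# stats = benchmark(lambda: detector.predict_image(img_list))
```

Comparing the median from such a loop for CPU and GPU, rather than a single first call, would show whether the GPU is still slower once warmed up.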

Aug 13 '24 11:08 cuicheng01

The issue has had no response for a long time and will be closed. You can reopen it or open a new issue if you are still confused.

From Bot

Aug 22 '24 02:08 TingquanGao

At least in my case, this reproduces reliably. So I have turned off GPU inference for now, and it is noticeably faster.

Aug 22 '24 03:08 JianyuZhan

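Actually, running inference this way is not recommended. If you really want fast inference, consider an acceleration option such as TRT.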
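As a rough sketch, the command from the report could be rebuilt with the `run_mode` and `run_benchmark` arguments that appear in the "Running Arguments" dump. The value `trt_fp16` is an assumption about what `--run_mode` accepts and would need a Paddle build with TensorRT support; `build_infer_command` and the image path `test_data/example.jpeg` are hypothetical.

```python
import shlex


def build_infer_command(model_dir, image_file, device="GPU",
                        run_mode="paddle", run_benchmark=False):
    """Build an infer.py command line as a list of arguments."""
    return [
        "python3", "deploy/python/infer.py",
        f"--model_dir={model_dir}",
        f"--image_file={image_file}",
        f"--device={device}",
        # Assumed value names; check the script's --help before relying on them.
        f"--run_mode={run_mode}",
        f"--run_benchmark={run_benchmark}",
    ]


cmd = build_infer_command(
    model_dir="/opt/ml/model/layout",
    image_file="test_data/example.jpeg",  # hypothetical path
    run_mode="trt_fp16",
    run_benchmark=True,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the PaddleDetection root directory.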

Aug 27 '24 02:08 cuicheng01

The issue has had no response for a long time and will be closed. You can reopen it or open a new issue if you are still confused.

From Bot

Sep 27 '24 03:09 TingquanGao