PaddleDetection: performance drops when processing multiple video streams

Search before asking

[x] I have searched the question and found no related answer.

Please ask your question

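Single stream 5
	read()
		Total frame acquisition time 203
	Model processing time 248; total message interval time 468.40125700000056 (count 2500, max 0.264324, min 0.115836), average 0.187
Two streams 4
	Total frame acquisition time 203; model processing time 406; total message interval time 620 (count 2500, max 0.527268, min 0.106337), average 0.248
	Total frame acquisition time 214; model processing time 435; total message interval time 673 (count 2500, max 0.8767, min 0.113935), average 0.269
Three streams 3
	Total frame acquisition time 287; model processing time 624; total message interval time 913 (count 2500, max 0.685565, min 0.143708), average 0.365
	Total frame acquisition time 293; model processing time 587; total message interval time 896 (count 2500, max 0.713711, min 0.129972), average 0.358
	Total frame acquisition time 238; model processing time 726; total message interval time 995 (count 2500, max 1.004392, min 0.114476), average 0.398

Why does model processing get slower and slower as the number of streams increases? The machine has an 8-core CPU and a single GPU.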
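
As a rough cross-check of the figures above, the per-stream averages can be converted to frames per second and summed for each run. This is only a sketch: it assumes the interval values are seconds per frame (each total divided by 2500 matches the reported average), and the names `runs` and `aggregate_fps` are made up for illustration.

```python
# Average message interval per stream, copied from the measurements above.
# Assumption: the values are seconds per frame.
runs = {
    "1 stream": [0.187],
    "2 streams": [0.248, 0.269],
    "3 streams": [0.365, 0.358, 0.398],
}


def aggregate_fps(intervals):
    """Sum of per-stream frame rates, where each rate is 1 / average interval."""
    return sum(1.0 / t for t in intervals)


for name, intervals in runs.items():
    per_stream = [round(1.0 / t, 2) for t in intervals]
    total = aggregate_fps(intervals)
    print(f"{name}: per-stream fps {per_stream}, total {total:.2f} fps")
```

Read this way, each stream slows down while the combined frame rate still rises slightly and then flattens. That pattern would be consistent with the streams sharing one GPU, although these numbers alone do not show where the bottleneck is.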

Apr 25 '25 03:04 time-heart

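Could you provide a minimal reproducible example? The specific implementation may also matter. For example, if only one CUDA stream is used on a GPU, all operations are executed serially, even when multiple threads run inference at the same time.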
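
The serialization effect can be illustrated with a CPU-only analogy. The sketch below is not CUDA or Paddle code: a `threading.Lock` stands in for a CUDA stream and `time.sleep` stands in for one inference call, both purely as assumptions for illustration.

```python
import threading
import time

WORK_SECONDS = 0.05  # stand-in for the duration of one inference call


def run_workers(locks):
    """Start one thread per lock; each thread holds its lock while it 'infers'."""

    def worker(lock):
        with lock:
            time.sleep(WORK_SECONDS)

    threads = [threading.Thread(target=worker, args=(lock,)) for lock in locks]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


n = 4
shared = threading.Lock()
# Every thread queues on the same lock, like threads sharing one stream.
serial_time = run_workers([shared] * n)
# Each thread has its own lock, like one stream per predictor.
parallel_time = run_workers([threading.Lock() for _ in range(n)])
print(f"shared: {serial_time:.2f}s, separate: {parallel_time:.2f}s")
```

With a shared lock the elapsed time grows with the number of threads, while separate locks let the waits overlap. Real GPU work also competes for compute units and memory bandwidth, so separate streams would not be expected to scale this cleanly.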

Apr 25 '25 03:04 Bobholamovic

python pipeline/pipeline_c2.py --config pipeline/config/infer_cfg_2.yml --device=gpu --do_break_in_counting --region_type=custom --illegal_parking_time=1 --video_file=resource/video-10min/r5.mp4 --run_mode trt_int8 --trt_calib_mode True
python pipeline/pipeline_c2.py --config pipeline/config/infer_cfg_2.yml --device=gpu --do_break_in_counting --region_type=custom --illegal_parking_time=1 --video_file=resource/video-10min/r4.mp4 --run_mode trt_int8 --trt_calib_mode True

Previously I used multiple threads and it took longer. After switching to two processes the time dropped noticeably, but as the number of streams grows, the model processing time still goes up.
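
For reference, the one-process-per-stream setup can also be started from a small launcher instead of separate shells. This is a minimal sketch that reuses the flags from the commands above; `build_command` and `launch_all` are hypothetical helper names, and the script path only resolves inside the PaddleDetection checkout.

```python
import subprocess
import sys

# Flags copied from the commands above; only the video file differs per stream.
COMMON_ARGS = [
    "pipeline/pipeline_c2.py",
    "--config", "pipeline/config/infer_cfg_2.yml",
    "--device=gpu",
    "--do_break_in_counting",
    "--region_type=custom",
    "--illegal_parking_time=1",
    "--run_mode", "trt_int8",
    "--trt_calib_mode", "True",
]


def build_command(video_file):
    """Return the argument list for one stream."""
    return [sys.executable, *COMMON_ARGS, f"--video_file={video_file}"]


def launch_all(video_files):
    """Start one process per video and wait for all of them to finish."""
    procs = [subprocess.Popen(build_command(v)) for v in video_files]
    return [p.wait() for p in procs]


videos = ["resource/video-10min/r5.mp4", "resource/video-10min/r4.mp4"]
for v in videos:
    print(" ".join(build_command(v)))
# launch_all(videos)  # run this from the directory that contains pipeline/
```

Each process still builds its own predictor and shares the same GPU, so this changes how the streams are started, not how much GPU time each one gets.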

Apr 25 '25 03:04 time-heart

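With multiple threads, each predictor needs its own CUDA stream to get any speedup. For the multi-process case, you wrote "after switching to two processes the time dropped noticeably, but as the number of streams grows, the model processing time still goes up". Which timing exactly went down, and which went up? Also, judging from the example, did you start two separate Python interpreters that run the same script at the same time to get multiple processes?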

Apr 25 '25 06:04 Bobholamovic

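Yes, they run the same script at the same time. Why does this Python process reach 200% or even 300% CPU usage when I process only one video stream? My script does not start any other threads; there is only the main thread.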

Apr 25 '25 06:04 time-heart

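Some low-level image processing libraries (for example OpenCV) and inference libraries (for example Paddle) may use multiple threads internally for acceleration, which makes full use of a multi-core CPU. If you want to disable multithreading, you can configure the underlying libraries accordingly, although this may reduce inference speed.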
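
One way to try this is to cap the native thread pools at the top of the script. The sketch below uses conventional environment variables and OpenCV's `cv2.setNumThreads`; whether a particular Paddle or OpenCV build on Jetson honors each setting is an assumption to verify locally.

```python
import os

# Cap common native thread pools before the heavy libraries are imported.
# Assumption: the installed builds read these conventional variables.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = "1"

try:
    import cv2
except ImportError:
    cv2 = None  # OpenCV is not installed, so there is nothing to configure

if cv2 is not None:
    cv2.setNumThreads(1)  # limit OpenCV's internal worker threads
    print("OpenCV threads:", cv2.getNumThreads())
```

Comparing per-frame timings before and after the change shows whether the lower CPU usage is worth any slowdown.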

Apr 25 '25 06:04 Bobholamovic

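Processing a single video stream involves 17 threads. If that is the case, the more streams there are, the longer the whole detection and tracking step takes. Is there any way to solve this?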
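
A thread count like this can be checked from inside the process. The sketch below assumes a Linux system with `/proc` mounted (as on Jetson); `native_thread_count` is a hypothetical helper name.

```python
import threading


def native_thread_count(pid="self"):
    """Return the OS-level thread count from /proc, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/status") as status:
            for line in status:
                if line.startswith("Threads:"):
                    return int(line.split()[1])
    except OSError:
        return None
    return None


# Threads created through Python's threading module.
print("Python threads:", threading.active_count())
# All threads of the process, including those created by native libraries.
print("OS threads:", native_thread_count())
```

A gap between the two numbers points to threads created inside native libraries rather than by the script itself.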

Apr 25 '25 07:04 time-heart

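I compiled paddle-gpu locally, and the environment is the full JetPack 6.2 stack. I originally wanted to run an ultra-lightweight PP-YOLOE model, but ppyoloe_plus_crn_t_auxhead_relu_320_300e_coco.yml cannot run with TRT because it reports insufficient memory, so I switched to ppyoloe_plus_crn_s_80e_coco.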

Apr 25 '25 07:04 time-heart

I am not sure which library is responsible. If you are running inference with TRT, I suggest looking at the OpenCV settings; see: https://github.com/opencv/opencv/issues/15277

Apr 25 '25 09:04 Bobholamovic

Why can't this environment run the ultra-lightweight model? It has CUDA 12.6, cuDNN 9.3.0.75-1, and TensorRT 10.3.0.30-1. The error says there is not enough memory, yet ppyoloe_plus_crn_s_80e_coco works.

Apr 25 '25 09:04 time-heart

What exactly is the error message? And what GPU memory usage do you observe?

Apr 25 '25 10:04 Bobholamovic

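The error appears before any processing starts, during predictor creation. It says that 13 GB of memory needs to be allocated and that there is not enough. I don't understand why so much memory has to be allocated up front. It ran fine on Windows with a 3060, but it does not work on the Jetson device, so I can only use ppyoloe_plus_crn_s_80e_coco as a substitute, and that takes about 150 ms per image, including detection and tracking.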

Apr 25 '25 12:04 time-heart

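This may be related to the parameters set when converting the TensorRT model. For example, the dynamic shape configuration can affect how much memory the model uses.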

Apr 25 '25 14:04 Bobholamovic

The same model loads without problems on Windows with a 3060, and it also runs on the Jetson with Paddle.

Apr 26 '25 00:04 time-heart

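The Windows machine most likely has more resources than the Jetson;
TRT applies more optimizations and usually needs more memory than Paddle.

Taken together, the cause I mentioned still seems the most likely. I suggest looking at the TRT optimization parameters.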

Apr 27 '25 03:04 Bobholamovic

What do you mean by "look at the TRT optimization parameters"? I also tried ppyoloe_plus_crn_t_auxhead_320_300e_coco, and compared with ppyoloe_plus_crn_s_80e_coco its detection time under INT8 is not much different.

Apr 27 '25 04:04 time-heart

In principle the ultra-lightweight ppyoloe_plus_crn_t_auxhead_320_300e_coco should run faster than ppyoloe_plus_crn_s_80e_coco, but with 7 video streams processed at the same time it was actually slower. The command is:

python pipeline/pipeline_c2.py --config pipeline/config/infer_cfg_2.yml --device=gpu --do_break_in_counting --region_type=custom --illegal_parking_time=1 --video_file=resource/video-10min/r10.mp4 --run_mode trt_int8 --trt_calib_mode True

No calibration file was generated.

Apr 27 '25 07:04 time-heart

How are the trt_min_shape, trt_opt_shape, and trt_max_shape parameters set?

Apr 27 '25 11:04 Bobholamovic

1, 640, 1280

Apr 27 '25 12:04 time-heart

Are all three set to this value? And is every model configured this way?

Apr 27 '25 15:04 Bobholamovic

if run_mode in precision_map.keys():
    # Enable the TensorRT subgraph engine for the selected precision.
    config.enable_tensorrt_engine(
        workspace_size=1 << 25,
        max_batch_size=batch_size,
        min_subgraph_size=min_subgraph_size,
        precision_mode=precision_map[run_mode],
        use_static=True,
        use_calib_mode=trt_calib_mode)
    config.set_optim_cache_dir("./tensorrt_cache")
    # The shape range below is applied only when use_dynamic_shape is set.
    if use_dynamic_shape:
        min_input_shape = {
            'image': [batch_size, 3, trt_min_shape, trt_min_shape]
        }
        max_input_shape = {
            'image': [batch_size, 3, trt_max_shape, trt_max_shape]
        }
        opt_input_shape = {
            'image': [batch_size, 3, trt_opt_shape, trt_opt_shape]
        }
        config.set_trt_dynamic_shape_info(min_input_shape, max_input_shape,
                                          opt_input_shape)
        print('trt set dynamic shape done!')

Are the parameters you mentioned the ones in the model's configuration file?

Apr 28 '25 01:04 time-heart

This is the configuration file for the s model:

mode: paddle
draw_threshold: 0.5
metric: COCO
use_dynamic_shape: false
arch: YOLO
min_subgraph_size: 3
Preprocess:
- interp: 2
  keep_ratio: false
  target_size:
  - 640
  - 640
  type: Resize
- mean:
  - 0.0
  - 0.0
  - 0.0
  norm_type: none
  std:
  - 1.0
  - 1.0
  - 1.0
  type: NormalizeImage
- type: Permute

This is the configuration file for the t model:

mode: paddle
draw_threshold: 0.5
metric: COCO
use_dynamic_shape: false
arch: PPYOLOE
min_subgraph_size: 3
Preprocess:
- interp: 2
  keep_ratio: false
  target_size:
  - 320
  - 320
  type: Resize
- mean:
  - 0.0
  - 0.0
  - 0.0
  norm_type: none
  std:
  - 1.0
  - 1.0
  - 1.0
  type: NormalizeImage
- type: Permute

The ultra-lightweight model performs worse than the small one, which in theory should not happen. I suspect TRT acceleration is not being used at all, even though the predictor did load successfully.
```
# Fragment from the pipeline: config, run_mode, batch_size and the trt_*
# values are defined earlier in the surrounding function.
from paddle.inference import Config, create_predictor

precision_map = {
    'trt_int8': Config.Precision.Int8,
    'trt_fp32': Config.Precision.Float32,
    'trt_fp16': Config.Precision.Half
}
if run_mode in precision_map.keys():
    config.enable_tensorrt_engine(
        workspace_size=1 << 25,
        max_batch_size=batch_size,
        min_subgraph_size=min_subgraph_size,
        precision_mode=precision_map[run_mode],
        use_static=True,
        use_calib_mode=trt_calib_mode)
    config.set_optim_cache_dir("./tensorrt_cache")
    if use_dynamic_shape:
        min_input_shape = {
            'image': [batch_size, 3, trt_min_shape, trt_min_shape]
        }
        max_input_shape = {
            'image': [batch_size, 3, trt_max_shape, trt_max_shape]
        }
        opt_input_shape = {
            'image': [batch_size, 3, trt_opt_shape, trt_opt_shape]
        }
        config.set_trt_dynamic_shape_info(min_input_shape, max_input_shape,
                                          opt_input_shape)
        print('trt set dynamic shape done!')

config.disable_glog_info()
config.enable_memory_optim()
config.switch_use_feed_fetch_ops(False)
predictor = create_predictor(config)
# "加速成功" means "acceleration succeeded". This line runs unconditionally,
# so it does not by itself confirm that TensorRT is active.
print(f"{run_mode}加速成功")
```

Apr 28 '25 01:04 time-heart

All of them use the defaults: trt_min_shape=1, trt_max_shape=1280, trt_opt_shape=640.
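
To get a feel for the range these defaults describe, the input tensor size at each shape can be worked out directly. The sketch below assumes batch size 1 and float32 input, and it covers only the input tensor, not activations or TensorRT's own workspace; `input_shape` and `tensor_mib` are hypothetical helper names. In the snippet posted in this thread, the shape range is applied only when use_dynamic_shape is true.

```python
BYTES_PER_FLOAT32 = 4


def input_shape(batch_size, side):
    """Shape in the same layout as the snippet: [N, 3, H, W] with H == W."""
    return [batch_size, 3, side, side]


def tensor_mib(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Size of one dense tensor of this shape, in MiB."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element / (1 << 20)


defaults = {"trt_min_shape": 1, "trt_opt_shape": 640, "trt_max_shape": 1280}
for name, side in defaults.items():
    shape = input_shape(1, side)
    print(f"{name}: {shape} -> {tensor_mib(shape):.4f} MiB")

# The workspace_size passed to enable_tensorrt_engine in the snippet.
print("workspace_size = 1 << 25 bytes =", (1 << 25) / (1 << 20), "MiB")
```

The maximum shape is four times the area of the optimum shape and sixteen times the area of a 320x320 input, which gives a sense of how much a wide dynamic range can inflate what the engine has to plan for.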

Apr 28 '25 01:04 time-heart

That does look odd. Could you share the execution logs for the lightweight and the ultra-lightweight model separately?

Apr 28 '25 02:04 Bobholamovic

config.disable_glog_info(): do you mean the logs produced at prediction time?

Apr 28 '25 02:04 time-heart

This is the log from the s model:

--- Running PIR pass [add_shadow_output_after_dead_parameter_pass]
I0428 02:49:58.460706 57741 print_statistics.cc:50] --- detected [8] subgraphs!
--- Running PIR pass [delete_quant_dequant_linear_op_pass]
--- Running PIR pass [delete_weight_dequant_linear_op_pass]
--- Running PIR pass [map_op_to_another_pass]
--- Running PIR pass [identity_op_clean_pass]
--- Running PIR pass [silu_fuse_pass]
--- Running PIR pass [conv2d_bn_fuse_pass]
I0428 02:49:58.474509 57741 print_statistics.cc:50] --- detected [56] subgraphs!
--- Running PIR pass [conv2d_add_act_fuse_pass]
--- Running PIR pass [conv2d_add_fuse_pass]
I0428 02:49:58.521495 57741 print_statistics.cc:50] --- detected [83] subgraphs!
--- Running PIR pass [embedding_eltwise_layernorm_fuse_pass]
--- Running PIR pass [fused_rotary_position_embedding_pass]
--- Running PIR pass [multihead_matmul_fuse_pass]
--- Running PIR pass [matmul_add_act_fuse_pass]
--- Running PIR pass [fc_elementwise_layernorm_fuse_pass]
--- Running PIR pass [add_norm_fuse_pass]
--- Running PIR pass [group_norm_silu_fuse_pass]
--- Running PIR pass [matmul_scale_fuse_pass]
--- Running PIR pass [matmul_transpose_fuse_pass]
--- Running PIR pass [transpose_flatten_concat_fuse_pass]
--- Running PIR pass [remove_redundant_transpose_pass]
--- Running PIR pass [horizontal_fuse_pass]
--- Running PIR pass [transfer_layout_pass]
--- Running PIR pass [common_subexpression_elimination_pass]
I0428 02:49:58.563827 57741 print_statistics.cc:50] --- detected [185] subgraphs!
--- Running PIR pass [params_sync_among_devices_pass]
I0428 02:49:58.624049 57741 print_statistics.cc:50] --- detected [345] subgraphs!
--- Running PIR pass [constant_folding_pass]
I0428 02:49:58.627278 57741 pir_interpreter.cc:1568] New Executor is Running ...
W0428 02:49:58.627472 57741 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.7, Driver API Version: 12.6, Runtime API Version: 12.6
W0428 02:49:58.629119 57741 gpu_resources.cc:164] device: 0, cuDNN Version: 9.3.
I0428 02:49:58.631258 57741 pir_interpreter.cc:1592] pir interpreter is running by multi-thread mode ...
I0428 02:49:59.968214 57741 print_statistics.cc:44] --- detected [526, 1142] subgraphs!
--- Running PIR pass [dead_code_elimination_pass]
I0428 02:49:59.972285 57741 print_statistics.cc:50] --- detected [677] subgraphs!
--- Running PIR pass [replace_fetch_with_shadow_output_pass]
I0428 02:49:59.973526 57741 print_statistics.cc:50] --- detected [2] subgraphs!
--- Running PIR pass [remove_shadow_feed_pass]
I0428 02:49:59.992839 57741 print_statistics.cc:50] --- detected [2] subgraphs!
--- Running PIR pass [inplace_pass]
I0428 02:50:00.029798 57741 print_statistics.cc:50] --- detected [43] subgraphs!
I0428 02:50:00.030298 57741 analysis_predictor.cc:1207] ======= pir optimization completed =======
trt_int8加速成功
[2025-04-28 02:50:00,191] [ INFO] pipeline_c2.py:704 - 0号线程,当前已处理帧0
I0428 02:50:00.274374 57741 pir_interpreter.cc:1589] pir interpreter is running by trace mode ...

This is the log from the t model:

--- Running PIR pass [add_shadow_output_after_dead_parameter_pass]
I0428 02:51:02.915006 60421 print_statistics.cc:50] --- detected [8] subgraphs!
--- Running PIR pass [delete_quant_dequant_linear_op_pass]
--- Running PIR pass [delete_weight_dequant_linear_op_pass]
--- Running PIR pass [map_op_to_another_pass]
--- Running PIR pass [identity_op_clean_pass]
--- Running PIR pass [silu_fuse_pass]
--- Running PIR pass [conv2d_bn_fuse_pass]
I0428 02:51:02.927734 60421 print_statistics.cc:50] --- detected [50] subgraphs!
--- Running PIR pass [conv2d_add_act_fuse_pass]
I0428 02:51:02.959388 60421 print_statistics.cc:50] --- detected [61] subgraphs!
--- Running PIR pass [conv2d_add_fuse_pass]
I0428 02:51:02.969374 60421 print_statistics.cc:50] --- detected [16] subgraphs!
--- Running PIR pass [embedding_eltwise_layernorm_fuse_pass]
--- Running PIR pass [fused_rotary_position_embedding_pass]
--- Running PIR pass [multihead_matmul_fuse_pass]
--- Running PIR pass [matmul_add_act_fuse_pass]
--- Running PIR pass [fc_elementwise_layernorm_fuse_pass]
--- Running PIR pass [add_norm_fuse_pass]
--- Running PIR pass [group_norm_silu_fuse_pass]
--- Running PIR pass [matmul_scale_fuse_pass]
--- Running PIR pass [matmul_transpose_fuse_pass]
--- Running PIR pass [transpose_flatten_concat_fuse_pass]
--- Running PIR pass [remove_redundant_transpose_pass]
--- Running PIR pass [horizontal_fuse_pass]
--- Running PIR pass [transfer_layout_pass]
--- Running PIR pass [common_subexpression_elimination_pass]
I0428 02:51:03.007035 60421 print_statistics.cc:50] --- detected [168] subgraphs!
--- Running PIR pass [params_sync_among_devices_pass]
I0428 02:51:03.038362 60421 print_statistics.cc:50] --- detected [315] subgraphs!
--- Running PIR pass [constant_folding_pass]
I0428 02:51:03.041201 60421 pir_interpreter.cc:1568] New Executor is Running ...
W0428 02:51:03.041368 60421 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.7, Driver API Version: 12.6, Runtime API Version: 12.6
W0428 02:51:03.042938 60421 gpu_resources.cc:164] device: 0, cuDNN Version: 9.3.
I0428 02:51:03.044843 60421 pir_interpreter.cc:1592] pir interpreter is running by multi-thread mode ...
I0428 02:51:04.277839 60421 print_statistics.cc:44] --- detected [481, 999] subgraphs!
--- Running PIR pass [dead_code_elimination_pass]
I0428 02:51:04.282008 60421 print_statistics.cc:50] --- detected [611] subgraphs!
--- Running PIR pass [replace_fetch_with_shadow_output_pass]
I0428 02:51:04.283336 60421 print_statistics.cc:50] --- detected [2] subgraphs!
--- Running PIR pass [remove_shadow_feed_pass]
I0428 02:51:04.302112 60421 print_statistics.cc:50] --- detected [2] subgraphs!
--- Running PIR pass [inplace_pass]
I0428 02:51:04.334331 60421 print_statistics.cc:50] --- detected [44] subgraphs!
I0428 02:51:04.334729 60421 analysis_predictor.cc:1207] ======= pir optimization completed =======
trt_int8加速成功
[2025-04-28 02:51:04,492] [ INFO] pipeline_c2.py:704 - 0号线程,当前已处理帧0
I0428 02:51:04.547617 60421 pir_interpreter.cc:1589] pir interpreter is running by trace mode ...

Apr 28 '25 02:04 time-heart

Could allocating too few threads affect the model's processing speed? The default initial allocation is currently 200M.

Apr 28 '25 02:04 time-heart

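It is the trt=True parameter. A model exported with this parameter predicts more slowly than one exported without it: the model exported without the parameter takes about 35 ms per prediction, while the one exported with it takes about 60 ms.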

Apr 28 '25 03:04 time-heart

It looks like TRT is not actually being enabled... Did you modify the code?
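
One rough way to check a log for this is to list the pass names it contains and look for any that mention TensorRT. The sketch below is a heuristic only: it assumes TensorRT-related passes would carry "trt" or "tensorrt" in their names, which should be confirmed against the Paddle version in use. The sample text is a short excerpt of the logs posted in this thread.

```python
import re

PASS_PATTERN = re.compile(r"Running PIR pass \[([^\]]+)\]")


def pass_names(log_text):
    """Return the pass names that appear in a Paddle inference log."""
    return PASS_PATTERN.findall(log_text)


def trt_related(names):
    """Heuristic: keep names that mention TensorRT (the naming is an assumption)."""
    return [n for n in names if "trt" in n.lower() or "tensorrt" in n.lower()]


# Excerpt of the log posted earlier in this thread.
sample = (
    "--- Running PIR pass [silu_fuse_pass] "
    "--- Running PIR pass [conv2d_bn_fuse_pass] "
    "--- Running PIR pass [transfer_layout_pass] "
    "--- Running PIR pass [inplace_pass]"
)

names = pass_names(sample)
print(names)
print("TensorRT-looking passes:", trt_related(names))
```

An empty result does not prove that TensorRT is off, but it is a quick signal that the engine setup deserves a closer look.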

Apr 28 '25 07:04 Bobholamovic

The model export code is still the code from the pipeline.

Apr 28 '25 07:04 time-heart

I have not changed the model export code.

Apr 28 '25 08:04 time-heart