
Multi-stream video processing: performance degradation

Open time-heart opened this issue 8 months ago • 59 comments

Issue confirmed / Search before asking

  • [x] I have searched the issues and found no related answer.

Please ask your question

1 stream (5):
	read()
		total frame-fetch time: 203
	total model-processing time: 248; total inter-message time: 468.40125700000056 (count 2500, max 0.264324, min 0.115836), mean 0.187
2 streams (4):
	frame-fetch total: 203; model-processing total: 406; inter-message total: 620 (count 2500, max 0.527268, min 0.106337), mean 0.248
	frame-fetch total: 214; model-processing total: 435; inter-message total: 673 (count 2500, max 0.8767, min 0.113935), mean 0.269
3 streams (3):
	frame-fetch total: 287; model-processing total: 624; inter-message total: 913 (count 2500, max 0.685565, min 0.143708), mean 0.365
	frame-fetch total: 293; model-processing total: 587; inter-message total: 896 (count 2500, max 0.713711, min 0.129972), mean 0.358
	frame-fetch total: 238; model-processing total: 726; inter-message total: 995 (count 2500, max 1.004392, min 0.114476), mean 0.398

Why does model processing get slower and slower as the number of streams grows? CPU: 8 cores; single GPU.

time-heart avatar Apr 25 '25 03:04 time-heart

Could you provide a minimal reproducible example? The specific implementation can also matter. For example, if only a single CUDA stream is used on a GPU, all operations execute serially, even when multiple threads run inference at the same time.
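To make the serialization point concrete, here is a minimal, hypothetical simulation (plain Python, no Paddle or CUDA involved): a single shared lock stands in for the single CUDA stream, so four "inference" threads take roughly the sum of their individual times instead of overlapping.

```python
import threading
import time

gpu_stream = threading.Lock()  # stands in for the single CUDA stream

def infer(results, idx, work_s=0.05):
    # Every "kernel launch" must acquire the one stream, so the four
    # threads execute their GPU work strictly one after another.
    with gpu_stream:
        time.sleep(work_s)  # stands in for kernel execution time
        results[idx] = True

results = {}
start = time.monotonic()
threads = [threading.Thread(target=infer, args=(results, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start
print(f"4 concurrent 'inferences' took {elapsed:.2f}s (~4 x 0.05s, not 0.05s)")
```

With a dedicated stream per predictor (or a process per stream), the four units of work could overlap instead.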

Bobholamovic avatar Apr 25 '25 03:04 Bobholamovic

Could you provide a minimal reproducible example? The specific implementation can also matter. For example, if only a single CUDA stream is used on a GPU, all operations execute serially, even when multiple threads run inference at the same time.

python pipeline/pipeline_c2.py --config pipeline/config/infer_cfg_2.yml --device=gpu --do_break_in_counting --region_type=custom --illegal_parking_time=1 --video_file=resource/video-10min/r5.mp4 --run_mode trt_int8 --trt_calib_mode True

python pipeline/pipeline_c2.py --config pipeline/config/infer_cfg_2.yml --device=gpu --do_break_in_counting --region_type=custom --illegal_parking_time=1 --video_file=resource/video-10min/r4.mp4 --run_mode trt_int8 --trt_calib_mode True

With multithreading the latency was higher; after switching to two processes the latency dropped noticeably, but once more streams are processed, the model-processing time goes up again.

time-heart avatar Apr 25 '25 03:04 time-heart

For the multithreaded case, each predictor needs its own CUDA stream to get a speedup. For the multiprocess case — "after switching to two processes the latency dropped noticeably, but once more streams are processed, the model-processing time goes up again" — which timing went down and which went up, exactly? Also, judging from your example, are you starting two Python interpreters that run the same script simultaneously to achieve multiprocessing?
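For reference, the two manually launched interpreters can be driven from one launcher script. This is a hedged sketch: the paths and flags are copied from the commands quoted above, and `build_cmd` is a helper name introduced here, not part of PaddleDetection.

```python
import subprocess
import sys

VIDEOS = ["resource/video-10min/r5.mp4", "resource/video-10min/r4.mp4"]

def build_cmd(video_file):
    # Mirrors the command line quoted earlier in this thread.
    return [sys.executable, "pipeline/pipeline_c2.py",
            "--config", "pipeline/config/infer_cfg_2.yml",
            "--device=gpu", "--do_break_in_counting",
            "--region_type=custom", "--illegal_parking_time=1",
            f"--video_file={video_file}",
            "--run_mode", "trt_int8", "--trt_calib_mode", "True"]

# One OS process per stream; separate processes do not share a Python
# interpreter, so they cannot contend on the GIL (GPU contention remains).
# procs = [subprocess.Popen(build_cmd(v)) for v in VIDEOS]
# for p in procs:
#     p.wait()
```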

Bobholamovic avatar Apr 25 '25 06:04 Bobholamovic

For the multithreaded case, each predictor needs its own CUDA stream to get a speedup. For the multiprocess case — "after switching to two processes the latency dropped noticeably, but once more streams are processed, the model-processing time goes up again" — which timing went down and which went up, exactly? Also, judging from your example, are you starting two Python interpreters that run the same script simultaneously to achieve multiprocessing?

Yes, the same script runs simultaneously. But why, when I process only one video stream, does this Python process reach 200% or even 300% CPU usage? My script starts no extra threads; there is only the main thread.

(screenshot attached)

time-heart avatar Apr 25 '25 06:04 time-heart

Some underlying image-processing libraries (e.g. OpenCV) and inference libraries (e.g. Paddle) may use multithreading internally to make full use of a multi-core CPU. If you want to disable that, you can configure the underlying libraries accordingly, though this may reduce inference speed.
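If you do want to cap that implicit threading, something like the following (run before the heavy libraries are imported) usually works. The environment variables are generic OpenMP/BLAS knobs, not Paddle-specific, and `cv2.setNumThreads(0)` disables OpenCV's internal thread pool; whether this helps or hurts throughput here is untested.

```python
import os

# These must be set before importing the libraries that read them.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = "1"

try:
    import cv2
    cv2.setNumThreads(0)  # turn off OpenCV's internal thread pool
except ImportError:
    pass  # OpenCV not installed; nothing to configure
```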

Bobholamovic avatar Apr 25 '25 06:04 Bobholamovic

Some underlying image-processing libraries (e.g. OpenCV) and inference libraries (e.g. Paddle) may use multithreading internally to make full use of a multi-core CPU. If you want to disable that, you can configure the underlying libraries accordingly, though this may reduce inference speed.

(screenshot attached) Processing a single video stream already involves 17 threads. If that's the case, the total tracking/detection time will grow with the number of streams. Is there a way around this?

time-heart avatar Apr 25 '25 07:04 time-heart

Some underlying image-processing libraries (e.g. OpenCV) and inference libraries (e.g. Paddle) may use multithreading internally to make full use of a multi-core CPU. If you want to disable that, you can configure the underlying libraries accordingly, though this may reduce inference speed.

(screenshot attached) Processing a single video stream already involves 17 threads. If that's the case, the total tracking/detection time will grow with the number of streams. Is there a way around this?

I compiled paddle-gpu on the device itself; the environment is a full JetPack 6.2 stack. I originally wanted to run the ultra-lightweight PP-YOLOE model, but ppyoloe_plus_crn_t_auxhead_relu_320_300e_coco.yml fails under TRT with an out-of-memory error, so I switched to ppyoloe_plus_crn_s_80e_coco.

time-heart avatar Apr 25 '25 07:04 time-heart

I'm not sure which library is responsible. If you are using TRT for inference, I'd suggest looking at OpenCV's settings; see: https://github.com/opencv/opencv/issues/15277

Bobholamovic avatar Apr 25 '25 09:04 Bobholamovic

I'm not sure which library is responsible. If you are using TRT for inference, I'd suggest looking at OpenCV's settings; see: opencv/opencv#15277

Why can't this environment run the ultra-lightweight model? CUDA 12.6, cuDNN 9.3.0.75-1, TensorRT 10.3.0.30-1. It reports insufficient memory, yet ppyoloe_plus_crn_s_80e_coco works fine.

time-heart avatar Apr 25 '25 09:04 time-heart

What exactly is the error? And what does the GPU memory usage look like?

Bobholamovic avatar Apr 25 '25 10:04 Bobholamovic

What exactly is the error? And what does the GPU memory usage look like?

It fails before any processing starts, at predictor creation: it says it needs to allocate 13 GB of memory but there isn't enough. I don't understand why it has to allocate that much up front. It previously ran fine on Windows with a 3060; on the Jetson device it won't load, so I can only substitute ppyoloe_plus_crn_s_80e_coco, but then each frame takes about 150 ms including detection + tracking.

time-heart avatar Apr 25 '25 12:04 time-heart

This may be related to the parameters used when converting the TensorRT model; for example, the dynamic-shape configuration can affect how much memory the model occupies.

Bobholamovic avatar Apr 25 '25 14:04 Bobholamovic

The same model loads without problems on a 3060 under Windows, and it also runs with plain Paddle on the Jetson.

(Replied via email to thread PaddlePaddle/PaddleDetection#9361.)

time-heart avatar Apr 26 '25 00:04 time-heart

  1. The Windows machine very likely has more resources than the Jetson;
  2. TRT does more optimization and usually needs more memory than plain Paddle.

Taken together, the cause I mentioned still seems likely. I'd suggest looking into the TRT optimization parameters.

Bobholamovic avatar Apr 27 '25 03:04 Bobholamovic

  1. The Windows machine very likely has more resources than the Jetson;
  2. TRT does more optimization and usually needs more memory than plain Paddle.

Taken together, the cause I mentioned still seems likely. I'd suggest looking into the TRT optimization parameters.

What do you mean by "look into the TRT optimization parameters"? I also tried ppyoloe_plus_crn_t_auxhead_320_300e_coco; compared with ppyoloe_plus_crn_s_80e_coco, the detection time under INT8 is not much different.

time-heart avatar Apr 27 '25 04:04 time-heart

  1. The Windows machine very likely has more resources than the Jetson;
  2. TRT does more optimization and usually needs more memory than plain Paddle.

Taken together, the cause I mentioned still seems likely. I'd suggest looking into the TRT optimization parameters.

In theory, the ultra-lightweight ppyoloe_plus_crn_t_auxhead_320_300e_coco should run faster than ppyoloe_plus_crn_s_80e_coco, but with 7 video streams processed simultaneously it was actually slower. The command is: python pipeline/pipeline_c2.py --config pipeline/config/infer_cfg_2.yml --device=gpu --do_break_in_counting --region_type=custom --illegal_parking_time=1 --video_file=resource/video-10min/r10.mp4 --run_mode trt_int8 --trt_calib_mode True, and no calibration file was generated.

time-heart avatar Apr 27 '25 07:04 time-heart

  1. The Windows machine very likely has more resources than the Jetson;
  2. TRT does more optimization and usually needs more memory than plain Paddle.

Taken together, the cause I mentioned still seems likely. I'd suggest looking into the TRT optimization parameters.

What do you mean by "look into the TRT optimization parameters"? I also tried ppyoloe_plus_crn_t_auxhead_320_300e_coco; compared with ppyoloe_plus_crn_s_80e_coco, the detection time under INT8 is not much different.

How did you set trt_min_shape, trt_opt_shape, and trt_max_shape?

Bobholamovic avatar Apr 27 '25 11:04 Bobholamovic

  1. The Windows machine very likely has more resources than the Jetson;
  2. TRT does more optimization and usually needs more memory than plain Paddle.

Taken together, the cause I mentioned still seems likely. I'd suggest looking into the TRT optimization parameters.

What do you mean by "look into the TRT optimization parameters"? I also tried ppyoloe_plus_crn_t_auxhead_320_300e_coco; compared with ppyoloe_plus_crn_s_80e_coco, the detection time under INT8 is not much different.

How did you set trt_min_shape, trt_opt_shape, and trt_max_shape?

1,640,1280

time-heart avatar Apr 27 '25 12:04 time-heart

Are all three set to these values? And is every model configured this way?

Bobholamovic avatar Apr 27 '25 15:04 Bobholamovic

Are all three set to these values? And is every model configured this way?

if run_mode in precision_map.keys():
    config.enable_tensorrt_engine(
        workspace_size=1 << 25,
        max_batch_size=batch_size,
        min_subgraph_size=min_subgraph_size,
        precision_mode=precision_map[run_mode],
        use_static=True,
        use_calib_mode=trt_calib_mode)
    config.set_optim_cache_dir("./tensorrt_cache")
    if use_dynamic_shape:
        min_input_shape = {
            'image': [batch_size, 3, trt_min_shape, trt_min_shape]
        }
        max_input_shape = {
            'image': [batch_size, 3, trt_max_shape, trt_max_shape]
        }
        opt_input_shape = {
            'image': [batch_size, 3, trt_opt_shape, trt_opt_shape]
        }
        config.set_trt_dynamic_shape_info(min_input_shape, max_input_shape,
                                          opt_input_shape)
        print('trt set dynamic shape done!')

Are the parameters you mentioned in the model's config file?
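One detail worth noting in the snippet above (an observation, not a confirmed cause of the slowdown): `workspace_size=1 << 25` gives TensorRT only 32 MiB of scratch memory for kernel tactic selection, which is quite small; deployments often grant on the order of 1 GiB (`1 << 30`). A quick check of the arithmetic:

```python
workspace = 1 << 25
print(f"{workspace} bytes = {workspace / 2**20:.0f} MiB")
```

A too-small workspace can force TensorRT to skip faster kernels, which would blunt the expected speedup from enabling TRT.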

time-heart avatar Apr 28 '25 01:04 time-heart

Are all three set to these values? And is every model configured this way?

if run_mode in precision_map.keys():
    config.enable_tensorrt_engine(
        workspace_size=1 << 25,
        max_batch_size=batch_size,
        min_subgraph_size=min_subgraph_size,
        precision_mode=precision_map[run_mode],
        use_static=True,
        use_calib_mode=trt_calib_mode)
    config.set_optim_cache_dir("./tensorrt_cache")
    if use_dynamic_shape:
        min_input_shape = {
            'image': [batch_size, 3, trt_min_shape, trt_min_shape]
        }
        max_input_shape = {
            'image': [batch_size, 3, trt_max_shape, trt_max_shape]
        }
        opt_input_shape = {
            'image': [batch_size, 3, trt_opt_shape, trt_opt_shape]
        }
        config.set_trt_dynamic_shape_info(min_input_shape, max_input_shape,
                                          opt_input_shape)
        print('trt set dynamic shape done!')

Are the parameters you mentioned in the model's config file?

mode: paddle
draw_threshold: 0.5
metric: COCO
use_dynamic_shape: false
arch: YOLO
min_subgraph_size: 3
Preprocess:
- interp: 2
  keep_ratio: false
  target_size:
  - 640
  - 640
  type: Resize
- mean:
  - 0.0
  - 0.0
  - 0.0
  norm_type: none
  std:
  - 1.0
  - 1.0
  - 1.0
  type: NormalizeImage
- type: Permute

This is the s model's config file. The t model's config file is:

mode: paddle
draw_threshold: 0.5
metric: COCO
use_dynamic_shape: false
arch: PPYOLOE
min_subgraph_size: 3
Preprocess:
- interp: 2
  keep_ratio: false
  target_size:
  - 320
  - 320
  type: Resize
- mean:
  - 0.0
  - 0.0
  - 0.0
  norm_type: none
  std:
  - 1.0
  - 1.0
  - 1.0
  type: NormalizeImage
- type: Permute

The ultra-lightweight model is no better than the small one, which shouldn't be the case in theory. I suspect TRT acceleration isn't actually being used, even though the predictor does load successfully.

precision_map = {
    'trt_int8': Config.Precision.Int8,
    'trt_fp32': Config.Precision.Float32,
    'trt_fp16': Config.Precision.Half
}
if run_mode in precision_map.keys():
    config.enable_tensorrt_engine(
        workspace_size=1 << 25,
        max_batch_size=batch_size,
        min_subgraph_size=min_subgraph_size,
        precision_mode=precision_map[run_mode],
        use_static=True,
        use_calib_mode=trt_calib_mode)
    config.set_optim_cache_dir("./tensorrt_cache")
    if use_dynamic_shape:
        min_input_shape = {
            'image': [batch_size, 3, trt_min_shape, trt_min_shape]
        }
        max_input_shape = {
            'image': [batch_size, 3, trt_max_shape, trt_max_shape]
        }
        opt_input_shape = {
            'image': [batch_size, 3, trt_opt_shape, trt_opt_shape]
        }
        config.set_trt_dynamic_shape_info(min_input_shape, max_input_shape,
                                          opt_input_shape)
        print('trt set dynamic shape done!')

config.disable_glog_info()
config.enable_memory_optim()
config.switch_use_feed_fetch_ops(False)
predictor = create_predictor(config)
print(f"{run_mode}加速成功")  # "acceleration succeeded"

time-heart avatar Apr 28 '25 01:04 time-heart

Are all three set to these values? And is every model configured this way?

All are using the defaults: trt_min_shape=1, trt_max_shape=1280, trt_opt_shape=640.
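For context, a rough back-of-envelope on those defaults (plain arithmetic, assuming float32 and batch size 1, not measured on the device): the optimization profile spans 1x1 up to 1280x1280, and TensorRT must provision for the max shape even if the model only ever sees 320- or 640-pixel inputs. The input tensor itself is small, but intermediate activations scale the same way, which is one plausible way an oversized `trt_max_shape` inflates memory far beyond what the 320-pixel model actually needs.

```python
def input_bytes(n, c, h, w, bytes_per_elem=4):
    # Size of one float32 NCHW input tensor.
    return n * c * h * w * bytes_per_elem

for name, side in [("trt_min_shape", 1), ("trt_opt_shape", 640), ("trt_max_shape", 1280)]:
    mib = input_bytes(1, 3, side, side) / 2**20
    print(f"{name}={side}: input tensor ~ {mib:.2f} MiB")
```

Tightening the range (e.g. min=opt=max=320 for the t model, since its config resizes every frame to 320x320) would let TensorRT provision for exactly one shape.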

time-heart avatar Apr 28 '25 01:04 time-heart

That does look odd. Could you share the execution logs for both the lightweight and the ultra-lightweight model?

Bobholamovic avatar Apr 28 '25 02:04 Bobholamovic

That does look odd. Could you share the execution logs for both the lightweight and the ultra-lightweight model?

config.disable_glog_info() — do you mean the logs at prediction time?

time-heart avatar Apr 28 '25 02:04 time-heart

That does look odd. Could you share the execution logs for both the lightweight and the ultra-lightweight model?

--- Running PIR pass [add_shadow_output_after_dead_parameter_pass]
I0428 02:49:58.460706 57741 print_statistics.cc:50] --- detected [8] subgraphs!
--- Running PIR pass [delete_quant_dequant_linear_op_pass]
--- Running PIR pass [delete_weight_dequant_linear_op_pass]
--- Running PIR pass [map_op_to_another_pass]
--- Running PIR pass [identity_op_clean_pass]
--- Running PIR pass [silu_fuse_pass]
--- Running PIR pass [conv2d_bn_fuse_pass]
I0428 02:49:58.474509 57741 print_statistics.cc:50] --- detected [56] subgraphs!
--- Running PIR pass [conv2d_add_act_fuse_pass]
--- Running PIR pass [conv2d_add_fuse_pass]
I0428 02:49:58.521495 57741 print_statistics.cc:50] --- detected [83] subgraphs!
--- Running PIR pass [embedding_eltwise_layernorm_fuse_pass]
--- Running PIR pass [fused_rotary_position_embedding_pass]
--- Running PIR pass [multihead_matmul_fuse_pass]
--- Running PIR pass [matmul_add_act_fuse_pass]
--- Running PIR pass [fc_elementwise_layernorm_fuse_pass]
--- Running PIR pass [add_norm_fuse_pass]
--- Running PIR pass [group_norm_silu_fuse_pass]
--- Running PIR pass [matmul_scale_fuse_pass]
--- Running PIR pass [matmul_transpose_fuse_pass]
--- Running PIR pass [transpose_flatten_concat_fuse_pass]
--- Running PIR pass [remove_redundant_transpose_pass]
--- Running PIR pass [horizontal_fuse_pass]
--- Running PIR pass [transfer_layout_pass]
--- Running PIR pass [common_subexpression_elimination_pass]
I0428 02:49:58.563827 57741 print_statistics.cc:50] --- detected [185] subgraphs!
--- Running PIR pass [params_sync_among_devices_pass]
I0428 02:49:58.624049 57741 print_statistics.cc:50] --- detected [345] subgraphs!
--- Running PIR pass [constant_folding_pass]
I0428 02:49:58.627278 57741 pir_interpreter.cc:1568] New Executor is Running ...
W0428 02:49:58.627472 57741 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.7, Driver API Version: 12.6, Runtime API Version: 12.6
W0428 02:49:58.629119 57741 gpu_resources.cc:164] device: 0, cuDNN Version: 9.3.
I0428 02:49:58.631258 57741 pir_interpreter.cc:1592] pir interpreter is running by multi-thread mode ...
I0428 02:49:59.968214 57741 print_statistics.cc:44] --- detected [526, 1142] subgraphs!
--- Running PIR pass [dead_code_elimination_pass]
I0428 02:49:59.972285 57741 print_statistics.cc:50] --- detected [677] subgraphs!
--- Running PIR pass [replace_fetch_with_shadow_output_pass]
I0428 02:49:59.973526 57741 print_statistics.cc:50] --- detected [2] subgraphs!
--- Running PIR pass [remove_shadow_feed_pass]
I0428 02:49:59.992839 57741 print_statistics.cc:50] --- detected [2] subgraphs!
--- Running PIR pass [inplace_pass]
I0428 02:50:00.029798 57741 print_statistics.cc:50] --- detected [43] subgraphs!
I0428 02:50:00.030298 57741 analysis_predictor.cc:1207] ======= pir optimization completed =======
trt_int8加速成功
[2025-04-28 02:50:00,191] [ INFO] pipeline_c2.py:704 - 0号线程,当前已处理帧0
I0428 02:50:00.274374 57741 pir_interpreter.cc:1589] pir interpreter is running by trace mode ...

This is from the s model.

--- Running PIR pass [add_shadow_output_after_dead_parameter_pass]
I0428 02:51:02.915006 60421 print_statistics.cc:50] --- detected [8] subgraphs!
--- Running PIR pass [delete_quant_dequant_linear_op_pass]
--- Running PIR pass [delete_weight_dequant_linear_op_pass]
--- Running PIR pass [map_op_to_another_pass]
--- Running PIR pass [identity_op_clean_pass]
--- Running PIR pass [silu_fuse_pass]
--- Running PIR pass [conv2d_bn_fuse_pass]
I0428 02:51:02.927734 60421 print_statistics.cc:50] --- detected [50] subgraphs!
--- Running PIR pass [conv2d_add_act_fuse_pass]
I0428 02:51:02.959388 60421 print_statistics.cc:50] --- detected [61] subgraphs!
--- Running PIR pass [conv2d_add_fuse_pass]
I0428 02:51:02.969374 60421 print_statistics.cc:50] --- detected [16] subgraphs!
--- Running PIR pass [embedding_eltwise_layernorm_fuse_pass]
--- Running PIR pass [fused_rotary_position_embedding_pass]
--- Running PIR pass [multihead_matmul_fuse_pass]
--- Running PIR pass [matmul_add_act_fuse_pass]
--- Running PIR pass [fc_elementwise_layernorm_fuse_pass]
--- Running PIR pass [add_norm_fuse_pass]
--- Running PIR pass [group_norm_silu_fuse_pass]
--- Running PIR pass [matmul_scale_fuse_pass]
--- Running PIR pass [matmul_transpose_fuse_pass]
--- Running PIR pass [transpose_flatten_concat_fuse_pass]
--- Running PIR pass [remove_redundant_transpose_pass]
--- Running PIR pass [horizontal_fuse_pass]
--- Running PIR pass [transfer_layout_pass]
--- Running PIR pass [common_subexpression_elimination_pass]
I0428 02:51:03.007035 60421 print_statistics.cc:50] --- detected [168] subgraphs!
--- Running PIR pass [params_sync_among_devices_pass]
I0428 02:51:03.038362 60421 print_statistics.cc:50] --- detected [315] subgraphs!
--- Running PIR pass [constant_folding_pass]
I0428 02:51:03.041201 60421 pir_interpreter.cc:1568] New Executor is Running ...
W0428 02:51:03.041368 60421 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.7, Driver API Version: 12.6, Runtime API Version: 12.6
W0428 02:51:03.042938 60421 gpu_resources.cc:164] device: 0, cuDNN Version: 9.3.
I0428 02:51:03.044843 60421 pir_interpreter.cc:1592] pir interpreter is running by multi-thread mode ...
I0428 02:51:04.277839 60421 print_statistics.cc:44] --- detected [481, 999] subgraphs!
--- Running PIR pass [dead_code_elimination_pass]
I0428 02:51:04.282008 60421 print_statistics.cc:50] --- detected [611] subgraphs!
--- Running PIR pass [replace_fetch_with_shadow_output_pass]
I0428 02:51:04.283336 60421 print_statistics.cc:50] --- detected [2] subgraphs!
--- Running PIR pass [remove_shadow_feed_pass]
I0428 02:51:04.302112 60421 print_statistics.cc:50] --- detected [2] subgraphs!
--- Running PIR pass [inplace_pass]
I0428 02:51:04.334331 60421 print_statistics.cc:50] --- detected [44] subgraphs!
I0428 02:51:04.334729 60421 analysis_predictor.cc:1207] ======= pir optimization completed =======
trt_int8加速成功
[2025-04-28 02:51:04,492] [ INFO] pipeline_c2.py:704 - 0号线程,当前已处理帧0
I0428 02:51:04.547617 60421 pir_interpreter.cc:1589] pir interpreter is running by trace mode ...

This is from the t model.

time-heart avatar Apr 28 '25 02:04 time-heart

That does look odd. Could you share the execution logs for both the lightweight and the ultra-lightweight model?

Could too small an allocation affect model-processing speed? The default initialization currently allocates 200 MB.

time-heart avatar Apr 28 '25 02:04 time-heart

That does look odd. Could you share the execution logs for both the lightweight and the ultra-lightweight model?

About the trt=True flag at model-export time: a model exported with this flag actually predicts more slowly than one exported without it. Without the flag, prediction takes about 35 ms; with it, about 60 ms.

time-heart avatar Apr 28 '25 03:04 time-heart

It looks like TRT isn't actually being enabled... Have you modified any code?

Bobholamovic avatar Apr 28 '25 07:04 Bobholamovic

It looks like TRT isn't actually being enabled... Have you modified any code?

The model-export code is still the original code in pipeline.

time-heart avatar Apr 28 '25 07:04 time-heart

It looks like TRT isn't actually being enabled... Have you modified any code?

The model-export code hasn't been touched.

time-heart avatar Apr 28 '25 08:04 time-heart