Paddle Inference: run the model with all parameters and OPs placed on the GPU
Please ask your question
In Paddle Inference, configuring paddle.inference.Config.enable_use_gpu() does enable GPU inference, but after turning on config.enable_profile() we found quite a lot of communication between the CPU and the GPU, and some variables also end up on the CPU. Is there a way to force Paddle Inference to place all compute nodes and parameters on the GPU?
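For reference, the GPU and profiling switches are set roughly like this (a minimal sketch; the memory-pool size, device id and file names are placeholders, and the full predictor-building code is posted further down in this thread):

import paddle.inference as paddle_infer

config = paddle_infer.Config("model.pdmodel", "model.pdiparams")
config.enable_use_gpu(1024, 0)   # initial GPU memory pool in MB, device id
config.enable_profile()          # emits the op-level profiling report shown below
predictor = paddle_infer.create_predictor(config)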
Partial performance profile:
Total time: 566354
Computation time Total: 151833 Ratio: 26.8089%
Framework overhead Total: 414521 Ratio: 73.1911%
------------------------- GpuMemCpy Summary -------------------------
GpuMemcpy Calls: 453535 Total: 211768 Ratio: 37.3914%
GpuMemcpyAsync Calls: 343740 Total: 14426.7 Ratio: 2.5473%
GpuMemcpySync Calls: 109795 Total: 197341 Ratio: 34.8441%
------------------------- Event Summary -------------------------
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::tensorrt_engine 84924 252365 66345.006053 (0.262893) 186019.876612 (0.737107)0.378767 111.662 2.97166 0.445596
thread0::set_value 2022 121091 111790.708466 (0.923196)9300.321439 (0.076804) 2.89463 139.663 59.8868 0.213808
GpuMemcpySync:GPU->CPU 4044 111448 111440.569877 (0.999933)7.513910 (0.000067) 0.019379 110.985 27.5589 0.196782
GpuMemcpySync:GPU->CPU 4044 79.3581 72.347765 (0.911662) 7.010364 (0.088338) 0.016734 3.1661 0.0196237 0.000140121
thread0::conditional_block_infer 4044 85390.8 43963.567256 (0.514851) 41427.265607 (0.485149) 0.027776 144.553 21.1154 0.150773
shape 4044 83446.5 42027.031341 (0.503640) 41419.479982 (0.496360) 0.039785 143.98 20.6346 0.14734
GpuMemcpySync:GPU->CPU 4044 83350.1 41930.647902 (0.503066) 41419.479982 (0.496934) 0.02129 143.95 20.6108 0.14717
GpuMemcpyAsync:GPU->CPU 4044 765.706 757.920346 (0.989832) 7.785625 (0.010168) 0.015516 0.891516 0.189344 0.00135199
fill_constant 8088 112.666 112.666446 (1.000000) 0.000000 (0.000000) 0.007712 0.065727 0.0139301 0.000198933
slice 4044 93.8496 93.849621 (1.000000) 0.000000 (0.000000) 0.01573 0.14972 0.0232071 0.000165708
unsqueeze2 6066 84.1806 84.180586 (1.000000) 0.000000 (0.000000) 0.007983 0.075986 0.0138774 0.000148636
set_value 2022 58.0157 58.015651 (1.000000) 0.000000 (0.000000) 0.020641 0.100874 0.0286922 0.000102437
concat 2022 44.041 44.041028 (1.000000) 0.000000 (0.000000) 0.012897 0.069434 0.0217809 7.77624e-05
tril_triu 2022 28.3607 28.360736 (1.000000) 0.000000 (0.000000) 0.0089 0.048696 0.0140261 5.0076e-05
scale 2022 27.0725 27.072541 (1.000000) 0.000000 (0.000000) 0.009676 0.048194 0.013389 4.78015e-05
assign 2022 25.7438 25.743849 (1.000000) 0.000000 (0.000000) 0.00796 0.041135 0.0127319 4.54554e-05
thread0::load_combine 1 34260.8 34260.825306 (1.000000) 0.000000 (0.000000) 34260.8 34260.8 34260.8 0.0604937
thread0::elementwise_add 82902 32386.1 31902.488146 (0.985066) 483.641839 (0.014934) 0.041437 7.43305 0.390656 0.0571836
GpuMemcpySync:CPU->GPU 82902 1662.73 1519.967576 (0.914141) 142.759372 (0.085859) 0.015683 3.60349 0.0200565 0.00293585
By setting GLOG_vmodule=operator=3 we can locate the nodes (and the corresponding code) that produce the communication. But even though we know which line of code produces the set_value op, we don't know why there is a GPU->CPU memory copy, nor whether there is a way to configure the whole graph to run on the GPU.
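This is roughly how we turn that logging on (a minimal sketch; setting the variable before paddle is imported reflects our assumption about when the GLOG flags are read, and exporting GLOG_vmodule=operator=3 in the shell before launching the script works as well):

import os

# GLOG flags are parsed when the native core initializes,
# so set them before importing paddle
os.environ["GLOG_vmodule"] = "operator=3"

import paddle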
Partial memory-transfer log; mainly, when running Paddle Inference with TensorRT int8, some nodes still live on the CPU.
I0807 08:28:01.381105 15684 operator.cc:1894] Transform Variable _generated_var_4 from data_type[float]:data_layout[NCHW]:place[Place(cpu)]:library_type[PLAIN] to data_type[float]:data_layout[Undefined(AnyLayout)]:place[Place(gpu:1)]:library_type[PLAIN]
I0807 08:28:01.381165 15684 operator.cc:277] Place(gpu:1) Op(elementwise_add), inputs:{X[matmul_39.tmp_0:float[1, 40, 64, 65]({})(Place(gpu:1))], Y[_generated_var_4:float[1, 1, 64, 65]({})(Place(cpu))]}, outputs:{Out[tmp_126:float[1, 40, 64, 65]({})(Place(gpu:1))]}.
I0807 08:28:01.384057 15684 operator.cc:277] Place(gpu:1) Op(shape), inputs:{Input[stack_40.tmp_0:float[40, 2, 40, 65, 128]({})(Place(gpu:1))]}, outputs:{Out[shape_6.tmp_0:int[5]({})(Place(cpu))]}.
I0807 08:28:01.384088 15684 operator.cc:277] Place(cpu) Op(slice), inputs:{EndsTensor[], EndsTensorList[], Input[shape_6.tmp_0:int[5]({})(Place(cpu))], StartsTensor[], StartsTensorList[]}, outputs:{Out[shape_6.tmp_0_slice_0:int[1]({})(Place(cpu))]}.
I0807 08:28:01.384120 15684 operator.cc:277] Place(gpu:1) Op(scale), inputs:{ScaleTensor[], X[user_id:int[1]({})(Place(gpu:1))]}, outputs:{Out[tmp_131:int[1]({})(Place(gpu:1))]}.
I0807 08:28:01.384141 15684 operator.cc:277] Place(cpu) Op(fill_constant), inputs:{ShapeTensor[], ShapeTensorList[], ValueTensor[]}, outputs:{Out[fill_constant_27.tmp_0:int[1]({})(Place(cpu))]}.
I0807 08:28:01.419982 15684 operator.cc:277] Place(gpu:1) Op(set_value), inputs:{EndsTensorList[tmp_131:int[1]({})(Place(gpu:1)), shape_6.tmp_0_slice_0:int[1]({})(Place(cpu))], Input[caches_rank_0:float[1, 40, 2, 40, 513, 128]({})(Place(gpu:1))], StartsTensorList[user_id:int[1]({})(Place(gpu:1)), fill_constant_27.tmp_0:int[1]({})(Place(cpu))], StepsTensorList[], ValueTensor[stack_40.tmp_0:float[40, 2, 40, 65, 128]({})(Place(gpu:1))]}, outputs:{Out[caches_rank_0:float[1, 40, 2, 40, 513, 128]({})(Place(gpu:1))]}.
Hi! We've received your issue and will arrange for technicians to answer it as soon as possible; please be patient. Please double-check that you have provided a clear problem description, reproduction code, environment & version, and error messages. You can also look through the API docs, FAQ, existing GitHub issues, and the AI community for an answer. Have a nice day!
Could you provide the code you use to run inference?
The two functions below are roughly the code that builds the predictor and runs it; for real inference we enable args.use_trt and args.int8.
def create_predictor(self, args):
    # pick the fp32 *.pdmodel in model_dir (int8-prefixed exports are skipped)
    model_files = os.listdir(args.model_dir)
    model_name = None
    for file in model_files:
        if file.endswith(".pdmodel") and not file.startswith("int8"):
            model_name = file[:-len(".pdmodel")]
            break
    if model_name is None or not os.path.exists(os.path.join(args.model_dir, f"{model_name}.pdiparams")):
        raise ValueError(f"{args.model_dir} is not valid")
    config = paddle.inference.Config(os.path.join(args.model_dir, f"{model_name}.pdmodel"),
                                     os.path.join(args.model_dir, f"{model_name}.pdiparams"))
    config.switch_ir_optim(True)
    if self.enable_profile:
        config.enable_profile()
    if self.debug_ir:
        config.switch_ir_debug()
    config.set_optim_cache_dir(os.path.join(args.model_dir, "_opt_cache"))
    # set GPU configs accordingly
    config.enable_use_gpu(1024 * 30, args.gpu_id)
    config.enable_memory_optim(False)
    paddle.device.set_device("gpu")
    if self.collect_shape:
        config.collect_shape_range_info(os.path.join(args.model_dir, "shape_range_info.pbtxt"))
    # config.exp_enable_use_gpu_fp16()  # fp16 was not supported for the quantize op
    if args.use_trt:
        if args.int8:
            config.enable_tensorrt_engine(
                workspace_size=1 << 30,
                precision_mode=inference.PrecisionType.Int8,
                max_batch_size=1,
                min_subgraph_size=3,
                use_static=True,  # True to export the optimized engine/config cache
                use_calib_mode=False)
        else:
            config.enable_tensorrt_engine(
                workspace_size=1 << 30,
                precision_mode=inference.PrecisionType.Float32,
                max_batch_size=1,
                min_subgraph_size=4,
                use_static=False,
                use_calib_mode=False)
        config.enable_tuned_tensorrt_dynamic_shape(os.path.join(args.model_dir, "shape_range_info.pbtxt"), True)
        # config.enable_tensorrt_oss()
    print("Enable TensorRT is: {}".format(config.tensorrt_engine_enabled()))
    # print("Enable TensorRT OSS is: {}".format(config.tensorrt_oss_enabled()))
    print(paddle.inference.Config.summary(config), flush=True)
    # paddle.inference.Config.disable_glog_info(config)
    # with paddle.fluid.device_guard("gpu"):
    predictor = paddle.inference.create_predictor(config)
    input_handles = [
        predictor.get_input_handle(name)
        for name in predictor.get_input_names()
    ]
    output_handles = [
        predictor.get_output_handle(name)
        for name in predictor.get_output_names()
    ]
    self.input_handles = input_handles
    self.output_handles = output_handles
    return predictor


def predict_batch(self, data):
    # feed the four inputs (each copy_from_cpu is a host -> device copy)
    self.input_handles[0].copy_from_cpu(data[0])
    self.input_handles[1].copy_from_cpu(data[1])
    self.input_handles[2].copy_from_cpu(data[2])
    self.input_handles[3].copy_from_cpu(data[3])
    self.predictor.run()
    # output = [
    #     output_handle.copy_to_cpu() for output_handle in self.output_handles
    # ]
    # only the first output is fetched (device -> host copy)
    output = [self.output_handles[0].copy_to_cpu()]
    return output
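For context, a hedged sketch of how these two functions might be wired together and called (the wrapper class, its flag attributes and the argument names are inferred from the excerpt above; paths, shapes and dtypes are placeholders):

from types import SimpleNamespace

import numpy as np


class Runner:
    # reuse the two functions above as methods of a small wrapper class
    create_predictor = create_predictor
    predict_batch = predict_batch

    def __init__(self, args):
        # flags read inside create_predictor()
        self.enable_profile = True
        self.debug_ir = False
        self.collect_shape = False
        # predict_batch() reads self.predictor, so keep the return value here
        self.predictor = self.create_predictor(args)


args = SimpleNamespace(model_dir="exported_model", gpu_id=0, use_trt=True, int8=True)
runner = Runner(args)
# four inputs; the real shapes and dtypes depend on the exported model
outputs = runner.predict_batch([np.zeros([1, 1], dtype="int64") for _ in range(4)])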
Some ops are currently not supported for conversion to TRT and fall back to the native Paddle operators. We would need the concrete model to analyze whether everything can be placed inside TRT.
By the way: is the current TRT int8 performance not meeting your needs? What GPU is this running on, how fast does it need to be, and what is your business scenario?
Right. We do quantization-aware training and then run inference with TRT; TRT does run some INT8 subgraphs, and on an A100 it is quite fast. Looking at the pure compute time alone, I think that part is fine. The problem is outside the TRT subgraphs: the device/host memory copies of the native Paddle operators take up too much time. If we removed the code that we know produces a lot of CPU-GPU copies, we would get a large speedup, but in practice it cannot be removed. So we'd like to know whether Inference can move all native Paddle OPs onto the GPU instead of leaving them on the CPU.
By the way: is the current TRT int8 performance not meeting your needs? What GPU is this running on, how fast does it need to be, and what is your business scenario?
On an A100 the pure compute performance is sufficient, but in practice there are also conditional_block and set_value ops that currently cannot be converted into the TRT subgraph, so we need to optimize that part. Our business scenario is code generation; a large NLP model has to keep single-step decoding at roughly 30 ms.
Could you provide the model and a simple runnable demo? Whether all ops can go into the TRT subgraph has to be checked against the specific ops before I can give you an answer. Also, by "pure compute time" do you mean timing the whole predictor.run() (without counting copy_from_cpu / copy_to_cpu), or adding up the per-op compute times from the profile?
Could you provide the model and a simple runnable demo? Whether all ops can go into the TRT subgraph has to be checked against the specific ops before I can give you an answer.
The full model is very large (tens of billions of parameters), so it is hard to provide a runnable demo; if it is really needed, we can initialize a small model with a few tens of millions of parameters for testing. Our impression is that the ops doing most of the computation are already inside TRT, and the native Paddle ops do not necessarily have to go in, because even if they did the gain would probably not be large. As for the per-op timings, two complete profiles are attached below.
By "pure compute time" do you mean timing the whole predictor.run() (without counting copy_from_cpu / copy_to_cpu), or adding up the per-op compute times from the profile?
Neither, really; I just looked at the tensorrt_engine time in the profile. It covers the backbone of the model (in the earlier quantization-aware training step only the matmul ops were quantized) and is the dominant cost; the remaining time is mostly GpuMemcpySync.
Appendix 1: the intermediate computation state is kept inside the graph, i.e. a large tensor is written via set_value into an internal Parameter variable so that this Parameter tensor can be reused the next time the graph is invoked.
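The pattern is roughly the following (a hedged sketch, not the real model code; the layer name, shapes and initializer are made up, and the claim that the sliced assignment lowers to the set_value op after dynamic-to-static export is our reading of the operator log above):

import paddle


class CachedDecoder(paddle.nn.Layer):
    def __init__(self, cache_shape):
        super().__init__()
        # persistent buffer kept inside the graph between calls
        self.caches = self.create_parameter(
            shape=cache_shape, dtype="float32",
            default_initializer=paddle.nn.initializer.Constant(0.0))

    def forward(self, new_state, start, end):
        # sliced assignment into the Parameter; in the exported static
        # graph this is what shows up as the set_value op in the profile
        self.caches[:, start:end] = new_state
        return self.caches

In the real model the slice bounds are themselves tensors (user_id and a slice of a shape output), which matches the StartsTensorList/EndsTensorList inputs of the set_value op in the log and is where the GPU->CPU copies of the index tensors appear.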
------------------------- Overhead Summary -------------------------
Total time: 566354
Computation time Total: 151833 Ratio: 26.8089%
Framework overhead Total: 414521 Ratio: 73.1911%
------------------------- GpuMemCpy Summary -------------------------
GpuMemcpy Calls: 453535 Total: 211768 Ratio: 37.3914%
GpuMemcpyAsync Calls: 343740 Total: 14426.7 Ratio: 2.5473%
GpuMemcpySync Calls: 109795 Total: 197341 Ratio: 34.8441%
------------------------- Event Summary -------------------------
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::tensorrt_engine 84924 252365 66345.006053 (0.262893) 186019.876612 (0.737107)0.378767 111.662 2.97166 0.445596
thread0::set_value 2022 121091 111790.708466 (0.923196)9300.321439 (0.076804) 2.89463 139.663 59.8868 0.213808
GpuMemcpySync:GPU->CPU 4044 111448 111440.569877 (0.999933)7.513910 (0.000067) 0.019379 110.985 27.5589 0.196782
GpuMemcpySync:GPU->CPU 4044 79.3581 72.347765 (0.911662) 7.010364 (0.088338) 0.016734 3.1661 0.0196237 0.000140121
thread0::conditional_block_infer 4044 85390.8 43963.567256 (0.514851) 41427.265607 (0.485149) 0.027776 144.553 21.1154 0.150773
shape 4044 83446.5 42027.031341 (0.503640) 41419.479982 (0.496360) 0.039785 143.98 20.6346 0.14734
GpuMemcpySync:GPU->CPU 4044 83350.1 41930.647902 (0.503066) 41419.479982 (0.496934) 0.02129 143.95 20.6108 0.14717
GpuMemcpyAsync:GPU->CPU 4044 765.706 757.920346 (0.989832) 7.785625 (0.010168) 0.015516 0.891516 0.189344 0.00135199
fill_constant 8088 112.666 112.666446 (1.000000) 0.000000 (0.000000) 0.007712 0.065727 0.0139301 0.000198933
slice 4044 93.8496 93.849621 (1.000000) 0.000000 (0.000000) 0.01573 0.14972 0.0232071 0.000165708
unsqueeze2 6066 84.1806 84.180586 (1.000000) 0.000000 (0.000000) 0.007983 0.075986 0.0138774 0.000148636
set_value 2022 58.0157 58.015651 (1.000000) 0.000000 (0.000000) 0.020641 0.100874 0.0286922 0.000102437
concat 2022 44.041 44.041028 (1.000000) 0.000000 (0.000000) 0.012897 0.069434 0.0217809 7.77624e-05
tril_triu 2022 28.3607 28.360736 (1.000000) 0.000000 (0.000000) 0.0089 0.048696 0.0140261 5.0076e-05
scale 2022 27.0725 27.072541 (1.000000) 0.000000 (0.000000) 0.009676 0.048194 0.013389 4.78015e-05
assign 2022 25.7438 25.743849 (1.000000) 0.000000 (0.000000) 0.00796 0.041135 0.0127319 4.54554e-05
thread0::load_combine 1 34260.8 34260.825306 (1.000000) 0.000000 (0.000000) 34260.8 34260.8 34260.8 0.0604937
thread0::elementwise_add 82902 32386.1 31902.488146 (0.985066) 483.641839 (0.014934) 0.041437 7.43305 0.390656 0.0571836
GpuMemcpySync:CPU->GPU 82902 1662.73 1519.967576 (0.914141) 142.759372 (0.085859) 0.015683 3.60349 0.0200565 0.00293585
thread0::GpuMemcpyAsync:GPU->CPU 2022 9077.49 9056.259504 (0.997661) 21.234069 (0.002339) 0.064273 6.70269 4.48936 0.016028
thread0::slice 254772 7616.31 4807.662232 (0.631232) 2808.648799 (0.368768) 0.007288 3.65566 0.0298946 0.013448
GpuMemcpySync:GPU->CPU 6066 139.139 127.823858 (0.918676) 11.315296 (0.081324) 0.017448 0.790898 0.0229375 0.000245675
thread0::squeeze2 163782 4891.37 3940.909905 (0.805686) 950.463470 (0.194314) 0.016567 6.85692 0.0298651 0.0086366
GpuMemcpyAsync(same_gpu):GPU->GPU 163782 2239.14 1288.679179 (0.575523) 950.463470 (0.424477) 0.008636 3.26116 0.0136715 0.00395361
thread0::unsqueeze2 163782 4343.96 3403.626653 (0.783531) 940.333076 (0.216469) 0.017083 8.78246 0.0265228 0.00767005
GpuMemcpyAsync(same_gpu):GPU->GPU 163782 2255.77 1315.438264 (0.583143) 940.333076 (0.416857) 0.008859 3.24548 0.013773 0.00398297
thread0::matmul_v2 82902 4255.83 1991.024797 (0.467834) 2264.807522 (0.532166) 0.021865 3.62266 0.0513357 0.00751444
thread0::transpose2 4044 3764.2 95.015729 (0.025242) 3669.189164 (0.974758) 0.019019 3.5454 0.930812 0.00664638
thread0::matmul 80880 2680.44 2204.673621 (0.822504) 475.768251 (0.177496) 0.025708 3.17326 0.033141 0.00473281
thread0::softmax 82902 1998.47 1419.676089 (0.710383) 578.789176 (0.289617) 0.015852 3.7436 0.0241064 0.00352865
thread0::GpuMemcpySync:CPU->GPU 607 479.639 244.608785 (0.509985) 235.030365 (0.490015) 0.017171 254.206 0.79018 0.00084689
thread0::fc 2022 344.011 145.645139 (0.423374) 198.365822 (0.576626) 0.132844 0.953499 0.170134 0.000607414
thread0::scale 12132 343.638 295.702562 (0.860505) 47.935665 (0.139495) 0.013057 2.13108 0.0283249 0.000606755
GpuMemcpySync:CPU->GPU 2022 40.326 36.942696 (0.916101) 3.383310 (0.083899) 0.016939 0.159838 0.0199436 7.12029e-05
thread0::lookup_table_v2 6066 174.207 98.466242 (0.565227) 75.740396 (0.434773) 0.020771 0.433557 0.0287185 0.000307593
thread0::shape 6066 166.182 166.182457 (1.000000) 0.000000 (0.000000) 0.005424 2.57507 0.0273957 0.000293425
thread0::range 2022 150.713 136.840719 (0.907956) 13.872293 (0.092044) 0.062539 0.214579 0.0745366 0.000266111
GpuMemcpySync:GPU->CPU 4044 83.7343 76.447406 (0.912976) 7.286931 (0.087024) 0.017466 0.165231 0.0207058 0.000147848
thread0::fill_constant 8088 136.177 130.178838 (0.955950) 5.998614 (0.044050) 0.005876 1.22548 0.016837 0.000240446
thread0::select_input 2022 77.5685 73.690682 (0.950008) 3.877838 (0.049992) 0.03227 0.148722 0.0383623 0.000136961
GpuMemcpySync:GPU->CPU 2022 57.8582 53.980349 (0.932977) 3.877838 (0.067023) 0.023765 0.140111 0.0286143 0.000102159
thread0::cast 2022 69.7333 63.294340 (0.907663) 6.438937 (0.092337) 0.024845 0.120938 0.0344873 0.000123127
thread0::reshape2 2022 65.655 58.275004 (0.887594) 7.380003 (0.112406) 0.023748 0.78456 0.0324703 0.000115926
GpuMemcpyAsync(same_gpu):GPU->GPU 2022 26.42 19.039993 (0.720666) 7.380003 (0.279334) 0.010362 0.047188 0.0130663 4.66493e-05
thread0::greater_than 2022 62.3135 54.283254 (0.871132) 8.030226 (0.128868) 0.019128 1.22689 0.0308177 0.000110026
thread0::GpuMemcpyAsync:CPU->GPU 8088 62.2141 48.671651 (0.782325) 13.542435 (0.217675) 0.004524 1.16102 0.00769215 0.00010985
thread0::logical_not 2022 51.6026 44.464291 (0.861667) 7.138326 (0.138333) 0.015795 0.999258 0.0255206 9.11138e-05
thread0::expand_as 2022 48.2833 40.616640 (0.841214) 7.666709 (0.158786) 0.016333 0.070172 0.023879 8.5253e-05
-------------------------> Memory Profiling Report <-------------------------
Event Alloc Calls Size(MB) Free Calls Size(MB)
Place(cpu):set_value/compute 4044 0.0154266 4044 0.0154266
Place(cpu):set_value/infer_shape 4044 0.0154266 4044 0.0154266
Place(cpu):range/compute 4044 0.0154266 4044 0.0154266
Place(cpu):conditional_block_infer 4044 0.00385666 36396 395331
Place(cpu):load_combine/compute 1247 49981.3 0 0
Place(cpu):slice/compute 6069 0.0231514 6066 0.02314
Place(cpu):Unknown 607 2071.11 1865 52052.4
Place(cpu):conditional_block_infer/set_value/compute 4044 0.982956 4044 0.982956
Place(cpu):conditional_block_infer/shape/compute 8088 395325 0 0
Place(cpu):conditional_block_infer/fill_constant/compute8088 1.32643 0 0
Place(cpu):conditional_block_infer/concat/compute 2022 1.30595 0 0
Place(cpu):fill_constant/compute 3 1.52588e-05 0 0
Place(cpu):conditional_block_infer/slice/compute 4044 0.0154266 0 0
Place(cpu):conditional_block_infer/unsqueeze2/compute 6066 1.31871 0 0
Place(cpu):select_input 2023 0.0235825 2022 0.00771332
Place(cpu):shape/compute 3 4.57764e-05 0 0
Place(cpu):conditional_block_infer/tril_triu/compute 2022 0.335758 0 0
Place(cpu):conditional_block_infer/scale/compute 2022 1.30595 0 0
Place(cpu):conditional_block_infer/assign/compute 1 0.0158691 0 0
Appendix 2: the intermediate computation state is not kept in the graph and is recomputed on every call.
-------------------------> Profiling Report <-------------------------
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
------------------------- Overhead Summary -------------------------
Total time: 245947
Computation time Total: 53450.3 Ratio: 21.7324%
Framework overhead Total: 192497 Ratio: 78.2676%
------------------------- GpuMemCpy Summary -------------------------
GpuMemcpy Calls: 279443 Total: 11978.8 Ratio: 4.87049%
GpuMemcpyAsync Calls: 181980 Total: 9003.71 Ratio: 3.66083%
GpuMemcpySync Calls: 97463 Total: 2975.13 Ratio: 1.20966%
------------------------- Event Summary -------------------------
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::tensorrt_engine 82902 143176 63242.445148 (0.441712) 79933.290073 (0.558288) 0.372081 109.478 1.72705 0.582139
thread0::elementwise_add 82902 36840.2 36198.357173 (0.982578) 641.811389 (0.017422) 0.039618 6.56421 0.444382 0.149789
GpuMemcpySync:CPU->GPU 82902 2179 1925.564775 (0.883693) 253.432582 (0.116307) 0.015005 3.17831 0.026284 0.0088596
thread0::load_combine 1 26269.4 26269.424771 (1.000000) 0.000000 (0.000000) 26269.4 26269.4 26269.4 0.106809
thread0::slice 252750 8100.39 4986.685972 (0.615610) 3113.706453 (0.384390) 0.006148 110.774 0.032049 0.0329355
GpuMemcpySync:GPU->CPU 6066 139.444 128.132557 (0.918881) 11.311611 (0.081119) 0.01784 0.33543 0.0229878 0.000566967
thread0::matmul_v2 82902 6353.23 3322.985853 (0.523039) 3030.239454 (0.476961) 0.040063 3.06378 0.0766354 0.0258316
thread0::GpuMemcpyAsync:GPU->CPU 2022 5540.07 5518.924731 (0.996184) 21.141026 (0.003816) 0.260396 3.6533 2.73989 0.0225254
thread0::matmul 80880 5381.54 4234.615552 (0.786877) 1146.928714 (0.213123) 0.055669 3.19345 0.0665374 0.0218809
thread0::unsqueeze2 163782 4578.76 3589.140329 (0.783868) 989.615144 (0.216132) 0.018327 3.56209 0.0279564 0.0186168
GpuMemcpyAsync(same_gpu):GPU->GPU 163782 2501.24 1511.623888 (0.604350) 989.615144 (0.395650) 0.009447 2.92806 0.0152718 0.0101698
thread0::transpose2 4044 3750.05 82.058179 (0.021882) 3667.990738 (0.978118) 0.019812 3.12334 0.927312 0.0152474
thread0::softmax 82902 2288.5 1646.105627 (0.719294) 642.397115 (0.280706) 0.019016 105.769 0.0276049 0.00930484
thread0::conditional_block_infer 4044 1625.12 1617.305474 (0.995189) 7.818137 (0.004811) 0.03039 3.53453 0.40186 0.0066076
GpuMemcpyAsync:GPU->CPU 4044 802.738 794.919437 (0.990261) 7.818137 (0.009739) 0.015957 0.811556 0.198501 0.00326386
fill_constant 8088 115.659 115.659384 (1.000000) 0.000000 (0.000000) 0.008251 0.052285 0.0143001 0.00047026
unsqueeze2 6066 77.035 77.034982 (1.000000) 0.000000 (0.000000) 0.008624 0.037673 0.0126995 0.000313217
set_value 2022 58.2604 58.260425 (1.000000) 0.000000 (0.000000) 0.022208 0.084059 0.0288133 0.000236882
concat 2022 39.7301 39.730066 (1.000000) 0.000000 (0.000000) 0.013515 0.050971 0.0196489 0.000161539
tril_triu 2022 38.555 38.555046 (1.000000) 0.000000 (0.000000) 0.015954 0.050889 0.0190678 0.000156761
scale 2022 26.9411 26.941111 (1.000000) 0.000000 (0.000000) 0.009992 0.034922 0.013324 0.00010954
assign 2022 24.6444 24.644380 (1.000000) 0.000000 (0.000000) 0.008514 0.030619 0.0121881 0.000100202
thread0::GpuMemcpySync:CPU->GPU 407 474.209 240.353833 (0.506852) 233.854844 (0.493148) 0.016517 253.073 1.16513 0.00192809
thread0::fc 2022 318.223 112.565244 (0.353731) 205.657742 (0.646269) 0.138536 0.833114 0.15738 0.00129387
thread0::lookup_table_v2 6066 178.352 90.775715 (0.508971) 87.575820 (0.491029) 0.023212 1.00001 0.0294018 0.000725161
thread0::scale 8088 167.512 138.797163 (0.828579) 28.715087 (0.171421) 0.013476 0.283314 0.0207112 0.000681089
thread0::range 2022 152.811 138.950435 (0.909298) 13.860254 (0.090702) 0.06809 0.146151 0.075574 0.000621314
GpuMemcpySync:GPU->CPU 4044 85.6418 78.242983 (0.913608) 7.398799 (0.086392) 0.018069 0.040667 0.0211775 0.000348212
thread0::elementwise_sub 2022 112.445 102.259070 (0.909416) 10.185748 (0.090584) 0.048334 0.146846 0.0556107 0.00045719
GpuMemcpySync:CPU->GPU 2022 38.93 35.577386 (0.913882) 3.352580 (0.086118) 0.016067 0.028943 0.0192532 0.000158286
thread0::fill_constant 6066 104.941 98.949840 (0.942907) 5.991448 (0.057093) 0.00667 0.60983 0.0172999 0.000426682
thread0::GpuMemcpyAsync:CPU->GPU 8088 100.806 87.153560 (0.864571) 13.652009 (0.135429) 0.004482 0.615412 0.0124636 0.000409866
thread0::select_input 2022 79.2093 75.291260 (0.950536) 3.918028 (0.049464) 0.034303 1.35978 0.0391737 0.000322058
GpuMemcpySync:GPU->CPU 2022 57.9079 53.989840 (0.932340) 3.918028 (0.067660) 0.025363 1.34757 0.0286389 0.000235448
thread0::reshape2 2022 68.883 60.470779 (0.877877) 8.412180 (0.122123) 0.027323 3.01314 0.0340667 0.000280072
GpuMemcpyAsync(same_gpu):GPU->GPU 2022 29.2434 20.831195 (0.712339) 8.412180 (0.287661) 0.012459 0.027904 0.0144626 0.000118901
thread0::squeeze2 2022 61.7655 53.694293 (0.869325) 8.071191 (0.130675) 0.025236 0.071372 0.0305467 0.000251133
GpuMemcpyAsync(same_gpu):GPU->GPU 2022 29.6225 21.551295 (0.727532) 8.071191 (0.272468) 0.012501 0.027986 0.0146501 0.000120442
thread0::cast 2022 56.1319 49.830211 (0.887734) 6.301711 (0.112266) 0.021669 0.078868 0.0277606 0.000228227
thread0::greater_than 2022 48.5322 40.577730 (0.836099) 7.954461 (0.163901) 0.019102 0.078796 0.0240021 0.000197327
thread0::logical_not 2022 43.9498 36.812535 (0.837603) 7.137301 (0.162397) 0.017222 0.057402 0.0217358 0.000178696
thread0::expand_as 2022 42.5417 35.653199 (0.838077) 6.888478 (0.161923) 0.016457 0.07021 0.0210394 0.000172971
thread0::shape 4044 34.1966 34.196558 (1.000000) 0.000000 (0.000000) 0.004634 0.267494 0.00845612 0.00013904
-------------------------> Memory Profiling Report <-------------------------
Event Alloc Calls Size(MB) Free Calls Size(MB)
Place(cpu):conditional_block_infer/assign/compute 7 0.192383 6 0.153198
Place(cpu):range/compute 4044 0.0154266 4044 0.0154266
Place(cpu):conditional_block_infer/scale/compute 2022 43.5502 0 0
Place(cpu):conditional_block_infer/tril_triu/compute 2022 8.14453 0 0
Place(cpu):conditional_block_infer/concat/compute 2022 43.5502 0 0
Place(cpu):conditional_block_infer/fill_constant/compute8088 43.8073 0 0
Place(cpu):fill_constant/compute 2 1.14441e-05 0 0
Place(cpu):select_input 2029 0.200096 2028 0.160912
Place(cpu):conditional_block_infer/unsqueeze2/compute 6066 43.7996 0 0
Place(cpu):shape/compute 2 2.67029e-05 0 0
Place(cpu):conditional_block_infer/set_value/compute 4044 35.655 4044 35.655
Place(cpu):Unknown 407 2071.11 1462 52052.5
Place(cpu):slice/compute 6068 0.0231476 6066 0.02314
Place(cpu):load_combine/compute 1047 49981.3 0 0
Place(cpu):conditional_block_infer 4044 0.00385666 24264 182.856
- Which version of the paddle inference library are you using?
- I see you have used debug mode (config.switch_ir_debug()). Could you use visualdl or netron to look at the graph after all passes have run, i.e. the xx_xxx.pdmodel with the largest number in the "_opt_cache" directory? (One way to open it is sketched after this list.)
- You can also follow this document and use NVIDIA Nsight Systems to visualize the timeline and confirm which operators the GpuMemcpyAsync:GPU->CPU copies mainly come from; the profile above cannot tell us that.
If you run into problems using these tools, feel free to add me on QQ or WeChat.
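For reference, opening the optimized graph could look like this (a minimal sketch, assuming netron is installed via pip install netron; the file name is only a placeholder for the largest-numbered .pdmodel under _opt_cache):

import netron

# serves the graph in the browser; point it at the largest-numbered
# *.pdmodel inside <model_dir>/_opt_cache
netron.start("exported_model/_opt_cache/27_some_pass.pdmodel")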
Thanks. The Paddle version is 2.3.0, and the largest-numbered file in "_opt_cache" is 27_ir_transpose_flatten_concat_fuse_pass.dot. I tried NVIDIA Nsight Systems before and it can visualize the compute time, but I found it a bit hard to read; since I already knew which line of code produces the set_value, I didn't dig further. I can give it another try. Could I ask for your contact information?
No need to look at the .dot; visualize the .pdmodel instead. That is the model graph saved after the passes have run.
Oh, I misread; I thought you were asking how many .pdmodel files there are, which is why I pasted the numbered file. Could I add you as a contact? It would be more efficient.
xxx
Great, thanks.
In my model, Paddle Inference produced a large number of copies between CPU and GPU memory when running inference with TensorRT; this can be resolved by updating Paddle Inference to the latest develop build.