Paddle Inference: run the model with all parameters and OPs placed on the GPU
Please ask your question
In Paddle Inference, configuring paddle.inference.Config.enable_use_gpu() does enable GPU inference, but after turning on config.enable_profile() we found quite a lot of communication between the CPU and the GPU, and some variables also end up on the CPU. Is there a way to force Paddle Inference to place all compute nodes and parameters on the GPU?
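For reference, the GPU and profiling switches are set roughly like this (a minimal sketch; the memory-pool size, device id and file names are placeholders, and the full predictor-building code is posted further down in this thread):

import paddle.inference as paddle_infer

config = paddle_infer.Config("model.pdmodel", "model.pdiparams")
config.enable_use_gpu(1024, 0)   # initial GPU memory pool in MB, device id
config.enable_profile()          # emits the op-level profiling report shown below
predictor = paddle_infer.create_predictor(config)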
Partial performance profile:
Total time: 566354
Computation time Total: 151833 Ratio: 26.8089%
Framework overhead Total: 414521 Ratio: 73.1911%
------------------------- GpuMemCpy Summary -------------------------
GpuMemcpy Calls: 453535 Total: 211768 Ratio: 37.3914%
GpuMemcpyAsync Calls: 343740 Total: 14426.7 Ratio: 2.5473%
GpuMemcpySync Calls: 109795 Total: 197341 Ratio: 34.8441%
------------------------- Event Summary -------------------------
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::tensorrt_engine 84924 252365 66345.006053 (0.262893) 186019.876612 (0.737107)0.378767 111.662 2.97166 0.445596
thread0::set_value 2022 121091 111790.708466 (0.923196)9300.321439 (0.076804) 2.89463 139.663 59.8868 0.213808
GpuMemcpySync:GPU->CPU 4044 111448 111440.569877 (0.999933)7.513910 (0.000067) 0.019379 110.985 27.5589 0.196782
GpuMemcpySync:GPU->CPU 4044 79.3581 72.347765 (0.911662) 7.010364 (0.088338) 0.016734 3.1661 0.0196237 0.000140121
thread0::conditional_block_infer 4044 85390.8 43963.567256 (0.514851) 41427.265607 (0.485149) 0.027776 144.553 21.1154 0.150773
shape 4044 83446.5 42027.031341 (0.503640) 41419.479982 (0.496360) 0.039785 143.98 20.6346 0.14734
GpuMemcpySync:GPU->CPU 4044 83350.1 41930.647902 (0.503066) 41419.479982 (0.496934) 0.02129 143.95 20.6108 0.14717
GpuMemcpyAsync:GPU->CPU 4044 765.706 757.920346 (0.989832) 7.785625 (0.010168) 0.015516 0.891516 0.189344 0.00135199
fill_constant 8088 112.666 112.666446 (1.000000) 0.000000 (0.000000) 0.007712 0.065727 0.0139301 0.000198933
slice 4044 93.8496 93.849621 (1.000000) 0.000000 (0.000000) 0.01573 0.14972 0.0232071 0.000165708
unsqueeze2 6066 84.1806 84.180586 (1.000000) 0.000000 (0.000000) 0.007983 0.075986 0.0138774 0.000148636
set_value 2022 58.0157 58.015651 (1.000000) 0.000000 (0.000000) 0.020641 0.100874 0.0286922 0.000102437
concat 2022 44.041 44.041028 (1.000000) 0.000000 (0.000000) 0.012897 0.069434 0.0217809 7.77624e-05
tril_triu 2022 28.3607 28.360736 (1.000000) 0.000000 (0.000000) 0.0089 0.048696 0.0140261 5.0076e-05
scale 2022 27.0725 27.072541 (1.000000) 0.000000 (0.000000) 0.009676 0.048194 0.013389 4.78015e-05
assign 2022 25.7438 25.743849 (1.000000) 0.000000 (0.000000) 0.00796 0.041135 0.0127319 4.54554e-05
thread0::load_combine 1 34260.8 34260.825306 (1.000000) 0.000000 (0.000000) 34260.8 34260.8 34260.8 0.0604937
thread0::elementwise_add 82902 32386.1 31902.488146 (0.985066) 483.641839 (0.014934) 0.041437 7.43305 0.390656 0.0571836
GpuMemcpySync:CPU->GPU 82902 1662.73 1519.967576 (0.914141) 142.759372 (0.085859) 0.015683 3.60349 0.0200565 0.00293585
By setting GLOG_vmodule=operator=3 we can locate the nodes (and the corresponding code) that produce the communication. But even though we know which line of code produces the set_value op, we don't know why there is a GPU->CPU memory copy, nor whether there is a way to configure the whole graph to run on the GPU.
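This is roughly how we turn that logging on (a minimal sketch; setting the variable before paddle is imported reflects our assumption about when the GLOG flags are read, and exporting GLOG_vmodule=operator=3 in the shell before launching the script works as well):

import os

# GLOG flags are parsed when the native core initializes,
# so set them before importing paddle
os.environ["GLOG_vmodule"] = "operator=3"

import paddle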
Partial memory-transfer log; mainly, when running Paddle Inference with TensorRT int8, some nodes still live on the CPU.
I0807 08:28:01.381105 15684 operator.cc:1894] Transform Variable _generated_var_4 from data_type[float]:data_layout[NCHW]:place[Place(cpu)]:library_type[PLAIN] to data_type[float]:data_layout[Undefined(AnyLayout)]:place[Place(gpu:1)]:library_type[PLAIN]
I0807 08:28:01.381165 15684 operator.cc:277] Place(gpu:1) Op(elementwise_add), inputs:{X[matmul_39.tmp_0:float[1, 40, 64, 65]({})(Place(gpu:1))], Y[_generated_var_4:float[1, 1, 64, 65]({})(Place(cpu))]}, outputs:{Out[tmp_126:float[1, 40, 64, 65]({})(Place(gpu:1))]}.
I0807 08:28:01.384057 15684 operator.cc:277] Place(gpu:1) Op(shape), inputs:{Input[stack_40.tmp_0:float[40, 2, 40, 65, 128]({})(Place(gpu:1))]}, outputs:{Out[shape_6.tmp_0:int[5]({})(Place(cpu))]}.
I0807 08:28:01.384088 15684 operator.cc:277] Place(cpu) Op(slice), inputs:{EndsTensor[], EndsTensorList[], Input[shape_6.tmp_0:int[5]({})(Place(cpu))], StartsTensor[], StartsTensorList[]}, outputs:{Out[shape_6.tmp_0_slice_0:int[1]({})(Place(cpu))]}.
I0807 08:28:01.384120 15684 operator.cc:277] Place(gpu:1) Op(scale), inputs:{ScaleTensor[], X[user_id:int[1]({})(Place(gpu:1))]}, outputs:{Out[tmp_131:int[1]({})(Place(gpu:1))]}.
I0807 08:28:01.384141 15684 operator.cc:277] Place(cpu) Op(fill_constant), inputs:{ShapeTensor[], ShapeTensorList[], ValueTensor[]}, outputs:{Out[fill_constant_27.tmp_0:int[1]({})(Place(cpu))]}.
I0807 08:28:01.419982 15684 operator.cc:277] Place(gpu:1) Op(set_value), inputs:{EndsTensorList[tmp_131:int[1]({})(Place(gpu:1)), shape_6.tmp_0_slice_0:int[1]({})(Place(cpu))], Input[caches_rank_0:float[1, 40, 2, 40, 513, 128]({})(Place(gpu:1))], StartsTensorList[user_id:int[1]({})(Place(gpu:1)), fill_constant_27.tmp_0:int[1]({})(Place(cpu))], StepsTensorList[], ValueTensor[stack_40.tmp_0:float[40, 2, 40, 65, 128]({})(Place(gpu:1))]}, outputs:{Out[caches_rank_0:float[1, 40, 2, 40, 513, 128]({})(Place(gpu:1))]}.
Hi! We've received your issue and will arrange for technicians to answer it as soon as possible; please be patient. Please double-check that you have provided a clear problem description, reproduction code, environment & version, and error messages. You can also look through the API docs, FAQ, existing GitHub issues, and the AI community for an answer. Have a nice day!
Could you provide the code you use to run inference?
The two functions below are roughly the code that builds the predictor and runs it; for real inference we enable args.use_trt and args.int8.
def create_predictor(self, args):
    # pick the fp32 *.pdmodel in model_dir (int8-prefixed exports are skipped)
    model_files = os.listdir(args.model_dir)
    model_name = None
    for file in model_files:
        if file.endswith(".pdmodel") and not file.startswith("int8"):
            model_name = file[:-len(".pdmodel")]
            break
    if model_name is None or not os.path.exists(os.path.join(args.model_dir, f"{model_name}.pdiparams")):
        raise ValueError(f"{args.model_dir} is not valid")
    config = paddle.inference.Config(os.path.join(args.model_dir, f"{model_name}.pdmodel"),
                                     os.path.join(args.model_dir, f"{model_name}.pdiparams"))
    config.switch_ir_optim(True)
    if self.enable_profile:
        config.enable_profile()
    if self.debug_ir:
        config.switch_ir_debug()
    config.set_optim_cache_dir(os.path.join(args.model_dir, "_opt_cache"))
    # set GPU configs accordingly
    config.enable_use_gpu(1024 * 30, args.gpu_id)
    config.enable_memory_optim(False)
    paddle.device.set_device("gpu")
    if self.collect_shape:
        config.collect_shape_range_info(os.path.join(args.model_dir, "shape_range_info.pbtxt"))
    # config.exp_enable_use_gpu_fp16()  # fp16 was not supported for the quantize op
    if args.use_trt:
        if args.int8:
            config.enable_tensorrt_engine(
                workspace_size=1 << 30,
                precision_mode=inference.PrecisionType.Int8,
                max_batch_size=1,
                min_subgraph_size=3,
                use_static=True,  # True to export the optimized engine/config cache
                use_calib_mode=False)
        else:
            config.enable_tensorrt_engine(
                workspace_size=1 << 30,
                precision_mode=inference.PrecisionType.Float32,
                max_batch_size=1,
                min_subgraph_size=4,
                use_static=False,
                use_calib_mode=False)
        config.enable_tuned_tensorrt_dynamic_shape(os.path.join(args.model_dir, "shape_range_info.pbtxt"), True)
        # config.enable_tensorrt_oss()
    print("Enable TensorRT is: {}".format(config.tensorrt_engine_enabled()))
    # print("Enable TensorRT OSS is: {}".format(config.tensorrt_oss_enabled()))
    print(paddle.inference.Config.summary(config), flush=True)
    # paddle.inference.Config.disable_glog_info(config)
    # with paddle.fluid.device_guard("gpu"):
    predictor = paddle.inference.create_predictor(config)
    input_handles = [
        predictor.get_input_handle(name)
        for name in predictor.get_input_names()
    ]
    output_handles = [
        predictor.get_output_handle(name)
        for name in predictor.get_output_names()
    ]
    self.input_handles = input_handles
    self.output_handles = output_handles
    return predictor


def predict_batch(self, data):
    # feed the four inputs (each copy_from_cpu is a host -> device copy)
    self.input_handles[0].copy_from_cpu(data[0])
    self.input_handles[1].copy_from_cpu(data[1])
    self.input_handles[2].copy_from_cpu(data[2])
    self.input_handles[3].copy_from_cpu(data[3])
    self.predictor.run()
    # output = [
    #     output_handle.copy_to_cpu() for output_handle in self.output_handles
    # ]
    # only the first output is fetched (device -> host copy)
    output = [self.output_handles[0].copy_to_cpu()]
    return output
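For context, a hedged sketch of how these two functions might be wired together and called (the wrapper class, its flag attributes and the argument names are inferred from the excerpt above; paths, shapes and dtypes are placeholders):

from types import SimpleNamespace

import numpy as np


class Runner:
    # reuse the two functions above as methods of a small wrapper class
    create_predictor = create_predictor
    predict_batch = predict_batch

    def __init__(self, args):
        # flags read inside create_predictor()
        self.enable_profile = True
        self.debug_ir = False
        self.collect_shape = False
        # predict_batch() reads self.predictor, so keep the return value here
        self.predictor = self.create_predictor(args)


args = SimpleNamespace(model_dir="exported_model", gpu_id=0, use_trt=True, int8=True)
runner = Runner(args)
# four inputs; the real shapes and dtypes depend on the exported model
outputs = runner.predict_batch([np.zeros([1, 1], dtype="int64") for _ in range(4)])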
Some ops are currently not supported for conversion to TRT and fall back to the native Paddle operators. We would need the concrete model to analyze whether everything can be placed inside TRT.
By the way: is the current TRT int8 performance not meeting your needs? What GPU is this running on, how fast does it need to be, and what is your business scenario?
Right. We do quantization-aware training and then run inference with TRT; TRT does run some INT8 subgraphs, and on an A100 it is quite fast. Looking at the pure compute time alone, I think that part is fine. The problem is outside the TRT subgraphs: the device/host memory copies of the native Paddle operators take up too much time. If we removed the code that we know produces a lot of CPU-GPU copies, we would get a large speedup, but in practice it cannot be removed. So we'd like to know whether Inference can move all native Paddle OPs onto the GPU instead of leaving them on the CPU.
By the way: is the current TRT int8 performance not meeting your needs? What GPU is this running on, how fast does it need to be, and what is your business scenario?
On an A100 the pure compute performance is sufficient, but in practice there are also conditional_block and set_value ops that currently cannot be converted into the TRT subgraph, so we need to optimize that part. Our business scenario is code generation; a large NLP model has to keep single-step decoding at roughly 30 ms.
Could you provide the model and a simple runnable demo? Whether all ops can go into the TRT subgraph has to be checked against the specific ops before I can give you an answer. Also, by "pure compute time" do you mean timing the whole predictor.run() (without counting copy_from_cpu / copy_to_cpu), or adding up the per-op compute times from the profile?
Could you provide the model and a simple runnable demo? Whether all ops can go into the TRT subgraph has to be checked against the specific ops before I can give you an answer.
The full model is very large (tens of billions of parameters), so it is hard to provide a runnable demo; if it is really needed, we can initialize a small model with a few tens of millions of parameters for testing. Our impression is that the ops doing most of the computation are already inside TRT, and the native Paddle ops do not necessarily have to go in, because even if they did the gain would probably not be large. As for the per-op timings, two complete profiles are attached below.
By "pure compute time" do you mean timing the whole predictor.run() (without counting copy_from_cpu / copy_to_cpu), or adding up the per-op compute times from the profile?
Neither, really; I just looked at the tensorrt_engine time in the profile. It covers the backbone of the model (in the earlier quantization-aware training step only the matmul ops were quantized) and is the dominant cost; the remaining time is mostly GpuMemcpySync.
Appendix 1: the intermediate computation state is kept inside the graph, i.e. a large tensor is written via set_value into an internal Parameter variable so that this Parameter tensor can be reused the next time the graph is invoked.
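The pattern is roughly the following (a hedged sketch, not the real model code; the layer name, shapes and initializer are made up, and the claim that the sliced assignment lowers to the set_value op after dynamic-to-static export is our reading of the operator log above):

import paddle


class CachedDecoder(paddle.nn.Layer):
    def __init__(self, cache_shape):
        super().__init__()
        # persistent buffer kept inside the graph between calls
        self.caches = self.create_parameter(
            shape=cache_shape, dtype="float32",
            default_initializer=paddle.nn.initializer.Constant(0.0))

    def forward(self, new_state, start, end):
        # sliced assignment into the Parameter; in the exported static
        # graph this is what shows up as the set_value op in the profile
        self.caches[:, start:end] = new_state
        return self.caches

In the real model the slice bounds are themselves tensors (user_id and a slice of a shape output), which matches the StartsTensorList/EndsTensorList inputs of the set_value op in the log and is where the GPU->CPU copies of the index tensors appear.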
------------------------- Overhead Summary -------------------------
Total time: 566354
Computation time Total: 151833 Ratio: 26.8089%
Framework overhead Total: 414521 Ratio: 73.1911%
------------------------- GpuMemCpy Summary -------------------------
GpuMemcpy Calls: 453535 Total: 211768 Ratio: 37.3914%
GpuMemcpyAsync Calls: 343740 Total: 14426.7 Ratio: 2.5473%
GpuMemcpySync Calls: 109795 Total: 197341 Ratio: 34.8441%
------------------------- Event Summary -------------------------
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::tensorrt_engine 84924 252365 66345.006053 (0.262893) 186019.876612 (0.737107)0.378767 111.662 2.97166 0.445596
thread0::set_value 2022 121091 111790.708466 (0.923196)9300.321439 (0.076804) 2.89463 139.663 59.8868 0.213808
GpuMemcpySync:GPU->CPU 4044 111448 111440.569877 (0.999933)7.513910 (0.000067) 0.019379 110.985 27.5589 0.196782
GpuMemcpySync:GPU->CPU 4044 79.3581 72.347765 (0.911662) 7.010364 (0.088338) 0.016734 3.1661 0.0196237 0.000140121
thread0::conditional_block_infer 4044 85390.8 43963.567256 (0.514851) 41427.265607 (0.485149) 0.027776 144.553 21.1154 0.150773
shape 4044 83446.5 42027.031341 (0.503640) 41419.479982 (0.496360) 0.039785 143.98 20.6346 0.14734
GpuMemcpySync:GPU->CPU 4044 83350.1 41930.647902 (0.503066) 41419.479982 (0.496934) 0.02129 143.95 20.6108 0.14717
GpuMemcpyAsync:GPU->CPU 4044 765.706 757.920346 (0.989832) 7.785625 (0.010168) 0.015516 0.891516 0.189344 0.00135199
fill_constant 8088 112.666 112.666446 (1.000000) 0.000000 (0.000000) 0.007712 0.065727 0.0139301 0.000198933
slice 4044 93.8496 93.849621 (1.000000) 0.000000 (0.000000) 0.01573 0.14972 0.0232071 0.000165708
unsqueeze2 6066 84.1806 84.180586 (1.000000) 0.000000 (0.000000) 0.007983 0.075986 0.0138774 0.000148636
set_value 2022 58.0157 58.015651 (1.000000) 0.000000 (0.000000) 0.020641 0.100874 0.0286922 0.000102437
concat 2022 44.041 44.041028 (1.000000) 0.000000 (0.000000) 0.012897 0.069434 0.0217809 7.77624e-05
tril_triu 2022 28.3607 28.360736 (1.000000) 0.000000 (0.000000) 0.0089 0.048696 0.0140261 5.0076e-05
scale 2022 27.0725 27.072541 (1.000000) 0.000000 (0.000000) 0.009676 0.048194 0.013389 4.78015e-05
assign 2022 25.7438 25.743849 (1.000000) 0.000000 (0.000000) 0.00796 0.041135 0.0127319 4.54554e-05
thread0::load_combine 1 34260.8 34260.825306 (1.000000) 0.000000 (0.000000) 34260.8 34260.8 34260.8 0.0604937
thread0::elementwise_add 82902 32386.1 31902.488146 (0.985066) 483.641839 (0.014934) 0.041437 7.43305 0.390656 0.0571836
GpuMemcpySync:CPU->GPU 82902 1662.73 1519.967576 (0.914141) 142.759372 (0.085859) 0.015683 3.60349 0.0200565 0.00293585
thread0::GpuMemcpyAsync:GPU->CPU 2022 9077.49 9056.259504 (0.997661) 21.234069 (0.002339) 0.064273 6.70269 4.48936 0.016028
thread0::slice 254772 7616.31 4807.662232 (0.631232) 2808.648799 (0.368768) 0.007288 3.65566 0.0298946 0.013448
GpuMemcpySync:GPU->CPU 6066 139.139 127.823858 (0.918676) 11.315296 (0.081324) 0.017448 0.790898 0.0229375 0.000245675
thread0::squeeze2 163782 4891.37 3940.909905 (0.805686) 950.463470 (0.194314) 0.016567 6.85692 0.0298651 0.0086366
GpuMemcpyAsync(same_gpu):GPU->GPU 163782 2239.14 1288.679179 (0.575523) 950.463470 (0.424477) 0.008636 3.26116 0.0136715 0.00395361
thread0::unsqueeze2 163782 4343.96 3403.626653 (0.783531) 940.333076 (0.216469) 0.017083 8.78246 0.0265228 0.00767005
GpuMemcpyAsync(same_gpu):GPU->GPU 163782 2255.77 1315.438264 (0.583143) 940.333076 (0.416857) 0.008859 3.24548 0.013773 0.00398297
thread0::matmul_v2 82902 4255.83 1991.024797 (0.467834) 2264.807522 (0.532166) 0.021865 3.62266 0.0513357 0.00751444
thread0::transpose2 4044 3764.2 95.015729 (0.025242) 3669.189164 (0.974758) 0.019019 3.5454 0.930812 0.00664638
thread0::matmul 80880 2680.44 2204.673621 (0.822504) 475.768251 (0.177496) 0.025708 3.17326 0.033141 0.00473281
thread0::softmax 82902 1998.47 1419.676089 (0.710383) 578.789176 (0.289617) 0.015852 3.7436 0.0241064 0.00352865
thread0::GpuMemcpySync:CPU->GPU 607 479.639 244.608785 (0.509985) 235.030365 (0.490015) 0.017171 254.206 0.79018 0.00084689
thread0::fc 2022 344.011 145.645139 (0.423374) 198.365822 (0.576626) 0.132844 0.953499 0.170134 0.000607414
thread0::scale 12132 343.638 295.702562 (0.860505) 47.935665 (0.139495) 0.013057 2.13108 0.0283249 0.000606755
GpuMemcpySync:CPU->GPU 2022 40.326 36.942696 (0.916101) 3.383310 (0.083899) 0.016939 0.159838 0.0199436 7.12029e-05
thread0::lookup_table_v2 6066 174.207 98.466242 (0.565227) 75.740396 (0.434773) 0.020771 0.433557 0.0287185 0.000307593
thread0::shape 6066 166.182 166.182457 (1.000000) 0.000000 (0.000000) 0.005424 2.57507 0.0273957 0.000293425
thread0::range 2022 150.713 136.840719 (0.907956) 13.872293 (0.092044) 0.062539 0.214579 0.0745366 0.000266111
GpuMemcpySync:GPU->CPU 4044 83.7343 76.447406 (0.912976) 7.286931 (0.087024) 0.017466 0.165231 0.0207058 0.000147848
thread0::fill_constant 8088 136.177 130.178838 (0.955950) 5.998614 (0.044050) 0.005876 1.22548 0.016837 0.000240446
thread0::select_input 2022 77.5685 73.690682 (0.950008) 3.877838 (0.049992) 0.03227 0.148722 0.0383623 0.000136961
GpuMemcpySync:GPU->CPU 2022 57.8582 53.980349 (0.932977) 3.877838 (0.067023) 0.023765 0.140111 0.0286143 0.000102159
thread0::cast 2022 69.7333 63.294340 (0.907663) 6.438937 (0.092337) 0.024845 0.120938 0.0344873 0.000123127
thread0::reshape2 2022 65.655 58.275004 (0.887594) 7.380003 (0.112406) 0.023748 0.78456 0.0324703 0.000115926
GpuMemcpyAsync(same_gpu):GPU->GPU 2022 26.42 19.039993 (0.720666) 7.380003 (0.279334) 0.010362 0.047188 0.0130663 4.66493e-05
thread0::greater_than 2022 62.3135 54.283254 (0.871132) 8.030226 (0.128868) 0.019128 1.22689 0.0308177 0.000110026
thread0::GpuMemcpyAsync:CPU->GPU 8088 62.2141 48.671651 (0.782325) 13.542435 (0.217675) 0.004524 1.16102 0.00769215 0.00010985
thread0::logical_not 2022 51.6026 44.464291 (0.861667) 7.138326 (0.138333) 0.015795 0.999258 0.0255206 9.11138e-05
thread0::expand_as 2022 48.2833 40.616640 (0.841214) 7.666709 (0.158786) 0.016333 0.070172 0.023879 8.5253e-05
-------------------------> Memory Profiling Report <-------------------------
Event Alloc Calls Size(MB) Free Calls Size(MB)
Place(cpu):set_value/compute 4044 0.0154266 4044 0.0154266
Place(cpu):set_value/infer_shape 4044 0.0154266 4044 0.0154266
Place(cpu):range/compute 4044 0.0154266 4044 0.0154266
Place(cpu):conditional_block_infer 4044 0.00385666 36396 395331
Place(cpu):load_combine/compute 1247 49981.3 0 0
Place(cpu):slice/compute 6069 0.0231514 6066 0.02314
Place(cpu):Unknown 607 2071.11 1865 52052.4
Place(cpu):conditional_block_infer/set_value/compute 4044 0.982956 4044 0.982956
Place(cpu):conditional_block_infer/shape/compute 8088 395325 0 0
Place(cpu):conditional_block_infer/fill_constant/compute8088 1.32643 0 0
Place(cpu):conditional_block_infer/concat/compute 2022 1.30595 0 0
Place(cpu):fill_constant/compute 3 1.52588e-05 0 0
Place(cpu):conditional_block_infer/slice/compute 4044 0.0154266 0 0
Place(cpu):conditional_block_infer/unsqueeze2/compute 6066 1.31871 0 0
Place(cpu):select_input 2023 0.0235825 2022 0.00771332
Place(cpu):shape/compute 3 4.57764e-05 0 0
Place(cpu):conditional_block_infer/tril_triu/compute 2022 0.335758 0 0
Place(cpu):conditional_block_infer/scale/compute 2022 1.30595 0 0
Place(cpu):conditional_block_infer/assign/compute 1 0.0158691 0 0
Appendix 2: the intermediate computation state is not kept in the graph and is recomputed on every call.
-------------------------> Profiling Report <-------------------------
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
------------------------- Overhead Summary -------------------------
Total time: 245947
Computation time Total: 53450.3 Ratio: 21.7324%
Framework overhead Total: 192497 Ratio: 78.2676%
------------------------- GpuMemCpy Summary -------------------------
GpuMemcpy Calls: 279443 Total: 11978.8 Ratio: 4.87049%
GpuMemcpyAsync Calls: 181980 Total: 9003.71 Ratio: 3.66083%
GpuMemcpySync Calls: 97463 Total: 2975.13 Ratio: 1.20966%
------------------------- Event Summary -------------------------
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::tensorrt_engine 82902 143176 63242.445148 (0.441712) 79933.290073 (0.558288) 0.372081 109.478 1.72705 0.582139
thread0::elementwise_add 82902 36840.2 36198.357173 (0.982578) 641.811389 (0.017422) 0.039618 6.56421 0.444382 0.149789
GpuMemcpySync:CPU->GPU 82902 2179 1925.564775 (0.883693) 253.432582 (0.116307) 0.015005 3.17831 0.026284 0.0088596
thread0::load_combine 1 26269.4 26269.424771 (1.000000) 0.000000 (0.000000) 26269.4 26269.4 26269.4 0.106809
thread0::slice 252750 8100.39 4986.685972 (0.615610) 3113.706453 (0.384390) 0.006148 110.774 0.032049 0.0329355
GpuMemcpySync:GPU->CPU 6066 139.444 128.132557 (0.918881) 11.311611 (0.081119) 0.01784 0.33543 0.0229878 0.000566967
thread0::matmul_v2 82902 6353.23 3322.985853 (0.523039) 3030.239454 (0.476961) 0.040063 3.06378 0.0766354 0.0258316
thread0::GpuMemcpyAsync:GPU->CPU 2022 5540.07 5518.924731 (0.996184) 21.141026 (0.003816) 0.260396 3.6533 2.73989 0.0225254
thread0::matmul 80880 5381.54 4234.615552 (0.786877) 1146.928714 (0.213123) 0.055669 3.19345 0.0665374 0.0218809
thread0::unsqueeze2 163782 4578.76 3589.140329 (0.783868) 989.615144 (0.216132) 0.018327 3.56209 0.0279564 0.0186168
GpuMemcpyAsync(same_gpu):GPU->GPU 163782 2501.24 1511.623888 (0.604350) 989.615144 (0.395650) 0.009447 2.92806 0.0152718 0.0101698
thread0::transpose2 4044 3750.05 82.058179 (0.021882) 3667.990738 (0.978118) 0.019812 3.12334 0.927312 0.0152474
thread0::softmax 82902 2288.5 1646.105627 (0.719294) 642.397115 (0.280706) 0.019016 105.769 0.0276049 0.00930484
thread0::conditional_block_infer 4044 1625.12 1617.305474 (0.995189) 7.818137 (0.004811) 0.03039 3.53453 0.40186 0.0066076
GpuMemcpyAsync:GPU->CPU 4044 802.738 794.919437 (0.990261) 7.818137 (0.009739) 0.015957 0.811556 0.198501 0.00326386
fill_constant 8088 115.659 115.659384 (1.000000) 0.000000 (0.000000) 0.008251 0.052285 0.0143001 0.00047026
unsqueeze2 6066 77.035 77.034982 (1.000000) 0.000000 (0.000000) 0.008624 0.037673 0.0126995 0.000313217
set_value 2022 58.2604 58.260425 (1.000000) 0.000000 (0.000000) 0.022208 0.084059 0.0288133 0.000236882
concat 2022 39.7301 39.730066 (1.000000) 0.000000 (0.000000) 0.013515 0.050971 0.0196489 0.000161539
tril_triu 2022 38.555 38.555046 (1.000000) 0.000000 (0.000000) 0.015954 0.050889 0.0190678 0.000156761
scale 2022 26.9411 26.941111 (1.000000) 0.000000 (0.000000) 0.009992 0.034922 0.013324 0.00010954
assign 2022 24.6444 24.644380 (1.000000) 0.000000 (0.000000) 0.008514 0.030619 0.0121881 0.000100202
thread0::GpuMemcpySync:CPU->GPU 407 474.209 240.353833 (0.506852) 233.854844 (0.493148) 0.016517 253.073 1.16513 0.00192809
thread0::fc 2022 318.223 112.565244 (0.353731) 205.657742 (0.646269) 0.138536 0.833114 0.15738 0.00129387
thread0::lookup_table_v2 6066 178.352 90.775715 (0.508971) 87.575820 (0.491029) 0.023212 1.00001 0.0294018 0.000725161
thread0::scale 8088 167.512 138.797163 (0.828579) 28.715087 (0.171421) 0.013476 0.283314 0.0207112 0.000681089
thread0::range 2022 152.811 138.950435 (0.909298) 13.860254 (0.090702) 0.06809 0.146151 0.075574 0.000621314
GpuMemcpySync:GPU->CPU 4044 85.6418 78.242983 (0.913608) 7.398799 (0.086392) 0.018069 0.040667 0.0211775 0.000348212
thread0::elementwise_sub 2022 112.445 102.259070 (0.909416) 10.185748 (0.090584) 0.048334 0.146846 0.0556107 0.00045719
GpuMemcpySync:CPU->GPU 2022 38.93 35.577386 (0.913882) 3.352580 (0.086118) 0.016067 0.028943 0.0192532 0.000158286
thread0::fill_constant 6066 104.941 98.949840 (0.942907) 5.991448 (0.057093) 0.00667 0.60983 0.0172999 0.000426682
thread0::GpuMemcpyAsync:CPU->GPU 8088 100.806 87.153560 (0.864571) 13.652009 (0.135429) 0.004482 0.615412 0.0124636 0.000409866
thread0::select_input 2022 79.2093 75.291260 (0.950536) 3.918028 (0.049464) 0.034303 1.35978 0.0391737 0.000322058
GpuMemcpySync:GPU->CPU 2022 57.9079 53.989840 (0.932340) 3.918028 (0.067660) 0.025363 1.34757 0.0286389 0.000235448
thread0::reshape2 2022 68.883 60.470779 (0.877877) 8.412180 (0.122123) 0.027323 3.01314 0.0340667 0.000280072
GpuMemcpyAsync(same_gpu):GPU->GPU 2022 29.2434 20.831195 (0.712339) 8.412180 (0.287661) 0.012459 0.027904 0.0144626 0.000118901
thread0::squeeze2 2022 61.7655 53.694293 (0.869325) 8.071191 (0.130675) 0.025236 0.071372 0.0305467 0.000251133
GpuMemcpyAsync(same_gpu):GPU->GPU 2022 29.6225 21.551295 (0.727532) 8.071191 (0.272468) 0.012501 0.027986 0.0146501 0.000120442
thread0::cast 2022 56.1319 49.830211 (0.887734) 6.301711 (0.112266) 0.021669 0.078868 0.0277606 0.000228227
thread0::greater_than 2022 48.5322 40.577730 (0.836099) 7.954461 (0.163901) 0.019102 0.078796 0.0240021 0.000197327
thread0::logical_not 2022 43.9498 36.812535 (0.837603) 7.137301 (0.162397) 0.017222 0.057402 0.0217358 0.000178696
thread0::expand_as 2022 42.5417 35.653199 (0.838077) 6.888478 (0.161923) 0.016457 0.07021 0.0210394 0.000172971
thread0::shape 4044 34.1966 34.196558 (1.000000) 0.000000 (0.000000) 0.004634 0.267494 0.00845612 0.00013904
-------------------------> Memory Profiling Report <-------------------------
Event Alloc Calls Size(MB) Free Calls Size(MB)
Place(cpu):conditional_block_infer/assign/compute 7 0.192383 6 0.153198
Place(cpu):range/compute 4044 0.0154266 4044 0.0154266
Place(cpu):conditional_block_infer/scale/compute 2022 43.5502 0 0
Place(cpu):conditional_block_infer/tril_triu/compute 2022 8.14453 0 0
Place(cpu):conditional_block_infer/concat/compute 2022 43.5502 0 0
Place(cpu):conditional_block_infer/fill_constant/compute8088 43.8073 0 0
Place(cpu):fill_constant/compute 2 1.14441e-05 0 0
Place(cpu):select_input 2029 0.200096 2028 0.160912
Place(cpu):conditional_block_infer/unsqueeze2/compute 6066 43.7996 0 0
Place(cpu):shape/compute 2 2.67029e-05 0 0
Place(cpu):conditional_block_infer/set_value/compute 4044 35.655 4044 35.655
Place(cpu):Unknown 407 2071.11 1462 52052.5
Place(cpu):slice/compute 6068 0.0231476 6066 0.02314
Place(cpu):load_combine/compute 1047 49981.3 0 0
Place(cpu):conditional_block_infer 4044 0.00385666 24264 182.856
- Which version of the paddle inference library are you using?
- I see you have used debug mode (config.switch_ir_debug()). Could you use visualdl or netron to look at the graph after all passes have run, i.e. the xx_xxx.pdmodel with the largest number in the "_opt_cache" directory? (One way to open it is sketched after this list.)
- You can also follow this document and use NVIDIA Nsight Systems to visualize the timeline and confirm which operators the GpuMemcpyAsync:GPU->CPU copies mainly come from; the profile above cannot tell us that.
If you run into problems using these tools, feel free to add me on QQ or WeChat.
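For reference, opening the optimized graph could look like this (a minimal sketch, assuming netron is installed via pip install netron; the file name is only a placeholder for the largest-numbered .pdmodel under _opt_cache):

import netron

# serves the graph in the browser; point it at the largest-numbered
# *.pdmodel inside <model_dir>/_opt_cache
netron.start("exported_model/_opt_cache/27_some_pass.pdmodel")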
Thanks. The Paddle version is 2.3.0, and the largest-numbered file in "_opt_cache" is 27_ir_transpose_flatten_concat_fuse_pass.dot. I tried NVIDIA Nsight Systems before and it can visualize the compute time, but I found it a bit hard to read; since I already knew which line of code produces the set_value, I didn't dig further. I can give it another try. Could I ask for your contact information?
No need to look at the .dot; visualize the .pdmodel instead. That is the model graph saved after the passes have run.
Oh, I misread; I thought you were asking how many .pdmodel files there are, which is why I pasted the numbered file. Could I add you as a contact? It would be more efficient.
xxx
Great, thanks.
In my model, Paddle Inference produced a large number of copies between CPU and GPU memory when running inference with TensorRT; this can be resolved by updating Paddle Inference to the latest develop build.