Optimize inference performance of ERNIE on P40 GPU
Owners
@Xreki @zhaoyuchen2018
Initial performance

- Test date: 2019-08-14
- Tester: @Xreki
- GPU platform: Tesla P40
- Software:
  - Driver version: 418.39
  - CUDA 9.0
  - cuDNN 7.5
- Paddle commit:
  commit 744279fe685dd0b8b426a686d84ad449da02366e
  Author: Kevin <[email protected]>
  Date: Mon Aug 12 10:13:12 2019 +0800
  Refine embedding Api doc (#18820)
- Test code: https://github.com/PaddlePaddle/benchmark/pull/164
- Docker image used to build Paddle: paddlepaddle/paddle_manylinux_devel:cuda9.0_cudnn7
- Docker image used to build the test program and run the tests: paddlepaddle/paddle:latest-gpu-cuda9.0-cudnn7-dev
- Test results:
  - GPU ratio: 96%
  - Runtime: 8.3554 ms/sample
Profiling results
-------------------------> Profiling Report <-------------------------
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::mul 370666 32332.5 12646.616698 (0.391142) 19685.930916 (0.608858) 0.026226 95.3523 0.0872283 0.341415
thread0::elementwise_add 561008 16561.7 14064.876774 (0.849241) 2496.820884 (0.150759) 0.017456 9.10075 0.0295213 0.174883
thread0::softmax 60108 12311.9 4285.399399 (0.348069) 8026.520592 (0.651931) 0.18358 6.97239 0.20483 0.130007
thread0::matmul 125225 7621.9 4999.121478 (0.655889) 2622.776986 (0.344111) 0.026823 6.93527 0.0608656 0.0804832
thread0::layer_norm 125225 6890.62 3494.377954 (0.507121) 3396.237224 (0.492879) 0.046834 6.54613 0.0550259 0.0727612
thread0::transpose2 240432 6706.17 4940.736832 (0.736745) 1765.436248 (0.263255) 0.022342 6.87583 0.0278922 0.0708136
thread0::dropout 190342 4844.09 4380.131806 (0.904222) 463.956297 (0.095778) 0.019757 1.91613 0.0254494 0.051151
thread0::reshape2 240432 1970.44 1970.427168 (0.999993) 0.013691 (0.000007) 0.005761 0.758181 0.00819542 0.0208068
thread0::scale 70126 1559.56 1413.281218 (0.906202) 146.283474 (0.093798) 0.017739 6.40694 0.0222395 0.0164682
thread0::relu 60108 1533.83 1155.052049 (0.753053) 378.773663 (0.246947) 0.020695 1.91584 0.0255178 0.0161964
thread0::stack 5009 670.06 620.412429 (0.925906) 49.647450 (0.074094) 0.104312 3.43668 0.133771 0.00707548
thread0::lookup_table 15027 499.303 409.325315 (0.819793) 89.978163 (0.180207) 0.025125 0.087498 0.0332271 0.00527238
thread0::fetch 5009 260.127 260.068816 (0.999777) 0.057970 (0.000223) 0.044476 1.01272 0.0519319 0.0027468
thread0::TensorCopy:GPU->CPU 5009 210.123 210.076980 (0.999782) 0.045869 (0.000218) 0.036519 0.996315 0.0419491 0.00221879
thread0::GpuMemcpyAsync:CPU->GPU 20036 204.476 173.512282 (0.848571) 30.963505 (0.151429) 0.007 1.77937 0.0102054 0.00215916
thread0::GpuMemcpySync:GPU->CPU 5009 203.349 194.875273 (0.958327) 8.474119 (0.041673) 0.035868 0.330009 0.0405968 0.00214726
thread0::slice 5009 153.525 136.187490 (0.887070) 17.337539 (0.112930) 0.025069 1.98001 0.0306498 0.00162114
thread0::tanh 5009 110.729 100.944808 (0.911638) 9.784189 (0.088362) 0.017612 1.92101 0.022106 0.00116924
thread0::feed 20036 57.2692 57.263221 (0.999896) 0.005970 (0.000104) 0.00156 0.032812 0.00285831 0.000604732
Timeline analysis

- Overall, the GPU is kept fairly busy.
- There is a fair amount of GPU idle time at the beginning of the run.
- The stack op performs cuMalloc and cuFree, plus two CUDA stream synchronizations.
- There are many reshape and transpose operations.
- Does softmax use cuDNN?
Optimization plan
- [ ] Fuse the multiple lookup_table ops, @Xreki, due 2019-08-30
  - One input consists of ids like [00000,11111] (only the two values 0 and 1); some implementations replace this part with a matmul.
- [ ] Determine whether the reshape ops in the model are necessary and can be removed; many reshapes appear to have identical input and output shapes. @Xreki, see https://github.com/PaddlePaddle/benchmark/issues/165#issuecomment-521229670
  - They cannot be removed directly.
- [x] At inference time, dropout can be converted into scale, @Xreki, due 2019-08-22, see https://github.com/PaddlePaddle/Paddle/pull/19297
- [ ] Merge the parallel branches of multi-head attention, @zhaoyuchen2018, due 2019-08-30
  - From left to right there are 3 branches:
    1. Transpose(Reshape(X*W0+b0))
    2. Transpose(Reshape(X*W1+b1))
    3. Transpose(Reshape(X*W2+b2))*scale -> Transpose(Reshape(X*W2*scale+b2*scale)); after this rewrite, the branch has exactly the same structure as branches 1 and 2.
  - The three branches can then be fused into Transpose(Reshape(X*[W0, W1, W2*scale] + [b0, b1, b2*scale])), where [·] concatenates along the output dimension; a numerical check of this algebra is sketched in the path-fusion section below.
  - Only a fuse pass needs to be written; no new op is required.
  - The same rewrite applies in several places in the graph.
- [x] elementwise + activation computation; a pass and a fused op already exist, may need some enhancement @Xreki
  - https://github.com/PaddlePaddle/benchmark/issues/165#issuecomment-525648551
  - Fused into fc_op, https://github.com/PaddlePaddle/benchmark/issues/165#issuecomment-529325915, 2.1% speedup (2019-09-10)
    - FC GPU kernel: https://github.com/PaddlePaddle/Paddle/pull/19687
    - FC fuse pass: https://github.com/PaddlePaddle/Paddle/pull/19733
- [ ] elementwise + softmax fusion, @zhaoyuchen2018
- [x] elementwise + layer_norm fusion, @Xreki
  - fc + elementwise_add + layer_norm fusion, https://github.com/PaddlePaddle/Paddle/pull/19776, https://github.com/PaddlePaddle/benchmark/issues/165#issuecomment-530245969 (2019-09-11)
- [ ] LN + FC can be fused; after the fusion, LN's gamma and beta can both be removed (see the sketch after this list)
  - FC(LN(X)) => W2*(W1X+B1)+B2 => W2W1X+W2B1+B2, writing LN's affine part as W1X+B1 with W1 = diag(gamma), B1 = beta
  - W2W1 and W2B1+B2 can both be precomputed, so this whole part reduces to a single FC.
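A minimal numpy sketch of this LN-into-FC fold (an illustration of the algebra only, not Paddle code; float64 keeps the comparison exact):

```python
import numpy as np

# Sketch of the LN+FC fold. LN(x) = gamma * normalize(x) + beta,
# i.e. W1 = diag(gamma) and B1 = beta in the derivation above.
rng = np.random.default_rng(0)
d_in, d_out = 768, 3072
x = rng.standard_normal((128, d_in))
gamma, beta = rng.standard_normal(d_in), rng.standard_normal(d_in)
W2, B2 = rng.standard_normal((d_in, d_out)), rng.standard_normal(d_out)

def normalize(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# Original: FC(LN(x))
y_ref = (gamma * normalize(x) + beta) @ W2 + B2

# Folded: W2' = diag(gamma) @ W2 and B2' = beta @ W2 + B2 are precomputed,
# so only a single FC over normalize(x) remains; gamma and beta disappear.
W2_folded = gamma[:, None] * W2
B2_folded = beta @ W2 + B2
y_fused = normalize(x) @ W2_folded + B2_folded

assert np.allclose(y_ref, y_fused)
```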
Some related work from Intel

- Remove the reshape and transpose ops in the attention module: https://github.com/PaddlePaddle/Paddle/pull/16342
- Extend matmul to support multi-head: https://github.com/PaddlePaddle/Paddle/pull/18570
Determine whether the reshape ops in the model are necessary and can be removed; many reshapes appear to have identical input and output shapes. @Xreki
I0814 10:13:53.083331 11085 operator.cc:170] CPUPlace Op(reshape2), inputs:{Shape[], ShapeTensor[], X[fc_0.tmp_1:float[1, 128, 768]({})]}, outputs:{Out[fc_0.tmp_1:float[1, 128, 768]({})], XShape[reshape2_0.tmp_0:[-1]({{}})]}.
I0814 10:13:53.083364 11085 operator.cc:993] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CPUPlace]:library_type[PLAIN]
I0814 10:13:53.083400 11085 tensor_util.cu:28] TensorCopy 1, 128, 12, 64 from CPUPlace to CPUPlace
I0814 10:13:53.083407 11085 tensor_util.cu:43] Skip copy the same data async from CPUPlace to CPUPlace
I0814 10:13:53.083420 11085 operator.cc:190] CPUPlace Op(reshape2), inputs:{Shape[], ShapeTensor[], X[fc_0.tmp_1:float[1, 128, 12, 64]({})]}, outputs:{Out[fc_0.tmp_1:float[1, 128, 12, 64]({})], XShape[reshape2_0.tmp_0:[-1]({{}})]}.
- The detailed log shows that reshape2's input dims are [1, 128, 768] and its output dims are [1, 128, 12, 64], so it cannot be removed directly.
- Because reshape2 executes in place, the input dims displayed in the graph are incorrect.

Conclusion: the reshape ops cannot be removed directly.
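As a small illustration (numpy standing in for the framework tensor), such a reshape only rewrites shape metadata over the same buffer, which is what makes the in-place execution possible:

```python
import numpy as np

# The reshape from [1, 128, 768] to [1, 128, 12, 64] changes only the
# shape metadata; the underlying buffer is shared, no data is copied.
x = np.zeros((1, 128, 768), dtype=np.float32)
y = x.reshape(1, 128, 12, 64)

assert y.base is x        # y is a view over x's buffer, no copy happened
y[0, 0, 0, 0] = 1.0
assert x[0, 0, 0] == 1.0  # a write through the view is visible in x
```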
Summary of optimization results

- Unit: ms/sample

Version | P40 time (total) | Time (excluding first run) | Test date | PR | Description | Speedup
---|---|---|---|---|---|---
0 | 8.36 | - | 2019-08-14 | - | baseline | -
1 (merged) | 8.22 | 8.14 | 2019-08-22 | https://github.com/PaddlePaddle/Paddle/pull/19297 | Remove dropout_op | 1.7%
2 (merged) | 7.37 | 6.96 | 2019-08-22 | https://github.com/PaddlePaddle/Paddle/pull/19346 | Use cuDNN for softmax | 10%
3 (merged) | 7.20 | 6.82 | 2019-09-10 | https://github.com/PaddlePaddle/Paddle/pull/19687 https://github.com/PaddlePaddle/Paddle/pull/19733 | Fuse the FC computation, see https://github.com/PaddlePaddle/benchmark/issues/165#issuecomment-529325915 | 2.1%
4 (merged) | 6.93 | 6.55 | 2019-09-11 | https://github.com/PaddlePaddle/Paddle/pull/19776 | Fuse FC + elementwise_add + layer_norm, see https://github.com/PaddlePaddle/benchmark/issues/165#issuecomment-530245969 | 4%
5 (merged) | 6.13 | 5.73 | 2019-09-19 | https://github.com/PaddlePaddle/Paddle/pull/20167 https://github.com/PaddlePaddle/Paddle/pull/20151 | Fuse multi-head attention, see https://github.com/PaddlePaddle/benchmark/issues/165#issuecomment-555797150 | 14%
3 | 7.26 | 6.88 | 2019-09-03 | https://github.com/PaddlePaddle/Paddle/pull/19610 | Optimize stack_op, see https://github.com/PaddlePaddle/benchmark/issues/165#issuecomment-527316205 | 1.2%
On top of version 1, enabling `fuse_elewise_add_act_pass` at inference time fuses the `elementwise_add` + `relu` computation; it successfully matched 14 subgraphs, but overall performance regressed (see the log below).
+ /paddle/build_paddle/build_docker_manylinux_cuda90/paddle/fluid/inference/tests/api/samples/ernie_tester --logtostderr --model_dir=/data/ernie/model --data=/data/ernie/seq128_data/test_ds --repeat=1 --warmup_steps=1 --use_gpu=true --use_analysis=true --print_outputs=false --profile=false
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from ernie
[ RUN ] ernie.profile
--- Running analysis [ir_graph_build_pass]
--- Running analysis [ir_graph_clean_pass]
--- Running analysis [ir_analysis_pass]
--- Running IR pass [is_test_pass]
--- Running IR pass [simplify_with_basic_ops_pass]
--- Running IR pass [fuse_elewise_add_act_pass]
--- detected 14 subgraphs
--- Running IR pass [conv_affine_channel_fuse_pass]
--- Running IR pass [conv_eltwiseadd_affine_channel_fuse_pass]
--- Running IR pass [conv_bn_fuse_pass]
--- Running IR pass [conv_eltwiseadd_bn_fuse_pass]
--- Running IR pass [conv_elementwise_add_act_fuse_pass]
--- Running IR pass [conv_elementwise_add2_act_fuse_pass]
--- Running IR pass [conv_elementwise_add_fuse_pass]
--- Running IR pass [transpose_flatten_concat_fuse_pass]
--- Running IR pass [runtime_context_cache_pass]
--- Running analysis [ir_params_sync_among_devices_pass]
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0828 08:49:37.436837 7631 ir_params_sync_among_devices_pass.cc:41] Sync params from CPU to GPU
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [ir_graph_to_program_pass]
I0828 08:49:37.726634 7631 analysis_predictor.cc:475] == optimize end ==
I0828 08:49:38.710613 7631 ernie_tester.cc:210] Load 5010 samples from /data/ernie/seq128_data/test_ds
W0828 08:49:39.109542 7631 device_context.cc:261] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.1, Runtime API Version: 9.0
W0828 08:49:39.115780 7631 device_context.cc:269] device: 0, cuDNN Version: 7.3.
W0828 08:49:39.116070 7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.116091 7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.116104 7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
W0828 08:49:39.127769 7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.127822 7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.127832 7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
W0828 08:49:39.137923 7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.137972 7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.137981 7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
W0828 08:49:39.147920 7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.147967 7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.147977 7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
W0828 08:49:39.157951 7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.157996 7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.158005 7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
I0828 08:50:21.810214 7631 ernie_tester.cc:345] Run 5010 samples, average latency: 8.60096 ms per sample.
I0828 08:50:21.810283 7631 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 8.51942 ms per sample.
[ OK ] ernie.profile (46355 ms)
[----------] 1 test from ernie (46355 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (46356 ms total)
[ PASSED ] 1 test.
Fuse the fc, reshape, transpose, and scale of the 3 paths into a single path. Test results on P40 are given below.

The graph after fusion is shown below:

Before fusion:

Profiling results
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::mul 250450 30086.2 10700.311709 (0.355655) 19385.885160 (0.644345) 0.027581 88.3573 0.120129 0.39598
thread0::elementwise_add 440792 13893.1 11593.975031 (0.834514) 2299.111468 (0.165486) 0.019193 12.8221 0.0315185 0.182854
thread0::matmul 125225 7716.83 5090.761228 (0.659696) 2626.072493 (0.340304) 0.02993 7.32385 0.0616237 0.101565
thread0::layer_norm 125225 7091.46 3754.750949 (0.529475) 3336.704465 (0.470525) 0.048728 7.12371 0.0566297 0.0933344
thread0::softmax 60108 4259.71 1991.152624 (0.467438) 2268.559610 (0.532562) 0.063841 9.24836 0.0708676 0.0560643
thread0::transpose2 120216 4126.7 2756.571446 (0.667985) 1370.123630 (0.332015) 0.025956 9.50509 0.0343273 0.0543136
thread0::split 60108 2998.5 1830.882725 (0.610599) 1167.620505 (0.389401) 0.043009 19.4484 0.0498853 0.0394649
thread0::relu 60108 1620.75 1233.311364 (0.760953) 387.433865 (0.239047) 0.021955 9.30877 0.0269639 0.0213315
thread0::reshape2 120216 1279.38 1279.361341 (0.999987) 0.016020 (0.000013) 0.008744 1.83394 0.0106423 0.0168386
thread0::stack 5009 673.79 610.591735 (0.906205) 63.198156 (0.093795) 0.110177 11.1131 0.134516 0.0088681
thread0::lookup_table 15027 517.377 424.928699 (0.821314) 92.448017 (0.178686) 0.025679 6.76539 0.0344298 0.00680947
thread0::fetch 5009 342.348 342.270593 (0.999773) 0.077819 (0.000227) 0.057001 1.4229 0.0683467 0.00450583
thread0::TensorCopy:GPU->CPU 5009 261.036 260.979443 (0.999784) 0.056311 (0.000216) 0.04429 0.284401 0.0521133 0.00343563
thread0::scale 10018 253.319 237.712034 (0.938391) 15.606845 (0.061609) 0.020368 0.161032 0.0252864 0.00333406
thread0::GpuMemcpyAsync:CPU->GPU 20036 248.089 208.310791 (0.839661) 39.778466 (0.160339) 0.008811 1.84374 0.0123822 0.00326523
thread0::GpuMemcpySync:GPU->CPU 5009 241.886 232.381808 (0.960708) 9.504255 (0.039292) 0.040716 0.390521 0.0482903 0.00318359
thread0::slice 5009 160.703 143.249188 (0.891391) 17.453703 (0.108609) 0.02768 0.07693 0.0320828 0.0021151
thread0::tanh 5009 117.589 107.550767 (0.914632) 10.038396 (0.085368) 0.01952 0.884425 0.0234756 0.00154765
thread0::feed 20036 90.2748 90.263520 (0.999875) 0.011311 (0.000125) 0.002949 0.964233 0.00450563 0.00118815
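The merge rests on the algebra from the optimization plan: the three Transpose(Reshape(X*Wi+bi)) branches (with branch 3's scale folded into W2/b2) collapse into one GEMM against concatenated weights, followed by a single reshape/transpose and a split. A minimal numpy sketch (a numerical check of the algebra only, not the Paddle pass; shapes follow the ERNIE-base config):

```python
import numpy as np

# Each branch computes Transpose(Reshape(X @ Wi + bi)); the merged path
# runs one GEMM against [W0 | W1 | W2*scale] and then splits the result.
rng = np.random.default_rng(0)
B, S, H, heads = 1, 128, 768, 12
X = rng.standard_normal((B, S, H))
Ws = [rng.standard_normal((H, H)) for _ in range(3)]
bs = [rng.standard_normal(H) for _ in range(3)]
scale = 0.125

def branch(X, W, b):
    y = (X @ W + b).reshape(B, S, heads, H // heads)
    return y.transpose(0, 2, 1, 3)               # [B, heads, S, H/heads]

q = branch(X, Ws[0], bs[0])
k = branch(X, Ws[1], bs[1])
v = branch(X, Ws[2] * scale, bs[2] * scale)      # scale folded into W2, b2

# Fused path: one big GEMM, one reshape/transpose, then a split.
W_qkv = np.concatenate([Ws[0], Ws[1], Ws[2] * scale], axis=1)   # [H, 3H]
b_qkv = np.concatenate([bs[0], bs[1], bs[2] * scale])
y = (X @ W_qkv + b_qkv).reshape(B, S, 3, heads, H // heads)
q2, k2, v2 = (y[:, :, i].transpose(0, 2, 1, 3) for i in range(3))

assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)
```

One large GEMM replaces three small ones, saving kernel launches and giving the GPU a better-shaped workload.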
`stack` is functionally similar to `concat`. Two approaches were tried.

Approach 1: replace `stack` with `concat`. This failed at runtime: `stack` and `concat` produce outputs with different dimensions (see the logs and the shape sketch below), so the direct swap is not feasible.

- Output dims of `stack`:
I0820 09:17:21.934635 17390 operator.cc:168] CUDAPlace(0) Op(stack), inputs:{X[scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({})]}, outputs:{Y[stack_0.tmp_0:float[1, 12, 128, 128]({})]}.
I0820 09:17:21.934787 17390 operator.cc:188] CUDAPlace(0) Op(stack), inputs:{X[scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({})]}, outputs:{Y[stack_0.tmp_0:float[1, 12, 128, 128]({})]}.
.
.
.
I0820 09:17:21.935680 17390 operator.cc:168] CUDAPlace(0) Op(elementwise_add), inputs:{X[matmul_1.tmp_0:float[1, 12, 128, 128]({})], Y[stack_0.tmp_0:float[1, 12, 128, 128]({})]}, outputs:{Out[tmp_2:float[1, 12, 128, 128]({})]}.
I0820 09:17:21.935708 17390 operator.cc:188] CUDAPlace(0) Op(elementwise_add), inputs:{X[matmul_1.tmp_0:float[1, 12, 128, 128]({})], Y[stack_0.tmp_0:float[1, 12, 128, 128]({})]}, outputs:{Out[tmp_2:float[1, 12, 128, 128]({})]}.
- Output dims of `concat`:
I0903 05:49:33.774837 17211 operator.cc:165] CUDAPlace(0) Op(concat), inputs:{X[scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({})]}, outputs:{Out[stack_0.tmp_0:[0]({{}})]}.
I0903 05:49:33.774863 17211 operator.cc:988] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I0903 05:49:33.774976 17211 operator.cc:185] CUDAPlace(0) Op(concat), inputs:{X[scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({})]}, outputs:{Out[stack_0.tmp_0:float[1, 1536, 128]({})]}.
.
.
.
I0903 05:49:33.775012 17211 operator.cc:165] CUDAPlace(0) Op(elementwise_add), inputs:{X[matmul_1.tmp_0:float[1, 12, 128, 128]({})], Y[stack_0.tmp_0:float[1, 1536, 128]({})]}, outputs:{Out[tmp_2:[0]({{}})]}.
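A two-line numpy sketch (numpy's `stack`/`concatenate` mirror the two ops' semantics here) makes the mismatch from the logs concrete for the twelve [1, 128, 128] inputs:

```python
import numpy as np

# stack inserts a new axis, concat extends an existing one,
# so their outputs cannot be interchanged.
xs = [np.zeros((1, 128, 128), dtype=np.float32) for _ in range(12)]

print(np.stack(xs, axis=1).shape)        # (1, 12, 128, 128) -- stack_op's output
print(np.concatenate(xs, axis=1).shape)  # (1, 1536, 128)    -- concat's output
```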
Approach 2: reimplement `stack` by following `concat`'s implementation.
I0903 09:57:36.458448 29283 ernie_tester.cc:345] Run 5010 samples, average latency: 7.26373 ms per sample.
I0903 09:57:36.458526 29283 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 6.87974 ms per sample.
The test results above are based on version 2.
I0920 09:15:41.505820 25415 ernie_tester.cc:345] Run 5010 samples, average latency: 6.92085 ms per sample.
I0920 09:15:41.505903 25415 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 6.51521 ms per sample.
The test results above are based on version 4.
Implement the FC GPU kernel: https://github.com/PaddlePaddle/Paddle/pull/19687. FC fusion pass: https://github.com/PaddlePaddle/Paddle/pull/19733

- fc = mul + elementwise_add + relu; the main point is to fuse the elementwise_add and relu computations into fc, as sketched below.
- Test result, speedup: 2.1%
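A short numpy sketch of the fused semantics (an illustration, not the op's source; `with_relu` mirrors the flag printed by fc_fuse_pass in the log below):

```python
import numpy as np

# One op computes relu(X @ W + b) instead of separate mul,
# elementwise_add and relu ops.
def fused_fc(X, W, b, with_relu=True):
    y = X @ W + b                         # mul + elementwise_add
    return np.maximum(y, 0.0) if with_relu else y
```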
--- Running IR pass [fc_fuse_pass]
--- detected 12 subgraphs
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0909 06:52:40.161475 7826 fc_fuse_pass.cc:122] with_relu: 1, found_fc_count: 12
--- detected 62 subgraphs
I0909 06:52:40.214682 7826 fc_fuse_pass.cc:122] with_relu: 0, found_fc_count: 74
.
.
.
I0909 06:51:05.903086 7821 ernie_tester.cc:345] Run 5010 samples, average latency: 7.19529 ms per sample.
I0909 06:51:05.903168 7821 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 6.81763 ms per sample.
- Profile
-------------------------> Profiling Report <-------------------------
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::fc 370666 37927.8 16796.211849 (0.442847) 21131.552513 (0.557153) 0.038417 81.8839 0.102323 0.511083
thread0::matmul 125225 7542.57 4853.428136 (0.643472) 2689.138129 (0.356528) 0.028322 6.91776 0.0602321 0.101637
thread0::layer_norm 125225 6785.64 3458.361240 (0.509659) 3327.277901 (0.490341) 0.046506 6.78461 0.0541876 0.0914376
thread0::transpose2 240432 6648.82 4993.952105 (0.751103) 1654.869939 (0.248897) 0.022683 6.97236 0.0276536 0.089594
thread0::elementwise_add 190342 4846.03 3903.958820 (0.805599) 942.072546 (0.194401) 0.018709 7.06821 0.0254596 0.0653011
thread0::softmax 60108 4155.86 1873.948821 (0.450917) 2281.914897 (0.549083) 0.062806 7.22235 0.0691399 0.056001
thread0::reshape2 240432 2117.24 2117.212819 (0.999986) 0.030675 (0.000014) 0.006925 2.54413 0.008806 0.0285302
thread0::scale 70126 1639.13 1402.350356 (0.855547) 236.777127 (0.144453) 0.017914 3.11279 0.023374 0.0220875
thread0::stack 5009 651.86 601.077905 (0.922097) 50.782113 (0.077903) 0.106259 4.12809 0.130138 0.00878393
thread0::lookup_table 15027 479.71 388.084297 (0.808999) 91.625230 (0.191001) 0.024586 0.085041 0.0319232 0.00646417
thread0::fetch 5009 333.739 333.664085 (0.999776) 0.074696 (0.000224) 0.054763 2.98519 0.0666278 0.00449719
thread0::TensorCopy:GPU->CPU 5009 264.276 264.216676 (0.999777) 0.059003 (0.000223) 0.043246 2.97079 0.0527602 0.00356116
thread0::GpuMemcpySync:GPU->CPU 5009 244.643 235.804134 (0.963870) 8.839005 (0.036130) 0.04009 1.58306 0.0488407 0.00329661
thread0::GpuMemcpyAsync:CPU->GPU 20036 226.793 195.272127 (0.861014) 31.521131 (0.138986) 0.007882 1.17277 0.0113193 0.00305608
thread0::slice 5009 147.886 130.195979 (0.880383) 17.689569 (0.119617) 0.025414 0.103978 0.029524 0.00199278
thread0::tanh 5009 122.009 112.002519 (0.917989) 10.006004 (0.082011) 0.019717 0.094528 0.0243579 0.00164409
thread0::feed 20036 76.5896 76.581383 (0.999892) 0.008248 (0.000108) 0.002877 0.308487 0.0038226 0.00103206
fc + elementwise_add + layer_norm fusion
--- Running IR pass [fc_elementwise_layernorm_fuse_pass]
--- detected 24 subgraphs
.
.
.
I0911 06:41:55.494792 7238 ernie_tester.cc:345] Run 5010 samples, average latency: 6.92849 ms per sample.
I0911 06:41:55.494873 7238 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 6.54723 ms per sample.
- Profile
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::fc 255459 24269.7 12361.669096 (0.509346) 11908.043273 (0.490654) 0.039433 100.758 0.0950043 0.346056
thread0::fused_fc_elementwise_layernorm 115207 17699.9 6439.933305 (0.363841) 11259.921061 (0.636159) 0.103025 6.90034 0.153635 0.252378
thread0::matmul 125225 7670.31 4985.129880 (0.649925) 2685.183125 (0.350075) 0.029412 6.91239 0.0612522 0.109369
thread0::transpose2 240432 6889.05 5254.959850 (0.762799) 1634.087984 (0.237201) 0.023907 6.72677 0.0286528 0.0982293
thread0::softmax 60108 4204.46 1927.686603 (0.458486) 2276.771181 (0.541514) 0.064249 2.83378 0.0699484 0.0599504
thread0::reshape2 240432 2228.86 2228.844417 (0.999994) 0.014348 (0.000006) 0.007575 3.75591 0.00927023 0.0317808
thread0::elementwise_add 75135 2160.14 1641.503732 (0.759904) 518.641058 (0.240096) 0.019577 6.72642 0.0287502 0.030801
thread0::scale 70126 1736.05 1498.515875 (0.863178) 237.529504 (0.136822) 0.018853 6.66338 0.0247561 0.0247539
thread0::stack 5009 664.48 615.254962 (0.925920) 49.224601 (0.074080) 0.10776 3.887 0.132657 0.00947466
thread0::layer_norm 10018 592.675 324.577447 (0.547648) 268.097787 (0.452352) 0.050985 6.49622 0.059161 0.00845082
thread0::lookup_table 15027 499.198 412.769395 (0.826866) 86.428197 (0.173134) 0.025502 6.66372 0.03322 0.00711794
thread0::fetch 5009 362.799 362.676040 (0.999660) 0.123295 (0.000340) 0.05508 0.625043 0.0724295 0.00517307
thread0::TensorCopy:GPU->CPU 5009 297.886 297.782095 (0.999652) 0.103633 (0.000348) 0.043194 0.612426 0.0594701 0.00424748
thread0::GpuMemcpySync:GPU->CPU 5009 278.55 270.104589 (0.969679) 8.445858 (0.030321) 0.039548 0.609273 0.05561 0.00397179
thread0::GpuMemcpyAsync:CPU->GPU 20036 230.007 198.920144 (0.864846) 31.086396 (0.135154) 0.008387 0.708778 0.0114797 0.00327961
thread0::slice 5009 152.385 135.526127 (0.889365) 16.859146 (0.110635) 0.026901 0.06877 0.0304223 0.00217283
thread0::tanh 5009 122.456 112.432469 (0.918148) 10.023237 (0.081852) 0.020671 0.070096 0.0244471 0.00174607
thread0::feed 20036 73.4209 73.411569 (0.999873) 0.009355 (0.000127) 0.002908 0.042732 0.00366445 0.00104689
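What fused_fc_elementwise_layernorm computes, as a numpy sketch (reference semantics only, not the CUDA kernel):

```python
import numpy as np

# fc, the residual add and layer_norm run as one op, so the intermediate
# tensors never round-trip through global memory between kernels.
def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def fused_fc_elementwise_layernorm(X, W, b, Y, gamma, beta):
    return layer_norm(X @ W + b + Y, gamma, beta)   # fc -> add -> layer_norm
```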
multi-head attention fusion

- Before
- After
--- Running analysis [ir_graph_build_pass]
--- Running analysis [ir_graph_clean_pass]
--- Running analysis [ir_analysis_pass]
--- Running IR pass [cudnn_placement_pass]
--- Running IR pass [is_test_pass]
--- Running IR pass [simplify_with_basic_ops_pass]
--- Running IR pass [conv_affine_channel_fuse_pass]
--- Running IR pass [conv_eltwiseadd_affine_channel_fuse_pass]
--- Running IR pass [conv_bn_fuse_pass]
--- Running IR pass [conv_eltwiseadd_bn_fuse_pass]
--- Running IR pass [multihead_matmul_fuse_pass]
I1120 01:40:10.299883 6510 graph_pattern_detector.cc:97] --- detected 12 subgraphs
--- Running IR pass [fc_fuse_pass]
I1120 01:40:10.344569 6510 graph_pattern_detector.cc:97] --- detected 12 subgraphs
I1120 01:40:10.357105 6510 graph_pattern_detector.cc:97] --- detected 26 subgraphs
--- Running IR pass [fc_elementwise_layernorm_fuse_pass]
I1120 01:40:10.397455 6510 graph_pattern_detector.cc:97] --- detected 24 subgraphs
--- Running IR pass [conv_elementwise_add_act_fuse_pass]
--- Running IR pass [conv_elementwise_add2_act_fuse_pass]
--- Running IR pass [conv_elementwise_add_fuse_pass]
--- Running IR pass [transpose_flatten_concat_fuse_pass]
--- Running IR pass [runtime_context_cache_pass]
--- Running analysis [ir_params_sync_among_devices_pass]
I1120 01:40:10.509421 6510 ir_params_sync_among_devices_pass.cc:41] Sync params from CPU to GPU
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [ir_graph_to_program_pass]
I1120 01:40:10.758424 6510 analysis_predictor.cc:474] ======= optimize end =======
I1120 01:40:11.764863 6510 ernie_tester.cc:210] Load 5010 samples from /data/ernie/seq128_data/test_ds
W1120 01:40:12.444176 6510 device_context.cc:236] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.1, Runtime API Version: 9.0
.
.
.
I1120 01:40:43.862216 6510 ernie_tester.cc:345] Run 5010 samples, average latency: 6.40495 ms per sample.
I1120 01:40:43.862303 6510 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 5.74544 ms per sample.
- Profile
-------------------------> Profiling Report <-------------------------
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::fused_fc_elementwise_layernorm 115207 17021.4 5933.953144 (0.348617) 11087.450962 (0.651383) 0.09927 8.71612 0.147746 0.259676
thread0::multihead_matmul 60108 11252.9 7224.504494 (0.642012) 4028.405127 (0.357988) 0.169878 6.57812 0.187212 0.171673
thread0::fc 75135 10707.1 4357.612778 (0.406985) 6349.445604 (0.593015) 0.038466 85.6766 0.142504 0.163345
thread0::mul 180324 9774.02 4994.708598 (0.511019) 4779.307233 (0.488981) 0.047804 9.68787 0.0542025 0.149111
thread0::fetch 5009 4630.95 4629.392276 (0.999664) 1.554398 (0.000336) 0.055675 1.92049 0.924525 0.0706489
thread0::TensorCopy:GPU->CPU 5009 4573.44 4571.905707 (0.999664) 1.538373 (0.000336) 0.04468 1.90973 0.913045 0.0697717
thread0::GpuMemcpySync:GPU->CPU 5009 4553.78 4543.954556 (0.997843) 9.821141 (0.002157) 0.040962 1.9062 0.909119 0.0694716
thread0::stack 5009 684.909 634.119137 (0.925844) 50.789981 (0.074156) 0.106366 2.25957 0.136736 0.0104489
thread0::layer_norm 10018 549.522 290.654864 (0.528923) 258.867095 (0.471077) 0.048854 0.142508 0.0548535 0.00838342
thread0::lookup_table 15027 479.099 392.001286 (0.818205) 87.097997 (0.181795) 0.024349 9.00457 0.0318826 0.00730906
thread0::elementwise_add 15027 345.643 302.006333 (0.873752) 43.636814 (0.126248) 0.017625 3.1705 0.0230015 0.00527307
thread0::GpuMemcpyAsync:CPU->GPU 20036 229.869 198.181250 (0.862150) 31.687294 (0.137850) 0.008188 2.08116 0.0114728 0.00350684
thread0::scale 10018 224.898 210.406303 (0.935565) 14.491370 (0.064435) 0.018757 0.143304 0.0224494 0.003431
thread0::slice 5009 169.833 151.921321 (0.894535) 17.911378 (0.105465) 0.030184 0.072767 0.0339055 0.00259094
thread0::matmul 5009 163.061 149.402692 (0.916239) 13.658130 (0.083761) 0.028052 0.956259 0.0325536 0.00248763
thread0::tanh 5009 116.672 106.307159 (0.911166) 10.364377 (0.088834) 0.020194 0.067294 0.0232924 0.00177992
thread0::feed 20036 71.6392 71.630728 (0.999882) 0.008476 (0.000118) 0.002808 0.034718 0.00357552 0.00109292
- Timeline: the fetch op is waiting for the GPU computation to finish.
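For completeness, the attention core that the multihead_matmul subgraph replacement covers can be sketched as follows (a hedged numpy sketch of the replaced subgraph's semantics; the fused op's actual interface may differ):

```python
import numpy as np

# Per-head Q@K^T, scale plus attention-mask add, softmax, then @V.
def multihead_attention(Q, K, V, mask, scale):
    # Q, K, V: [B, heads, S, dh]; mask broadcastable to [B, heads, S, S]
    logits = Q @ K.transpose(0, 1, 3, 2) * scale + mask
    logits -= logits.max(-1, keepdims=True)        # numerically stable softmax
    probs = np.exp(logits)
    probs /= probs.sum(-1, keepdims=True)
    return probs @ V                               # [B, heads, S, dh]
```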
QA test results

- Test date: 2019-12-24
- Test environment:
  - CUDA 9.0
  - cuDNN 7.3
  - Paddle inference library version: a5a8d14414213fadcfcd7dc60c794d1a515a390e
  - Test image: paddlepaddle/paddle_manylinux_devel:cuda9.0_cudnn7
  - GPU: Tesla P4 8G (7611 MB)
  - GCC version: gcc 4.8.2 (from the image)
- Conclusions:
  - ERNIE GPU inference performance meets expectations.
  - For the same data, GPU and CPU prediction results differ by less than 1e-5.
  - Repeated GPU predictions over the same dataset differ by less than 1e-6.
- Performance data:
- Seqlen128-data

batch_size | average_latency (ms) | batch_latency (ms) | gpu_memory (MB)
---|---|---|---
1 | 8.664 | 8.664 | 737
10 | 6.388 | 63.878 | 825
20 | 6.207 | 124.149 | 919
30 | 6.530 | 195.892 | 985
40 | 6.456 | 258.260 | 1085
50 | 6.445 | 322.275 | 1135
60 | 6.430 | 385.805 | 1279
70 | 6.457 | 451.965 | 1283
80 | 6.415 | 513.178 | 1339
90 | 6.481 | 583.301 | 1431
100 | 6.504 | 650.428 | 1485

- Seqlen512-data

batch_size | average_latency (ms) | batch_latency (ms) | gpu_memory (MB)
---|---|---|---
1 | 36.123 | 36.123 | 781
10 | 31.663 | 316.630 | 1191
20 | 31.750 | 635.002 | 1637
30 | 32.280 | 968.412 | 2053
40 | 32.130 | 1285.212 | 2431
50 | 32.917 | 1645.835 | 2847
100 | 34.254 | 3425.380 | 4951
200 | out of GPU memory | |
300 | out of GPU memory | |
二期优化工作
-
负责人: @NHZlX
-
GPU平台信息:Tesla P4
-
软件信息:
- CUDA 10.0
-
优化工作
Version | P4 time (ms) | Date recorded | PR | Description | Speedup
---|---|---|---|---|---
0 | 10 | 2020-02-04 | - | baseline; the CUDA 10 build is slower than the CUDA 9 build | -
1 | 9.5 | 2020-02-04 | ~~https://github.com/PaddlePaddle/Paddle/pull/22424~~ https://github.com/PaddlePaddle/Paddle/pull/22486 | Also fuse the 3 FCs into multihead_matmul_op, merging the 3 small GEMMs into one large GEMM | 5%
2 | 8.4 | 2020-02-04 | ~~https://github.com/PaddlePaddle/Paddle/pull/22424~~ https://github.com/PaddlePaddle/Paddle/pull/22494 | Fuse the 3 lookup_tables + elementwise_add + layer_norm into fused_embedding_eltwise_layernorm_op (sketched after this table) | 11.6%
3 | 7.8 | 2020-02-05 | | In the layer_norm computation of fused_fc_elementwise_layernorm_op, change the in-block reduce from double to float | |
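A numpy sketch of what version 2's fused_embedding_eltwise_layernorm_op computes (reference semantics only; the three embedding-table names are illustrative):

```python
import numpy as np

# Three lookup_tables, an elementwise sum and a layer_norm in one kernel.
def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def fused_embedding_eltwise_layernorm(word_ids, pos_ids, sent_ids,
                                      word_emb, pos_emb, sent_emb,
                                      gamma, beta):
    # numpy fancy indexing stands in for the three lookup_table ops
    x = word_emb[word_ids] + pos_emb[pos_ids] + sent_emb[sent_ids]
    return layer_norm(x, gamma, beta)
```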