Optimize inference performance of ERNIE on P40 GPU
Owners
@Xreki @zhaoyuchen2018
Initial performance

- Test date: 2019-08-14
- Tester: @Xreki
- GPU platform: Tesla P40
- Software:
  - Driver version: 418.39
  - CUDA 9.0
  - cuDNN 7.5
- Paddle commit:
  commit 744279fe685dd0b8b426a686d84ad449da02366e
  Author: Kevin <[email protected]>
  Date: Mon Aug 12 10:13:12 2019 +0800
  Refine embedding Api doc (#18820)
- Test code: https://github.com/PaddlePaddle/benchmark/pull/164
- Docker image used to build Paddle: paddlepaddle/paddle_manylinux_devel:cuda9.0_cudnn7
- Docker image used to build the test program and run the tests: paddlepaddle/paddle:latest-gpu-cuda9.0-cudnn7-dev
- Test results:
  - GPU ratio: 96%
  - Runtime: 8.3554 ms/sample
Profiling results
-------------------------> Profiling Report <-------------------------
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::mul 370666 32332.5 12646.616698 (0.391142) 19685.930916 (0.608858) 0.026226 95.3523 0.0872283 0.341415
thread0::elementwise_add 561008 16561.7 14064.876774 (0.849241) 2496.820884 (0.150759) 0.017456 9.10075 0.0295213 0.174883
thread0::softmax 60108 12311.9 4285.399399 (0.348069) 8026.520592 (0.651931) 0.18358 6.97239 0.20483 0.130007
thread0::matmul 125225 7621.9 4999.121478 (0.655889) 2622.776986 (0.344111) 0.026823 6.93527 0.0608656 0.0804832
thread0::layer_norm 125225 6890.62 3494.377954 (0.507121) 3396.237224 (0.492879) 0.046834 6.54613 0.0550259 0.0727612
thread0::transpose2 240432 6706.17 4940.736832 (0.736745) 1765.436248 (0.263255) 0.022342 6.87583 0.0278922 0.0708136
thread0::dropout 190342 4844.09 4380.131806 (0.904222) 463.956297 (0.095778) 0.019757 1.91613 0.0254494 0.051151
thread0::reshape2 240432 1970.44 1970.427168 (0.999993) 0.013691 (0.000007) 0.005761 0.758181 0.00819542 0.0208068
thread0::scale 70126 1559.56 1413.281218 (0.906202) 146.283474 (0.093798) 0.017739 6.40694 0.0222395 0.0164682
thread0::relu 60108 1533.83 1155.052049 (0.753053) 378.773663 (0.246947) 0.020695 1.91584 0.0255178 0.0161964
thread0::stack 5009 670.06 620.412429 (0.925906) 49.647450 (0.074094) 0.104312 3.43668 0.133771 0.00707548
thread0::lookup_table 15027 499.303 409.325315 (0.819793) 89.978163 (0.180207) 0.025125 0.087498 0.0332271 0.00527238
thread0::fetch 5009 260.127 260.068816 (0.999777) 0.057970 (0.000223) 0.044476 1.01272 0.0519319 0.0027468
thread0::TensorCopy:GPU->CPU 5009 210.123 210.076980 (0.999782) 0.045869 (0.000218) 0.036519 0.996315 0.0419491 0.00221879
thread0::GpuMemcpyAsync:CPU->GPU 20036 204.476 173.512282 (0.848571) 30.963505 (0.151429) 0.007 1.77937 0.0102054 0.00215916
thread0::GpuMemcpySync:GPU->CPU 5009 203.349 194.875273 (0.958327) 8.474119 (0.041673) 0.035868 0.330009 0.0405968 0.00214726
thread0::slice 5009 153.525 136.187490 (0.887070) 17.337539 (0.112930) 0.025069 1.98001 0.0306498 0.00162114
thread0::tanh 5009 110.729 100.944808 (0.911638) 9.784189 (0.088362) 0.017612 1.92101 0.022106 0.00116924
thread0::feed 20036 57.2692 57.263221 (0.999896) 0.005970 (0.000104) 0.00156 0.032812 0.00285831 0.000604732
Timeline analysis

- Overall, the GPU is kept fairly busy.
- There is a fair amount of GPU idle time at the beginning of the run.
- The stack op performs cuMalloc and cuFree, plus two CUDA stream synchronizations.
- There are many reshape and transpose operations.
- Does softmax use cuDNN?
Optimization plan
- [ ] Fuse the multiple lookup_table ops, @Xreki, due 2019-08-30
  - One input consists of ids like [00000,11111] (only the two values 0 and 1); some implementations replace this part with a matmul.
- [ ] Determine whether the reshape ops in the model are necessary and can be removed; many reshapes appear to have identical input and output shapes. @Xreki, see https://github.com/PaddlePaddle/benchmark/issues/165#issuecomment-521229670
  - They cannot be removed directly.
- [x] At inference time, dropout can be converted into scale, @Xreki, due 2019-08-22, see https://github.com/PaddlePaddle/Paddle/pull/19297
- [ ] Merge the parallel branches of multi-head attention, @zhaoyuchen2018, due 2019-08-30
  - From left to right there are 3 branches:
    1. Transpose(Reshape(X*W0+b0))
    2. Transpose(Reshape(X*W1+b1))
    3. Transpose(Reshape(X*W2+b2))*scale -> Transpose(Reshape(X*W2*scale+b2*scale)); after this rewrite, the branch has exactly the same structure as branches 1 and 2.
  - The three branches can then be fused into Transpose(Reshape(X*[W0, W1, W2*scale] + [b0, b1, b2*scale])), where [·] concatenates along the output dimension; a numerical check of this algebra is sketched in the path-fusion section below.
  - Only a fuse pass needs to be written; no new op is required.
  - The same rewrite applies in several places in the graph.
- [x] elementwise + activation computation; a pass and a fused op already exist, may need some enhancement @Xreki
  - https://github.com/PaddlePaddle/benchmark/issues/165#issuecomment-525648551
  - Fused into fc_op, https://github.com/PaddlePaddle/benchmark/issues/165#issuecomment-529325915, 2.1% speedup (2019-09-10)
    - FC GPU kernel: https://github.com/PaddlePaddle/Paddle/pull/19687
    - FC fuse pass: https://github.com/PaddlePaddle/Paddle/pull/19733
- [ ] elementwise + softmax fusion, @zhaoyuchen2018
- [x] elementwise + layer_norm fusion, @Xreki
  - fc + elementwise_add + layer_norm fusion, https://github.com/PaddlePaddle/Paddle/pull/19776, https://github.com/PaddlePaddle/benchmark/issues/165#issuecomment-530245969 (2019-09-11)
- [ ] LN + FC can be fused; after the fusion, LN's gamma and beta can both be removed (see the sketch after this list)
  - FC(LN(X)) => W2*(W1X+B1)+B2 => W2W1X+W2B1+B2, writing LN's affine part as W1X+B1 with W1 = diag(gamma), B1 = beta
  - W2W1 and W2B1+B2 can both be precomputed, so this whole part reduces to a single FC.
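A minimal numpy sketch of this LN-into-FC fold (an illustration of the algebra only, not Paddle code; float64 keeps the comparison exact):

```python
import numpy as np

# Sketch of the LN+FC fold. LN(x) = gamma * normalize(x) + beta,
# i.e. W1 = diag(gamma) and B1 = beta in the derivation above.
rng = np.random.default_rng(0)
d_in, d_out = 768, 3072
x = rng.standard_normal((128, d_in))
gamma, beta = rng.standard_normal(d_in), rng.standard_normal(d_in)
W2, B2 = rng.standard_normal((d_in, d_out)), rng.standard_normal(d_out)

def normalize(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# Original: FC(LN(x))
y_ref = (gamma * normalize(x) + beta) @ W2 + B2

# Folded: W2' = diag(gamma) @ W2 and B2' = beta @ W2 + B2 are precomputed,
# so only a single FC over normalize(x) remains; gamma and beta disappear.
W2_folded = gamma[:, None] * W2
B2_folded = beta @ W2 + B2
y_fused = normalize(x) @ W2_folded + B2_folded

assert np.allclose(y_ref, y_fused)
```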
Some related work from Intel

- Remove the reshape and transpose ops in the attention module: https://github.com/PaddlePaddle/Paddle/pull/16342
- Extend matmul to support multi-head: https://github.com/PaddlePaddle/Paddle/pull/18570
Determine whether the reshape ops in the model are necessary and can be removed; many reshapes appear to have identical input and output shapes. @Xreki
I0814 10:13:53.083331 11085 operator.cc:170] CPUPlace Op(reshape2), inputs:{Shape[], ShapeTensor[], X[fc_0.tmp_1:float[1, 128, 768]({})]}, outputs:{Out[fc_0.tmp_1:float[1, 128, 768]({})], XShape[reshape2_0.tmp_0:[-1]({{}})]}.
I0814 10:13:53.083364 11085 operator.cc:993] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CPUPlace]:library_type[PLAIN]
I0814 10:13:53.083400 11085 tensor_util.cu:28] TensorCopy 1, 128, 12, 64 from CPUPlace to CPUPlace
I0814 10:13:53.083407 11085 tensor_util.cu:43] Skip copy the same data async from CPUPlace to CPUPlace
I0814 10:13:53.083420 11085 operator.cc:190] CPUPlace Op(reshape2), inputs:{Shape[], ShapeTensor[], X[fc_0.tmp_1:float[1, 128, 12, 64]({})]}, outputs:{Out[fc_0.tmp_1:float[1, 128, 12, 64]({})], XShape[reshape2_0.tmp_0:[-1]({{}})]}.
- The detailed log shows that reshape2's input dims are [1, 128, 768] and its output dims are [1, 128, 12, 64], so it cannot be removed directly.
- Because reshape2 executes in place, the input dims displayed in the graph are incorrect.

Conclusion: the reshape ops cannot be removed directly.
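As a small illustration (numpy standing in for the framework tensor), such a reshape only rewrites shape metadata over the same buffer, which is what makes the in-place execution possible:

```python
import numpy as np

# The reshape from [1, 128, 768] to [1, 128, 12, 64] changes only the
# shape metadata; the underlying buffer is shared, no data is copied.
x = np.zeros((1, 128, 768), dtype=np.float32)
y = x.reshape(1, 128, 12, 64)

assert y.base is x        # y is a view over x's buffer, no copy happened
y[0, 0, 0, 0] = 1.0
assert x[0, 0, 0] == 1.0  # a write through the view is visible in x
```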
Summary of optimization results

- Unit: ms/sample

Version | P40 time (total) | Time (excluding first run) | Test date | PR | Description | Speedup
---|---|---|---|---|---|---
0 | 8.36 | - | 2019-08-14 | - | baseline | -
1 (merged) | 8.22 | 8.14 | 2019-08-22 | https://github.com/PaddlePaddle/Paddle/pull/19297 | Remove dropout_op | 1.7%
2 (merged) | 7.37 | 6.96 | 2019-08-22 | https://github.com/PaddlePaddle/Paddle/pull/19346 | Use cuDNN for softmax | 10%
3 (merged) | 7.20 | 6.82 | 2019-09-10 | https://github.com/PaddlePaddle/Paddle/pull/19687 https://github.com/PaddlePaddle/Paddle/pull/19733 | Fuse the FC computation, see https://github.com/PaddlePaddle/benchmark/issues/165#issuecomment-529325915 | 2.1%
4 (merged) | 6.93 | 6.55 | 2019-09-11 | https://github.com/PaddlePaddle/Paddle/pull/19776 | Fuse FC + elementwise_add + layer_norm, see https://github.com/PaddlePaddle/benchmark/issues/165#issuecomment-530245969 | 4%
5 (merged) | 6.13 | 5.73 | 2019-09-19 | https://github.com/PaddlePaddle/Paddle/pull/20167 https://github.com/PaddlePaddle/Paddle/pull/20151 | Fuse multi-head attention, see https://github.com/PaddlePaddle/benchmark/issues/165#issuecomment-555797150 | 14%
3 | 7.26 | 6.88 | 2019-09-03 | https://github.com/PaddlePaddle/Paddle/pull/19610 | Optimize stack_op, see https://github.com/PaddlePaddle/benchmark/issues/165#issuecomment-527316205 | 1.2%
On top of version 1, enabling `fuse_elewise_add_act_pass` at inference time fuses the `elementwise_add` + `relu` computation; it successfully matched 14 subgraphs, but overall performance regressed (see the log below).
+ /paddle/build_paddle/build_docker_manylinux_cuda90/paddle/fluid/inference/tests/api/samples/ernie_tester --logtostderr --model_dir=/data/ernie/model --data=/data/ernie/seq128_data/test_ds --repeat=1 --warmup_steps=1 --use_gpu=true --use_analysis=true --print_outputs=false --profile=false
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from ernie
[ RUN ] ernie.profile
--- Running analysis [ir_graph_build_pass]
--- Running analysis [ir_graph_clean_pass]
--- Running analysis [ir_analysis_pass]
--- Running IR pass [is_test_pass]
--- Running IR pass [simplify_with_basic_ops_pass]
--- Running IR pass [fuse_elewise_add_act_pass]
--- detected 14 subgraphs
--- Running IR pass [conv_affine_channel_fuse_pass]
--- Running IR pass [conv_eltwiseadd_affine_channel_fuse_pass]
--- Running IR pass [conv_bn_fuse_pass]
--- Running IR pass [conv_eltwiseadd_bn_fuse_pass]
--- Running IR pass [conv_elementwise_add_act_fuse_pass]
--- Running IR pass [conv_elementwise_add2_act_fuse_pass]
--- Running IR pass [conv_elementwise_add_fuse_pass]
--- Running IR pass [transpose_flatten_concat_fuse_pass]
--- Running IR pass [runtime_context_cache_pass]
--- Running analysis [ir_params_sync_among_devices_pass]
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0828 08:49:37.436837 7631 ir_params_sync_among_devices_pass.cc:41] Sync params from CPU to GPU
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [ir_graph_to_program_pass]
I0828 08:49:37.726634 7631 analysis_predictor.cc:475] == optimize end ==
I0828 08:49:38.710613 7631 ernie_tester.cc:210] Load 5010 samples from /data/ernie/seq128_data/test_ds
W0828 08:49:39.109542 7631 device_context.cc:261] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.1, Runtime API Version: 9.0
W0828 08:49:39.115780 7631 device_context.cc:269] device: 0, cuDNN Version: 7.3.
W0828 08:49:39.116070 7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.116091 7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.116104 7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
W0828 08:49:39.127769 7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.127822 7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.127832 7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
W0828 08:49:39.137923 7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.137972 7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.137981 7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
W0828 08:49:39.147920 7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.147967 7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.147977 7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
W0828 08:49:39.157951 7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.157996 7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.158005 7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
I0828 08:50:21.810214 7631 ernie_tester.cc:345] Run 5010 samples, average latency: 8.60096 ms per sample.
I0828 08:50:21.810283 7631 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 8.51942 ms per sample.
[ OK ] ernie.profile (46355 ms)
[----------] 1 test from ernie (46355 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (46356 ms total)
[ PASSED ] 1 test.
Fuse the fc, reshape, transpose, and scale of the 3 paths into a single path. Test results on P40 are given below.

The graph after fusion is shown below:

Before fusion:

Profiling results
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::mul 250450 30086.2 10700.311709 (0.355655) 19385.885160 (0.644345) 0.027581 88.3573 0.120129 0.39598
thread0::elementwise_add 440792 13893.1 11593.975031 (0.834514) 2299.111468 (0.165486) 0.019193 12.8221 0.0315185 0.182854
thread0::matmul 125225 7716.83 5090.761228 (0.659696) 2626.072493 (0.340304) 0.02993 7.32385 0.0616237 0.101565
thread0::layer_norm 125225 7091.46 3754.750949 (0.529475) 3336.704465 (0.470525) 0.048728 7.12371 0.0566297 0.0933344
thread0::softmax 60108 4259.71 1991.152624 (0.467438) 2268.559610 (0.532562) 0.063841 9.24836 0.0708676 0.0560643
thread0::transpose2 120216 4126.7 2756.571446 (0.667985) 1370.123630 (0.332015) 0.025956 9.50509 0.0343273 0.0543136
thread0::split 60108 2998.5 1830.882725 (0.610599) 1167.620505 (0.389401) 0.043009 19.4484 0.0498853 0.0394649
thread0::relu 60108 1620.75 1233.311364 (0.760953) 387.433865 (0.239047) 0.021955 9.30877 0.0269639 0.0213315
thread0::reshape2 120216 1279.38 1279.361341 (0.999987) 0.016020 (0.000013) 0.008744 1.83394 0.0106423 0.0168386
thread0::stack 5009 673.79 610.591735 (0.906205) 63.198156 (0.093795) 0.110177 11.1131 0.134516 0.0088681
thread0::lookup_table 15027 517.377 424.928699 (0.821314) 92.448017 (0.178686) 0.025679 6.76539 0.0344298 0.00680947
thread0::fetch 5009 342.348 342.270593 (0.999773) 0.077819 (0.000227) 0.057001 1.4229 0.0683467 0.00450583
thread0::TensorCopy:GPU->CPU 5009 261.036 260.979443 (0.999784) 0.056311 (0.000216) 0.04429 0.284401 0.0521133 0.00343563
thread0::scale 10018 253.319 237.712034 (0.938391) 15.606845 (0.061609) 0.020368 0.161032 0.0252864 0.00333406
thread0::GpuMemcpyAsync:CPU->GPU 20036 248.089 208.310791 (0.839661) 39.778466 (0.160339) 0.008811 1.84374 0.0123822 0.00326523
thread0::GpuMemcpySync:GPU->CPU 5009 241.886 232.381808 (0.960708) 9.504255 (0.039292) 0.040716 0.390521 0.0482903 0.00318359
thread0::slice 5009 160.703 143.249188 (0.891391) 17.453703 (0.108609) 0.02768 0.07693 0.0320828 0.0021151
thread0::tanh 5009 117.589 107.550767 (0.914632) 10.038396 (0.085368) 0.01952 0.884425 0.0234756 0.00154765
thread0::feed 20036 90.2748 90.263520 (0.999875) 0.011311 (0.000125) 0.002949 0.964233 0.00450563 0.00118815
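The merge rests on the algebra from the optimization plan: the three Transpose(Reshape(X*Wi+bi)) branches (with branch 3's scale folded into W2/b2) collapse into one GEMM against concatenated weights, followed by a single reshape/transpose and a split. A minimal numpy sketch (a numerical check of the algebra only, not the Paddle pass; shapes follow the ERNIE-base config):

```python
import numpy as np

# Each branch computes Transpose(Reshape(X @ Wi + bi)); the merged path
# runs one GEMM against [W0 | W1 | W2*scale] and then splits the result.
rng = np.random.default_rng(0)
B, S, H, heads = 1, 128, 768, 12
X = rng.standard_normal((B, S, H))
Ws = [rng.standard_normal((H, H)) for _ in range(3)]
bs = [rng.standard_normal(H) for _ in range(3)]
scale = 0.125

def branch(X, W, b):
    y = (X @ W + b).reshape(B, S, heads, H // heads)
    return y.transpose(0, 2, 1, 3)               # [B, heads, S, H/heads]

q = branch(X, Ws[0], bs[0])
k = branch(X, Ws[1], bs[1])
v = branch(X, Ws[2] * scale, bs[2] * scale)      # scale folded into W2, b2

# Fused path: one big GEMM, one reshape/transpose, then a split.
W_qkv = np.concatenate([Ws[0], Ws[1], Ws[2] * scale], axis=1)   # [H, 3H]
b_qkv = np.concatenate([bs[0], bs[1], bs[2] * scale])
y = (X @ W_qkv + b_qkv).reshape(B, S, 3, heads, H // heads)
q2, k2, v2 = (y[:, :, i].transpose(0, 2, 1, 3) for i in range(3))

assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)
```

One large GEMM replaces three small ones, saving kernel launches and giving the GPU a better-shaped workload.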
`stack` is functionally similar to `concat`. Two approaches were tried.

Approach 1: replace `stack` with `concat`. This failed at runtime: `stack` and `concat` produce outputs with different dimensions (see the logs and the shape sketch below), so the direct swap is not feasible.

- Output dims of `stack`:
I0820 09:17:21.934635 17390 operator.cc:168] CUDAPlace(0) Op(stack), inputs:{X[scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({})]}, outputs:{Y[stack_0.tmp_0:float[1, 12, 128, 128]({})]}.
I0820 09:17:21.934787 17390 operator.cc:188] CUDAPlace(0) Op(stack), inputs:{X[scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({})]}, outputs:{Y[stack_0.tmp_0:float[1, 12, 128, 128]({})]}.
.
.
.
I0820 09:17:21.935680 17390 operator.cc:168] CUDAPlace(0) Op(elementwise_add), inputs:{X[matmul_1.tmp_0:float[1, 12, 128, 128]({})], Y[stack_0.tmp_0:float[1, 12, 128, 128]({})]}, outputs:{Out[tmp_2:float[1, 12, 128, 128]({})]}.
I0820 09:17:21.935708 17390 operator.cc:188] CUDAPlace(0) Op(elementwise_add), inputs:{X[matmul_1.tmp_0:float[1, 12, 128, 128]({})], Y[stack_0.tmp_0:float[1, 12, 128, 128]({})]}, outputs:{Out[tmp_2:float[1, 12, 128, 128]({})]}.
- Output dims of `concat`:
I0903 05:49:33.774837 17211 operator.cc:165] CUDAPlace(0) Op(concat), inputs:{X[scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({})]}, outputs:{Out[stack_0.tmp_0:[0]({{}})]}.
I0903 05:49:33.774863 17211 operator.cc:988] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I0903 05:49:33.774976 17211 operator.cc:185] CUDAPlace(0) Op(concat), inputs:{X[scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({})]}, outputs:{Out[stack_0.tmp_0:float[1, 1536, 128]({})]}.
.
.
.
I0903 05:49:33.775012 17211 operator.cc:165] CUDAPlace(0) Op(elementwise_add), inputs:{X[matmul_1.tmp_0:float[1, 12, 128, 128]({})], Y[stack_0.tmp_0:float[1, 1536, 128]({})]}, outputs:{Out[tmp_2:[0]({{}})]}.
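A two-line numpy sketch (numpy's `stack`/`concatenate` mirror the two ops' semantics here) makes the mismatch from the logs concrete for the twelve [1, 128, 128] inputs:

```python
import numpy as np

# stack inserts a new axis, concat extends an existing one,
# so their outputs cannot be interchanged.
xs = [np.zeros((1, 128, 128), dtype=np.float32) for _ in range(12)]

print(np.stack(xs, axis=1).shape)        # (1, 12, 128, 128) -- stack_op's output
print(np.concatenate(xs, axis=1).shape)  # (1, 1536, 128)    -- concat's output
```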
Approach 2: reimplement `stack` by following `concat`'s implementation.
I0903 09:57:36.458448 29283 ernie_tester.cc:345] Run 5010 samples, average latency: 7.26373 ms per sample.
I0903 09:57:36.458526 29283 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 6.87974 ms per sample.
The test results above are based on version 2.
I0920 09:15:41.505820 25415 ernie_tester.cc:345] Run 5010 samples, average latency: 6.92085 ms per sample.
I0920 09:15:41.505903 25415 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 6.51521 ms per sample.
The test results above are based on version 4.
Implement the FC GPU kernel: https://github.com/PaddlePaddle/Paddle/pull/19687. FC fusion pass: https://github.com/PaddlePaddle/Paddle/pull/19733

- fc = mul + elementwise_add + relu; the main point is to fuse the elementwise_add and relu computations into fc, as sketched below.
- Test result, speedup: 2.1%
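A short numpy sketch of the fused semantics (an illustration, not the op's source; `with_relu` mirrors the flag printed by fc_fuse_pass in the log below):

```python
import numpy as np

# One op computes relu(X @ W + b) instead of separate mul,
# elementwise_add and relu ops.
def fused_fc(X, W, b, with_relu=True):
    y = X @ W + b                         # mul + elementwise_add
    return np.maximum(y, 0.0) if with_relu else y
```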
--- Running IR pass [fc_fuse_pass]
--- detected 12 subgraphs
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0909 06:52:40.161475 7826 fc_fuse_pass.cc:122] with_relu: 1, found_fc_count: 12
--- detected 62 subgraphs
I0909 06:52:40.214682 7826 fc_fuse_pass.cc:122] with_relu: 0, found_fc_count: 74
.
.
.
I0909 06:51:05.903086 7821 ernie_tester.cc:345] Run 5010 samples, average latency: 7.19529 ms per sample.
I0909 06:51:05.903168 7821 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 6.81763 ms per sample.
- Profile
-------------------------> Profiling Report <-------------------------
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::fc 370666 37927.8 16796.211849 (0.442847) 21131.552513 (0.557153) 0.038417 81.8839 0.102323 0.511083
thread0::matmul 125225 7542.57 4853.428136 (0.643472) 2689.138129 (0.356528) 0.028322 6.91776 0.0602321 0.101637
thread0::layer_norm 125225 6785.64 3458.361240 (0.509659) 3327.277901 (0.490341) 0.046506 6.78461 0.0541876 0.0914376
thread0::transpose2 240432 6648.82 4993.952105 (0.751103) 1654.869939 (0.248897) 0.022683 6.97236 0.0276536 0.089594
thread0::elementwise_add 190342 4846.03 3903.958820 (0.805599) 942.072546 (0.194401) 0.018709 7.06821 0.0254596 0.0653011
thread0::softmax 60108 4155.86 1873.948821 (0.450917) 2281.914897 (0.549083) 0.062806 7.22235 0.0691399 0.056001
thread0::reshape2 240432 2117.24 2117.212819 (0.999986) 0.030675 (0.000014) 0.006925 2.54413 0.008806 0.0285302
thread0::scale 70126 1639.13 1402.350356 (0.855547) 236.777127 (0.144453) 0.017914 3.11279 0.023374 0.0220875
thread0::stack 5009 651.86 601.077905 (0.922097) 50.782113 (0.077903) 0.106259 4.12809 0.130138 0.00878393
thread0::lookup_table 15027 479.71 388.084297 (0.808999) 91.625230 (0.191001) 0.024586 0.085041 0.0319232 0.00646417
thread0::fetch 5009 333.739 333.664085 (0.999776) 0.074696 (0.000224) 0.054763 2.98519 0.0666278 0.00449719
thread0::TensorCopy:GPU->CPU 5009 264.276 264.216676 (0.999777) 0.059003 (0.000223) 0.043246 2.97079 0.0527602 0.00356116
thread0::GpuMemcpySync:GPU->CPU 5009 244.643 235.804134 (0.963870) 8.839005 (0.036130) 0.04009 1.58306 0.0488407 0.00329661
thread0::GpuMemcpyAsync:CPU->GPU 20036 226.793 195.272127 (0.861014) 31.521131 (0.138986) 0.007882 1.17277 0.0113193 0.00305608
thread0::slice 5009 147.886 130.195979 (0.880383) 17.689569 (0.119617) 0.025414 0.103978 0.029524 0.00199278
thread0::tanh 5009 122.009 112.002519 (0.917989) 10.006004 (0.082011) 0.019717 0.094528 0.0243579 0.00164409
thread0::feed 20036 76.5896 76.581383 (0.999892) 0.008248 (0.000108) 0.002877 0.308487 0.0038226 0.00103206
fc + elementwise_add + layer_norm fusion
--- Running IR pass [fc_elementwise_layernorm_fuse_pass]
--- detected 24 subgraphs
.
.
.
I0911 06:41:55.494792 7238 ernie_tester.cc:345] Run 5010 samples, average latency: 6.92849 ms per sample.
I0911 06:41:55.494873 7238 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 6.54723 ms per sample.
- Profile
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::fc 255459 24269.7 12361.669096 (0.509346) 11908.043273 (0.490654) 0.039433 100.758 0.0950043 0.346056
thread0::fused_fc_elementwise_layernorm 115207 17699.9 6439.933305 (0.363841) 11259.921061 (0.636159) 0.103025 6.90034 0.153635 0.252378
thread0::matmul 125225 7670.31 4985.129880 (0.649925) 2685.183125 (0.350075) 0.029412 6.91239 0.0612522 0.109369
thread0::transpose2 240432 6889.05 5254.959850 (0.762799) 1634.087984 (0.237201) 0.023907 6.72677 0.0286528 0.0982293
thread0::softmax 60108 4204.46 1927.686603 (0.458486) 2276.771181 (0.541514) 0.064249 2.83378 0.0699484 0.0599504
thread0::reshape2 240432 2228.86 2228.844417 (0.999994) 0.014348 (0.000006) 0.007575 3.75591 0.00927023 0.0317808
thread0::elementwise_add 75135 2160.14 1641.503732 (0.759904) 518.641058 (0.240096) 0.019577 6.72642 0.0287502 0.030801
thread0::scale 70126 1736.05 1498.515875 (0.863178) 237.529504 (0.136822) 0.018853 6.66338 0.0247561 0.0247539
thread0::stack 5009 664.48 615.254962 (0.925920) 49.224601 (0.074080) 0.10776 3.887 0.132657 0.00947466
thread0::layer_norm 10018 592.675 324.577447 (0.547648) 268.097787 (0.452352) 0.050985 6.49622 0.059161 0.00845082
thread0::lookup_table 15027 499.198 412.769395 (0.826866) 86.428197 (0.173134) 0.025502 6.66372 0.03322 0.00711794
thread0::fetch 5009 362.799 362.676040 (0.999660) 0.123295 (0.000340) 0.05508 0.625043 0.0724295 0.00517307
thread0::TensorCopy:GPU->CPU 5009 297.886 297.782095 (0.999652) 0.103633 (0.000348) 0.043194 0.612426 0.0594701 0.00424748
thread0::GpuMemcpySync:GPU->CPU 5009 278.55 270.104589 (0.969679) 8.445858 (0.030321) 0.039548 0.609273 0.05561 0.00397179
thread0::GpuMemcpyAsync:CPU->GPU 20036 230.007 198.920144 (0.864846) 31.086396 (0.135154) 0.008387 0.708778 0.0114797 0.00327961
thread0::slice 5009 152.385 135.526127 (0.889365) 16.859146 (0.110635) 0.026901 0.06877 0.0304223 0.00217283
thread0::tanh 5009 122.456 112.432469 (0.918148) 10.023237 (0.081852) 0.020671 0.070096 0.0244471 0.00174607
thread0::feed 20036 73.4209 73.411569 (0.999873) 0.009355 (0.000127) 0.002908 0.042732 0.00366445 0.00104689
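What fused_fc_elementwise_layernorm computes, as a numpy sketch (reference semantics only, not the CUDA kernel):

```python
import numpy as np

# fc, the residual add and layer_norm run as one op, so the intermediate
# tensors never round-trip through global memory between kernels.
def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def fused_fc_elementwise_layernorm(X, W, b, Y, gamma, beta):
    return layer_norm(X @ W + b + Y, gamma, beta)   # fc -> add -> layer_norm
```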
multi-head attention fusion

- Before
- After
--- Running analysis [ir_graph_build_pass]
--- Running analysis [ir_graph_clean_pass]
--- Running analysis [ir_analysis_pass]
--- Running IR pass [cudnn_placement_pass]
--- Running IR pass [is_test_pass]
--- Running IR pass [simplify_with_basic_ops_pass]
--- Running IR pass [conv_affine_channel_fuse_pass]
--- Running IR pass [conv_eltwiseadd_affine_channel_fuse_pass]
--- Running IR pass [conv_bn_fuse_pass]
--- Running IR pass [conv_eltwiseadd_bn_fuse_pass]
--- Running IR pass [multihead_matmul_fuse_pass]
I1120 01:40:10.299883 6510 graph_pattern_detector.cc:97] --- detected 12 subgraphs
--- Running IR pass [fc_fuse_pass]
I1120 01:40:10.344569 6510 graph_pattern_detector.cc:97] --- detected 12 subgraphs
I1120 01:40:10.357105 6510 graph_pattern_detector.cc:97] --- detected 26 subgraphs
--- Running IR pass [fc_elementwise_layernorm_fuse_pass]
I1120 01:40:10.397455 6510 graph_pattern_detector.cc:97] --- detected 24 subgraphs
--- Running IR pass [conv_elementwise_add_act_fuse_pass]
--- Running IR pass [conv_elementwise_add2_act_fuse_pass]
--- Running IR pass [conv_elementwise_add_fuse_pass]
--- Running IR pass [transpose_flatten_concat_fuse_pass]
--- Running IR pass [runtime_context_cache_pass]
--- Running analysis [ir_params_sync_among_devices_pass]
I1120 01:40:10.509421 6510 ir_params_sync_among_devices_pass.cc:41] Sync params from CPU to GPU
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [ir_graph_to_program_pass]
I1120 01:40:10.758424 6510 analysis_predictor.cc:474] ======= optimize end =======
I1120 01:40:11.764863 6510 ernie_tester.cc:210] Load 5010 samples from /data/ernie/seq128_data/test_ds
W1120 01:40:12.444176 6510 device_context.cc:236] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.1, Runtime API Version: 9.0
.
.
.
I1120 01:40:43.862216 6510 ernie_tester.cc:345] Run 5010 samples, average latency: 6.40495 ms per sample.
I1120 01:40:43.862303 6510 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 5.74544 ms per sample.
- Profile
-------------------------> Profiling Report <-------------------------
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::fused_fc_elementwise_layernorm 115207 17021.4 5933.953144 (0.348617) 11087.450962 (0.651383) 0.09927 8.71612 0.147746 0.259676
thread0::multihead_matmul 60108 11252.9 7224.504494 (0.642012) 4028.405127 (0.357988) 0.169878 6.57812 0.187212 0.171673
thread0::fc 75135 10707.1 4357.612778 (0.406985) 6349.445604 (0.593015) 0.038466 85.6766 0.142504 0.163345
thread0::mul 180324 9774.02 4994.708598 (0.511019) 4779.307233 (0.488981) 0.047804 9.68787 0.0542025 0.149111
thread0::fetch 5009 4630.95 4629.392276 (0.999664) 1.554398 (0.000336) 0.055675 1.92049 0.924525 0.0706489
thread0::TensorCopy:GPU->CPU 5009 4573.44 4571.905707 (0.999664) 1.538373 (0.000336) 0.04468 1.90973 0.913045 0.0697717
thread0::GpuMemcpySync:GPU->CPU 5009 4553.78 4543.954556 (0.997843) 9.821141 (0.002157) 0.040962 1.9062 0.909119 0.0694716
thread0::stack 5009 684.909 634.119137 (0.925844) 50.789981 (0.074156) 0.106366 2.25957 0.136736 0.0104489
thread0::layer_norm 10018 549.522 290.654864 (0.528923) 258.867095 (0.471077) 0.048854 0.142508 0.0548535 0.00838342
thread0::lookup_table 15027 479.099 392.001286 (0.818205) 87.097997 (0.181795) 0.024349 9.00457 0.0318826 0.00730906
thread0::elementwise_add 15027 345.643 302.006333 (0.873752) 43.636814 (0.126248) 0.017625 3.1705 0.0230015 0.00527307
thread0::GpuMemcpyAsync:CPU->GPU 20036 229.869 198.181250 (0.862150) 31.687294 (0.137850) 0.008188 2.08116 0.0114728 0.00350684
thread0::scale 10018 224.898 210.406303 (0.935565) 14.491370 (0.064435) 0.018757 0.143304 0.0224494 0.003431
thread0::slice 5009 169.833 151.921321 (0.894535) 17.911378 (0.105465) 0.030184 0.072767 0.0339055 0.00259094
thread0::matmul 5009 163.061 149.402692 (0.916239) 13.658130 (0.083761) 0.028052 0.956259 0.0325536 0.00248763
thread0::tanh 5009 116.672 106.307159 (0.911166) 10.364377 (0.088834) 0.020194 0.067294 0.0232924 0.00177992
thread0::feed 20036 71.6392 71.630728 (0.999882) 0.008476 (0.000118) 0.002808 0.034718 0.00357552 0.00109292
- Timeline: the fetch op is waiting for the GPU computation to finish.
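For completeness, the attention core that the multihead_matmul subgraph replacement covers can be sketched as follows (a hedged numpy sketch of the replaced subgraph's semantics; the fused op's actual interface may differ):

```python
import numpy as np

# Per-head Q@K^T, scale plus attention-mask add, softmax, then @V.
def multihead_attention(Q, K, V, mask, scale):
    # Q, K, V: [B, heads, S, dh]; mask broadcastable to [B, heads, S, S]
    logits = Q @ K.transpose(0, 1, 3, 2) * scale + mask
    logits -= logits.max(-1, keepdims=True)        # numerically stable softmax
    probs = np.exp(logits)
    probs /= probs.sum(-1, keepdims=True)
    return probs @ V                               # [B, heads, S, dh]
```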
QA test results

- Test date: 2019-12-24
- Test environment:
  - CUDA 9.0
  - cuDNN 7.3
  - Paddle inference library version: a5a8d14414213fadcfcd7dc60c794d1a515a390e
  - Test image: paddlepaddle/paddle_manylinux_devel:cuda9.0_cudnn7
  - GPU: Tesla P4 8G (7611 MB)
  - GCC version: gcc 4.8.2 (from the image)
- Conclusions:
  - ERNIE GPU inference performance meets expectations.
  - For the same data, GPU and CPU prediction results differ by less than 1e-5.
  - Repeated GPU predictions over the same dataset differ by less than 1e-6.
- Performance data:
- Seqlen128-data

batch_size | average_latency (ms) | batch_latency (ms) | gpu_memory (MB)
---|---|---|---
1 | 8.664 | 8.664 | 737
10 | 6.388 | 63.878 | 825
20 | 6.207 | 124.149 | 919
30 | 6.530 | 195.892 | 985
40 | 6.456 | 258.260 | 1085
50 | 6.445 | 322.275 | 1135
60 | 6.430 | 385.805 | 1279
70 | 6.457 | 451.965 | 1283
80 | 6.415 | 513.178 | 1339
90 | 6.481 | 583.301 | 1431
100 | 6.504 | 650.428 | 1485

- Seqlen512-data

batch_size | average_latency (ms) | batch_latency (ms) | gpu_memory (MB)
---|---|---|---
1 | 36.123 | 36.123 | 781
10 | 31.663 | 316.630 | 1191
20 | 31.750 | 635.002 | 1637
30 | 32.280 | 968.412 | 2053
40 | 32.130 | 1285.212 | 2431
50 | 32.917 | 1645.835 | 2847
100 | 34.254 | 3425.380 | 4951
200 | out of GPU memory | |
300 | out of GPU memory | |
二期优化工作
-
负责人: @NHZlX
-
GPU平台信息:Tesla P4
-
软件信息:
- CUDA 10.0
-
优化工作
Version | P4 time (ms) | Date recorded | PR | Description | Speedup
---|---|---|---|---|---
0 | 10 | 2020-02-04 | - | baseline; the CUDA 10 build is slower than the CUDA 9 build | -
1 | 9.5 | 2020-02-04 | ~~https://github.com/PaddlePaddle/Paddle/pull/22424~~ https://github.com/PaddlePaddle/Paddle/pull/22486 | Also fuse the 3 FCs into multihead_matmul_op, merging the 3 small GEMMs into one large GEMM | 5%
2 | 8.4 | 2020-02-04 | ~~https://github.com/PaddlePaddle/Paddle/pull/22424~~ https://github.com/PaddlePaddle/Paddle/pull/22494 | Fuse the 3 lookup_tables + elementwise_add + layer_norm into fused_embedding_eltwise_layernorm_op (sketched after this table) | 11.6%
3 | 7.8 | 2020-02-05 | | In the layer_norm computation of fused_fc_elementwise_layernorm_op, change the in-block reduce from double to float | |
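A numpy sketch of what version 2's fused_embedding_eltwise_layernorm_op computes (reference semantics only; the three embedding-table names are illustrative):

```python
import numpy as np

# Three lookup_tables, an elementwise sum and a layer_norm in one kernel.
def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def fused_embedding_eltwise_layernorm(word_ids, pos_ids, sent_ids,
                                      word_emb, pos_emb, sent_emb,
                                      gamma, beta):
    # numpy fancy indexing stands in for the three lookup_table ops
    x = word_emb[word_ids] + pos_emb[pos_ids] + sent_emb[sent_ids]
    return layer_norm(x, gamma, beta)
```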