Yiqun Liu comments

Results: 57 comments of Yiqun Liu

How can I download other pretrained simnet models?

Hello, similarity_net currently only provides a pretrained model for simnet_bow-pairwise.

Optimize inference performance of ERNIE on P40 GPU

### Profile results ``` -------------------------> Profiling Report CPU 5009 210.123 210.076980 (0.999782) 0.045869 (0.000218) 0.036519 0.996315 0.0419491 0.00221879 thread0::GpuMemcpyAsync:CPU->GPU 20036 204.476 173.512282 (0.848571) 30.963505 (0.151429) 0.007 1.77937 0.0102054 0.00215916 thread0::GpuMemcpySync:GPU->CPU 5009...

Optimize inference performance of ERNIE on P40 GPU

### Timeline analysis
- Overall, the GPU is fairly well utilized. ![image](https://user-images.githubusercontent.com/12538138/63002175-eb6b3b80-bea7-11e9-8040-55c4a9e6607d.png)
- At the start of the run, the GPU sits idle for a considerable time. ![image](https://user-images.githubusercontent.com/12538138/63001798-fb365000-bea6-11e9-8afa-e8d9eaeef37a.png)
- The stack op contains cuMalloc and cuFree calls, plus two CUDA stream synchronizations. ![image](https://user-images.githubusercontent.com/12538138/63005874-f4600b00-beaf-11e9-965d-e066e35db0d8.png)
- There are many reshape and transpose ops.
- Does softmax use cuDNN?

Optimize inference performance of ERNIE on P40 GPU

### Optimization plan
- [ ] Fuse the multiple lookup_table ops, @Xreki, due 2019-08-30
  - One input is an id sequence like [00000,11111] (taking only the values 0 and 1); some implementations replace this part with matmul ![image](https://user-images.githubusercontent.com/12538138/63006717-ba900400-beb1-11e9-9de5-27c314cfd234.png)
- [ ] Determine whether the reshape ops in the model are necessary or can be removed. Many reshape ops appear to have identical input and output shapes. @Xreki, see https://github.com/PaddlePaddle/benchmark/issues/165#issuecomment-521229670
  - **They cannot simply be removed** ![image](https://user-images.githubusercontent.com/12538138/63010791-9fc18d80-beb9-11e9-88ed-f255cde19ce1.png)
- [x] At inference time, dropout can be converted to scale, @Xreki, due 2019-08-22, see https://github.com/PaddlePaddle/Paddle/pull/19297
-...
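Two items in the plan rest on simple equivalences, which the NumPy sketch below illustrates. It is not the code from the linked PRs. It assumes a hypothetical 2-row embedding table for the 0/1 ids, and it assumes the dropout variant that scales activations by `1 - dropout_prob` at inference time; all function names are made up for illustration.

```python
import numpy as np

def lookup(table, ids):
    """Plain embedding lookup: select row ids[i] of the table for each position."""
    return table[ids]

def lookup_as_matmul(table, ids):
    """Equivalent for small id ranges: one-hot encode the ids, then multiply."""
    one_hot = np.eye(table.shape[0], dtype=table.dtype)[ids]  # [seq_len, num_rows]
    return one_hot @ table                                     # [seq_len, hidden]

def dropout_infer_as_scale(x, dropout_prob):
    """Assumed inference behaviour: no masking, only a constant scale."""
    return x * (1.0 - dropout_prob)

rng = np.random.default_rng(0)
table = rng.standard_normal((2, 8))             # hypothetical table with 2 rows
ids = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # ids like [00000,11111]

same = np.allclose(lookup(table, ids), lookup_as_matmul(table, ids))
scaled = dropout_infer_as_scale(np.ones(4), 0.1)
```

Whether the matmul form is actually faster depends on the kernel and the table size, so it would need to be measured rather than assumed.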

Optimize inference performance of ERNIE on P40 GPU

### Some Intel-related work
- Remove the reshape and transpose ops in the attention module: https://github.com/PaddlePaddle/Paddle/pull/16342
- Extend matmul to support multi-head: https://github.com/PaddlePaddle/Paddle/pull/18570
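To make the idea concrete, here is a rough NumPy sketch of why a head-aware matmul can stand in for the reshape and transpose chain when computing attention scores. It does not reproduce the linked PRs; the shapes, the head count, and the function names are assumptions chosen for illustration.

```python
import numpy as np

def scores_with_transpose(q, k, num_heads):
    """Conventional path: reshape, transpose to [B, H, S, D], then batched matmul."""
    batch, seq_len, hidden = q.shape
    head_dim = hidden // num_heads
    qh = q.reshape(batch, seq_len, num_heads, head_dim).transpose(0, 2, 1, 3)
    kh = k.reshape(batch, seq_len, num_heads, head_dim).transpose(0, 2, 1, 3)
    return qh @ kh.transpose(0, 1, 3, 2)  # [B, H, S, S]

def scores_head_aware(q, k, num_heads):
    """Head-aware contraction: the head axis is handled inside the product,
    so no explicit transpose is materialized."""
    batch, seq_len, hidden = q.shape
    head_dim = hidden // num_heads
    q4 = q.reshape(batch, seq_len, num_heads, head_dim)  # view, no data movement
    k4 = k.reshape(batch, seq_len, num_heads, head_dim)
    return np.einsum("bqhd,bkhd->bhqk", q4, k4)

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 16, 48))  # hypothetical [batch, seq_len, hidden]
k = rng.standard_normal((1, 16, 48))
a = scores_with_transpose(q, k, num_heads=4)
b = scores_head_aware(q, k, num_heads=4)
```

Both functions return a `[batch, heads, seq_len, seq_len]` tensor with the same values; the second simply leaves the layout handling to the contraction.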

Optimize inference performance of ERNIE on P40 GPU

#### Determine whether the reshape ops in the model are necessary or can be removed. Many reshape ops appear to have identical input and output shapes. @Xreki
![image](https://user-images.githubusercontent.com/12538138/63010791-9fc18d80-beb9-11e9-88ed-f255cde19ce1.png) ```text I0814 10:13:53.083331 11085 operator.cc:170] CPUPlace Op(reshape2), inputs:{Shape[], ShapeTensor[], X[fc_0.tmp_1:float[1, 128, 768]({})]}, outputs:{Out[fc_0.tmp_1:float[1, 128, 768]({})], XShape[reshape2_0.tmp_0:[-1]({{}})]}. I0814 10:13:53.083364 11085 operator.cc:993] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CPUPlace]:library_type[PLAIN] I0814 10:13:53.083400 11085 tensor_util.cu:28] TensorCopy 1,...

Optimize inference performance of ERNIE on P40 GPU

### Summary of optimization results
- Unit: ms/sample

| Version | P40 time (Total) | Time (excluding the first run) | Test date | PR | Version description | Speedup |
|---|---|---|---|---|---|---|
| 0 | 8.36 | | 2019-08-14 | - | baseline...

Optimize inference performance of ERNIE on P40 GPU

On top of version 1, `fuse_elewise_add_act_pass` was enabled for inference. It fuses the `elementwise_add` + `relu` computation and successfully matched 14 subgraphs, but overall performance dropped. ```text + /paddle/build_paddle/build_docker_manylinux_cuda90/paddle/fluid/inference/tests/api/samples/ernie_tester --logtostderr --model_dir=/data/ernie/model --data=/data/ernie/seq128_data/test_ds --repeat=1 --warmup_steps=1 --use_gpu=true --use_analysis=true --print_outputs=false --profile=false [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from...
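For readers unfamiliar with what "matching a subgraph" means here, the toy sketch below finds `elementwise_add` ops whose only consumer is a `relu`. It is a simplified illustration on a hypothetical list-of-dicts graph, not Paddle's pass implementation, and every variable name in it is invented.

```python
def match_add_relu(ops):
    """Return (add_index, relu_index) pairs where relu is the sole consumer
    of the elementwise_add output, so the intermediate can be fused away."""
    matches = []
    for i, op in enumerate(ops):
        if op["type"] != "elementwise_add":
            continue
        consumers = [j for j, other in enumerate(ops)
                     if op["output"] in other["inputs"]]
        if len(consumers) == 1 and ops[consumers[0]]["type"] == "relu":
            matches.append((i, consumers[0]))
    return matches

# Hypothetical op list: one add feeds relu, another add feeds layer_norm.
ops = [
    {"type": "mul", "inputs": ["x", "w"], "output": "t0"},
    {"type": "elementwise_add", "inputs": ["t0", "b"], "output": "t1"},
    {"type": "relu", "inputs": ["t1"], "output": "t2"},
    {"type": "elementwise_add", "inputs": ["t2", "x"], "output": "t3"},
    {"type": "layer_norm", "inputs": ["t3"], "output": "t4"},
]
matches = match_add_relu(ops)
```

A successful match only says the pattern exists; as the comment above reports, it does not guarantee that the fused kernel is faster.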

Optimize inference performance of ERNIE on P40 GPU

`stack` and `concat` have similar functionality. Attempts:
#### Method 1: replace `stack` with `concat`. This fails at run time because `stack` and `concat` produce outputs with different dimensions, so it is not feasible
- Output dimensions of `stack` ``` I0820 09:17:21.934635 17390 operator.cc:168] CUDAPlace(0) Op(stack), inputs:{X[scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}),...
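The dimension mismatch can be reproduced with NumPy's analogous functions. This is only a sketch: the `[1, 128, 128]` input shape comes from the log above, while the number of inputs and the axis are assumptions, since the log is truncated.

```python
import numpy as np

n = 12  # hypothetical input count; the log above is cut off
inputs = [np.full((1, 128, 128), float(i)) for i in range(n)]

# stack inserts a new axis, so the rank grows by one.
stacked = np.stack(inputs, axis=1)              # shape (1, 12, 128, 128)

# concatenate joins along an existing axis, so the rank stays the same.
concatenated = np.concatenate(inputs, axis=1)   # shape (1, 1536, 128)

# With these assumed axes, a reshape recovers the stacked layout.
restored = concatenated.reshape(1, n, 128, 128)
```

Under these assumptions, `concat` followed by a reshape yields the same tensor as `stack`; whether that combination is what the comment goes on to try is not visible in the truncated text.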

Optimize inference performance of ERNIE on P40 GPU

**Implement a GPU kernel for fc**: https://github.com/PaddlePaddle/Paddle/pull/19687
**FC fusion**: https://github.com/PaddlePaddle/Paddle/pull/19733
- fc = mul + elementwise_add + relu; the main change is to fuse the elementwise_add and relu computations inside fc.
- Test result, speedup: 2.1% ```text --- Running IR pass [fc_fuse_pass] --- detected 12 subgraphs WARNING: Logging before InitGoogleLogging() is written to STDERR I0909 06:52:40.161475 7826 fc_fuse_pass.cc:122]...
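The decomposition fc = mul + elementwise_add + relu can be sketched in NumPy as follows. This is a CPU illustration of the idea only, not the GPU kernel from the PRs: the fused version applies the bias and relu to each accumulated value before it is stored, which is the part a fused kernel saves in memory traffic. Shapes and function names are hypothetical.

```python
import numpy as np

def fc_unfused(x, w, b):
    """Three separate steps, each producing a full intermediate tensor."""
    out = x @ w                   # mul
    out = out + b                 # elementwise_add
    return np.maximum(out, 0.0)   # relu

def fc_fused(x, w, b):
    """Bias add and relu applied to each result before its single store."""
    m, n = x.shape[0], w.shape[1]
    out = np.empty((m, n), dtype=x.dtype)
    for i in range(m):
        for j in range(n):
            acc = x[i] @ w[:, j] + b[j]           # dot product plus bias
            out[i, j] = acc if acc > 0.0 else 0.0  # relu before writing
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6))   # hypothetical [batch, in_features]
w = rng.standard_normal((6, 5))   # hypothetical [in_features, out_features]
b = rng.standard_normal(5)
unfused = fc_unfused(x, w, b)
fused = fc_fused(x, w, b)
```

The Python loops are slow and exist only to show where the bias and activation are applied; the numerical results match the unfused path.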