Yiqun Liu comments

Results: 57 comments of Yiqun Liu

Optimize inference performance of ERNIE on P40 GPU

#### fc+elementwise_add+layer_norm fusion ``` --- Running IR pass [fc_elementwise_layernorm_fuse_pass] --- detected 24 subgraphs . . . I0911 06:41:55.494792 7238 ernie_tester.cc:345] Run 5010 samples, average latency: 6.92849 ms per sample. I0911 06:41:55.494873...

Optimize inference performance of ERNIE on P40 GPU

#### multi-head attention fusion - Before ![image](https://user-images.githubusercontent.com/12538138/69202810-9e20f900-0b7d-11ea-8e31-f1de4c2a061d.png) - After ![image](https://user-images.githubusercontent.com/12538138/69202845-be50b800-0b7d-11ea-8d09-6a98cfb47054.png) ``` --- Running analysis [ir_graph_build_pass] --- Running analysis [ir_graph_clean_pass] --- Running analysis [ir_analysis_pass] --- Running IR pass [cudnn_placement_pass] --- Running IR...

Optimize inference performance of ERNIE on P40 GPU

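## QA test results
- Test date: December 24, 2019
- Test environment:
  - CUDA 9.0
  - CUDNN 7.3
  - Paddle inference library version: a5a8d14414213fadcfcd7dc60c794d1a515a390e
  - Test image: paddlepaddle/paddle_manylinux_devel:cuda9.0_cudnn7
  - GPU: Tesla P4 8G (7611MB)
  - GCC version: gcc482 from the image
- Test conclusions:
  1. ERNIE-GPU inference performance meets expectations
  2. ...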

Optimize inference performance of ERNIE on P40 GPU

# Phase 2 optimization work
- Owner: @NHZlX
- GPU platform: Tesla P4
- Software:
  - CUDA 10.0
- Optimization work

| Version | P4 time (ms) | Recorded on | PR | Version description | Speedup |
|---|---|---|---|---|---|
|...

MobilenetV1 profile for optimizing

API performance tests and profiling scripts for Paddle and TensorFlow can be written following [benchmark/api/paddle/abs.py](https://github.com/PaddlePaddle/benchmark/blob/master/api/paddle/abs.py) and [api/tensorflow/abs.py](https://github.com/PaddlePaddle/benchmark/blob/master/api/tensorflow/abs.py); the resulting test scripts should be submitted to the benchmark repository.

Optimize the performance of PyramidDNN on CPU

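#### search_pyramid_hash analysis
1. This op calls two SSE functions, `sse_axpy` and `sse_axpy_noadd`; we could try calling MKL instead. @luotao1
2. Compare this op's implementation with the competitor's implementation and identify the differences. @luotao1
3. Determine the framework overhead of this op. @zhaoyuchen2018

   ![image](https://user-images.githubusercontent.com/12538138/61928005-7643cf00-afa9-11e9-9192-343dacc5cf39.png)

   Profiling with gperftools shows that `bloomfilter_get` takes the largest share of the forward pass: 13% out of the 15%.
4. When only the forward network is run, the search_pyramid_hash op takes 7% less time than when the whole network is run. @zhaoyuchen2018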
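For readers unfamiliar with the two helpers in item 1, the sketch below shows the vector update each name suggests. `axpy` follows the standard BLAS definition (`y <- alpha * x + y`); the overwrite behaviour of the "noadd" variant is an assumption based on its name, and the Python function names are illustrative, not the op's actual code.

```python
def axpy(alpha, x, y):
    """y <- alpha * x + y (standard BLAS axpy semantics), updated in place."""
    for i in range(len(x)):
        y[i] += alpha * x[i]
    return y


def axpy_noadd(alpha, x, y):
    """y <- alpha * x, overwriting y (assumed meaning of the "noadd" variant)."""
    for i in range(len(x)):
        y[i] = alpha * x[i]
    return y


x = [1.0, 2.0, 3.0]
print(axpy(2.0, x, [10.0, 10.0, 10.0]))        # [12.0, 14.0, 16.0]
print(axpy_noadd(2.0, x, [10.0, 10.0, 10.0]))  # [2.0, 4.0, 6.0]
```

Both are level-1 BLAS-style operations, which is why swapping the hand-written SSE loops for a tuned library routine is a plausible experiment.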

Optimize the performance of PyramidDNN on CPU

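#### lookup_table & sequence_pool optimization analysis @intel
- Code paths taken
  - `lookup_table`: takes the LoDTensor path; the padding_idx branch within it is hit only a few times
  - `lookup_table_grad`: takes the sparse, non-grad_inplace path
  - `sequence_pool`: takes the SUM path
- Conclusion: follow the competitor's approach
  1. Fuse embedding and sequence pool (sum)
  2. Implement the fused op with a (sparse) GEMM call
- Detailed analysis:
  - embedding (`lookup_table`)...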
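The idea of fusing the embedding lookup with sum pooling and expressing it as a sparse GEMM rests on a simple identity: summing the embedding rows of a sequence equals multiplying a per-sequence id-count matrix by the embedding table. The following is a minimal pure-Python sketch of that identity on a toy table. The function names and data are hypothetical, and this is not Paddle's or the competitor's implementation.

```python
def lookup_then_sum_pool(table, sequences):
    """Unfused reference: look up each id, then sum the rows per sequence."""
    dim = len(table[0])
    pooled = []
    for ids in sequences:
        acc = [0.0] * dim
        for i in ids:
            row = table[i]
            for d in range(dim):
                acc[d] += row[d]
        pooled.append(acc)
    return pooled


def fused_sparse_gemm(table, sequences):
    """Fused form: pooled = C @ table, where C[s][v] counts id v in sequence s.

    C is kept sparse as one {id: count} dict per sequence, so only its
    nonzero entries take part in the product.
    """
    dim = len(table[0])
    pooled = []
    for ids in sequences:
        counts = {}
        for i in ids:
            counts[i] = counts.get(i, 0) + 1
        acc = [0.0] * dim
        for i, c in counts.items():
            row = table[i]
            for d in range(dim):
                acc[d] += c * row[d]
        pooled.append(acc)
    return pooled


table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
sequences = [[0, 2, 2], [1], [3, 0]]
assert lookup_then_sum_pool(table, sequences) == fused_sparse_gemm(table, sequences)
```

In the fused form the intermediate per-token embedding tensor is never materialized, which is the main saving one would expect from such a fusion.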

Optimize the performance of PyramidDNN on CPU

#### Determine framework overhead, @zhaoyuchen2018
- Test method:
- Test conclusion:

Optimize the performance of Transformer-Big on 1 V100 GPU

#### Profile and Timeline analysis results ``` Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio. GpuMemcpyAsync:CPU->GPU 480 7787.05 7389.604243 (0.948961) 397.441730 (0.051039) 0.013041 420.263 16.223 0.133052 BufferedReader:MemoryCopy 20 7303.63...

Optimize the performance of Transformer-Big on 1 V100 GPU

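Question: The test script sets `--fetch_steps 100`. Does that mean a fetch happens only once every 100 steps? If a fetch were done on every step, would the speed be affected? How does the competitor fetch?

Answer from @guoshengCS: Setting `--fetch_steps 100` has a large impact on 8-GPU training speed, but `--fetch_steps 5` and `--fetch_steps 100` give roughly the same results. **The impact on single-GPU training should be small; this needs to be confirmed.**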
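To make the question concrete, here is a minimal sketch of a loop that fetches only every `fetch_steps` steps. It assumes the flag means "fetch once every N steps", which is the reading the question asks to confirm. `train`, `fake_run_step`, and the `fetch` argument are hypothetical stand-ins, not the actual test script.

```python
def train(num_steps, fetch_steps, run_step):
    """Run num_steps steps, requesting the loss only every fetch_steps steps."""
    fetched = []
    for step in range(1, num_steps + 1):
        do_fetch = (step % fetch_steps == 0)
        loss = run_step(step, fetch=do_fetch)
        if do_fetch:
            fetched.append((step, loss))
    return fetched


def fake_run_step(step, fetch):
    # Stand-in for one executor run. A real fetch would copy the loss from
    # device to host and synchronize, which is the cost under discussion.
    return 1.0 / step if fetch else None


print(len(train(200, 100, fake_run_step)))  # 2
print(len(train(200, 5, fake_run_step)))    # 40
```

Under this reading, a smaller `fetch_steps` only increases the number of device-to-host synchronizations, so its cost depends on how expensive each fetch is relative to a training step.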