Yiqun Liu comments

Results: 57 comments of Yiqun Liu

Optimize the performance of Transformer-Big on 1 V100 GPU

#### CPU -> GPU data copy analysis

#### Analysis method

- Add logging in `fluid/memory/memcpy.cc`
- Set `export GLOG_v=4` at runtime
- Set `exec_strategy.num_threads=1`
- Result:

```
I0719 10:14:50.539557 70931 operator.cc:169] CUDAPlace(0) Op(increment), inputs:{X[@LR_DECAY_COUNTER@:int64_t[1]({})]}, outputs:{Out[@LR_DECAY_COUNTER@:int64_t[1]({})]}.
I0719 10:14:50.539577 70931 operator.cc:1011] expected_kernel_key:data_type[int64_t]:data_layout[ANY_LAYOUT]:place[CPUPlace]:library_type[PLAIN]
I0719...
```
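With `GLOG_v=4` the log grows quickly, so a small script can help narrow down where to look. The following is a minimal sketch, not part of the original analysis: it pairs each `Op(...)` line with the `expected_kernel_key` line that follows it and reports ops whose kernel place differs from the place the op was run on. The helper name `find_place_mismatches` is hypothetical, and treating a place mismatch as a hint of a CPU/GPU copy is an assumption; the log added in `memcpy.cc` remains the authoritative signal.

```python
import re

# Two log lines copied from the comment above.
LOG = """\
I0719 10:14:50.539557 70931 operator.cc:169] CUDAPlace(0) Op(increment), inputs:{X[@LR_DECAY_COUNTER@:int64_t[1]({})]}, outputs:{Out[@LR_DECAY_COUNTER@:int64_t[1]({})]}.
I0719 10:14:50.539577 70931 operator.cc:1011] expected_kernel_key:data_type[int64_t]:data_layout[ANY_LAYOUT]:place[CPUPlace]:library_type[PLAIN]
"""

# Matches "... ] CUDAPlace(0) Op(increment)" and captures the place and op name.
OP_RE = re.compile(r"\] (\w+Place(?:\(\d+\))?) Op\((\w+)\)")
# Captures the place inside "expected_kernel_key:...:place[CPUPlace]:...".
KEY_RE = re.compile(r"expected_kernel_key:.*?place\[(\w+)\]")


def find_place_mismatches(log_text):
    """Return (op, run_place, kernel_place) tuples where the places differ."""
    mismatches = []
    current = None  # (op_name, run_place) from the most recent Op(...) line
    for line in log_text.splitlines():
        op_match = OP_RE.search(line)
        if op_match:
            current = (op_match.group(2), op_match.group(1))
            continue
        key_match = KEY_RE.search(line)
        if key_match and current is not None:
            op_name, run_place = current
            kernel_place = key_match.group(1)
            # Compare place kinds only: "CUDAPlace(0)" -> "CUDAPlace".
            if run_place.split("(")[0] != kernel_place:
                mismatches.append((op_name, run_place, kernel_place))
            current = None
    return mismatches


if __name__ == "__main__":
    print(find_place_mismatches(LOG))
    # [('increment', 'CUDAPlace(0)', 'CPUPlace')]
```

For the two lines shown, the sketch flags `increment`, whose kernel is selected for `CPUPlace` while the op is logged under `CUDAPlace(0)`.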

modify is_ampere_gpu in api/common/launch.py

@Courtesy-Xs The format check did not pass. You need to:

- Install the format-checking tool with `pip install pre-commit`
- Run `pre-commit install` in the benchmark directory

![image](https://user-images.githubusercontent.com/12538138/148510419-f2d9d5d5-e0db-4b60-b246-b4f3936dfe02.png)

Also, please sign the CLA as prompted.

The design and optimization of API Benchmark

### Program = feed + abs + fetch

- Profile data

```
-------------------------> Profiling Report CPU 10 65.3952 39.898246 (0.610110) 25.496945 (0.389890) 6.46307 6.75515 6.53952 0.434411 thread0::fetch 10 42.5865 37.686449 (0.884939)...
```

Optimize the performance of seq2seq model on GPU

#### Profile results

```
-------------------------> Profiling Report GPU 18475 334.162 303.616585 (0.908592) 30.544991 (0.091408) 0.013014 0.120487 0.0180872 0.0309226 elementwise_mul_grad 6734 256.62 238.810646 (0.930602) 17.808921 (0.069398) 0.027794 7.31021 0.038108 0.023747 elementwise_mul 6864...
```
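Because the report above is flattened and truncated, it may help to check how the numeric columns relate to each other. The sketch below assumes the column order is Calls, Total, CPU Time (Ratio), GPU Time (Ratio), Min, Max, Ave, Ratio; that order is an assumption based on the numbers themselves, since the header row is not present in the excerpt. It extracts only the numeric rows and does not try to attach event names, because their position relative to the rows is ambiguous here. The names `parse_rows` and `is_consistent` are hypothetical helpers.

```python
import re

# Flattened report text copied from the comment above (truncated in the source).
REPORT = (
    "GPU 18475 334.162 303.616585 (0.908592) 30.544991 (0.091408) "
    "0.013014 0.120487 0.0180872 0.0309226 elementwise_mul_grad "
    "6734 256.62 238.810646 (0.930602) 17.808921 (0.069398) "
    "0.027794 7.31021 0.038108 0.023747 elementwise_mul 6864"
)

NUM = r"(\d+(?:\.\d+)?)"
# One row: calls total cpu (cpu_ratio) gpu (gpu_ratio) min max ave ratio
ROW_RE = re.compile(
    rf"(?<![\d.])(\d+) {NUM} {NUM} \({NUM}\) {NUM} \({NUM}\) {NUM} {NUM} {NUM} {NUM}"
)
# Assumed column names; the header is missing from the excerpt.
FIELDS = ("calls", "total", "cpu", "cpu_ratio", "gpu", "gpu_ratio",
          "min", "max", "ave", "ratio")


def parse_rows(text):
    """Extract every complete numeric row as a dict keyed by FIELDS."""
    rows = []
    for match in ROW_RE.finditer(text):
        row = dict(zip(FIELDS, (float(v) for v in match.groups())))
        row["calls"] = int(row["calls"])
        rows.append(row)
    return rows


def is_consistent(row, rel_tol=1e-3):
    """Check cpu + gpu == total and total / calls == ave, within rel_tol."""
    split_ok = abs(row["cpu"] + row["gpu"] - row["total"]) <= rel_tol * row["total"]
    ave_ok = abs(row["total"] / row["calls"] - row["ave"]) <= rel_tol * row["ave"]
    return split_ok and ave_ok


if __name__ == "__main__":
    for row in parse_rows(REPORT):
        # total / ratio estimates the overall profiled time implied by this row.
        print(row["calls"], is_consistent(row), round(row["total"] / row["ratio"], 1))
```

Both complete rows pass the two checks, and dividing each row's total by its ratio gives nearly the same overall time, which supports the assumed column order.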

Optimize the performance of seq2seq model on GPU

#### Timeline analysis

- Step overview:
  - Does the model use two StaticRNN structures?
  - Obvious GPU idle time is visible.

  ![image](https://user-images.githubusercontent.com/12538138/62753853-ec632e00-ba9f-11e9-859b-67b9dd65f962.png) ![image](https://user-images.githubusercontent.com/12538138/62753894-22081700-baa0-11e9-9f1e-70bd9c843e2b.png)
- When recurrent starts, it has to create Operators and prepare the ExecutorPrepareContext. Each step uses its own step_scope and has to create Variables.

  ![image](https://user-images.githubusercontent.com/12538138/62753934-49f77a80-baa0-11e9-88e9-d2b2546215c5.png)
- GPU utilization is low both in the preparation part at the front and inside StaticRNN.

  ![image](https://user-images.githubusercontent.com/12538138/62754012-93e06080-baa0-11e9-823d-6e5a866eeaf2.png) ![image](https://user-images.githubusercontent.com/12538138/62754040-b3778900-baa0-11e9-9d54-2de2cea06d83.png)
- These are all elementwise computations and could be fused. A general method that supports this kind of fusion is under development.

  ![image](https://user-images.githubusercontent.com/12538138/62754083-f0dc1680-baa0-11e9-8b15-d7254b76f50f.png)
- rnn_memory_helper introduces many GPU-to-GPU memory copies; this should be optimized at the design level.

  ![image](https://user-images.githubusercontent.com/12538138/62754130-2c76e080-baa1-11e9-87d7-85b7d69a67fe.png)
- There is a synchronization in recurrent_grad that requires a long wait.

  ![image](https://user-images.githubusercontent.com/12538138/62754253-a4450b00-baa1-11e9-9425-1442ea2bdfeb.png)
- Gradient aggregation uses many sum_op instances (a PR is already optimizing this).

  ![image](https://user-images.githubusercontent.com/12538138/62754290-d22a4f80-baa1-11e9-8c74-2e5581eeb3e8.png)
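To illustrate what fusing elementwise computations means, here is a conceptual sketch in plain Python. It is not the fusion method mentioned in the list above, and the particular chain of ops (multiply, add, relu) is a hypothetical example. The unfused version makes one pass per op and materializes an intermediate each time, which is analogous to launching one small kernel per op; the fused version does the same arithmetic in a single pass.

```python
def unfused(x, y, z):
    """Three separate passes, each producing an intermediate list."""
    t0 = [a * b for a, b in zip(x, y)]   # elementwise multiply
    t1 = [a + c for a, c in zip(t0, z)]  # elementwise add
    return [max(v, 0.0) for v in t1]     # relu


def fused(x, y, z):
    """One pass with no intermediates, analogous to a single fused kernel."""
    return [max(a * b + c, 0.0) for a, b, c in zip(x, y, z)]


if __name__ == "__main__":
    x = [1.0, -2.0, 3.0]
    y = [0.5, 4.0, -1.0]
    z = [1.0, 1.0, 1.0]
    print(unfused(x, y, z) == fused(x, y, z))
```

On a GPU the benefit comes from fewer kernel launches and fewer round trips to memory, which matters most when each individual op is as short as the ones in the timeline.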

Running the competitor models with tensorflow-2.0alpha reports the following errors

Please don't post screenshots; they can't be read clearly.

Given some fusion examples.

### Fusing two reduce_sum ops with the same configuration

- Program

```cpp
var_0, var_1 = reduce_sum(x, dim=[1], keep_dim=false)
var_2 = identity(x)
var_3 = elementwise_mul(x, var_2, axis=-1)
var_4, var_5 = reduce_sum(var_3, dim=[1], keep_dim=false)
```

- The code currently generated for X86 is as follows

```cpp
I1124...
```
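The intent of this fusion can be sketched in plain Python. This is an illustration under stated assumptions, not the generated code: `x` is modeled as a 2-D list, each `reduce_sum` is modeled with a single output (the second outputs `var_1` and `var_5` in the program are not modeled), and the function names are hypothetical. Since both reductions use `dim=[1]` and `keep_dim=false`, they can share one loop over each row.

```python
def reduce_sum_rows(matrix):
    """Model of reduce_sum(matrix, dim=[1], keep_dim=false) on a 2-D list."""
    return [sum(row) for row in matrix]


def unfused(x):
    """Follow the program op by op, with one loop nest per op."""
    var_0 = reduce_sum_rows(x)
    var_2 = [list(row) for row in x]  # identity(x)
    var_3 = [[a * b for a, b in zip(r, r2)] for r, r2 in zip(x, var_2)]
    var_4 = reduce_sum_rows(var_3)
    return var_0, var_4


def fused(x):
    """Compute both reductions in a single pass over each row."""
    sums, square_sums = [], []
    for row in x:
        s = 0.0
        sq = 0.0
        for v in row:
            s += v        # contributes to reduce_sum(x)
            sq += v * v   # contributes to reduce_sum(x * identity(x))
        sums.append(s)
        square_sums.append(sq)
    return sums, square_sums


if __name__ == "__main__":
    x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    print(unfused(x))
    print(fused(x))
```

The fused version reads each element of `x` once instead of several times and needs no storage for `var_2` or `var_3`.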

AICamera android: cannot run normally, because there is no resize process and yuv2rgb function return negative value?

What kind of error did you encounter? We can build the project successfully; the only requirement is to put the Paddle library and the models in the right place. We provide...

AICamera android: cannot run normally, because there is no resize process and yuv2rgb function return negative value?

> After I read the project source code, I find that there is no resize process that deals with the difference in size between camera frame size and the model...

AICamera android: cannot run normally, because there is no resize process and yuv2rgb function return negative value?

Models should be put under `app/src/main/assets/models`.