Yiqun Liu

57 comments by Yiqun Liu

#### CPU -> GPU data copy analysis

#### Method

- Add logging in `fluid/memory/memcpy.cc`
- Set `export GLOG_v=4` at runtime
- Set `exec_strategy.num_threads=1`
- Result:

```
I0719 10:14:50.539557 70931 operator.cc:169] CUDAPlace(0) Op(increment), inputs:{X[@LR_DECAY_COUNTER@:int64_t[1]({})]}, outputs:{Out[@LR_DECAY_COUNTER@:int64_t[1]({})]}.
I0719 10:14:50.539577 70931 operator.cc:1011] expected_kernel_key:data_type[int64_t]:data_layout[ANY_LAYOUT]:place[CPUPlace]:library_type[PLAIN]
I0719...
```
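As a rough aid for reading such a log, the sketch below pairs each `Op(...)` line with the `expected_kernel_key` line that follows it and reports ops whose kernel place differs from the place they were launched on. Such a mismatch may indicate a host/device copy, though that is not guaranteed. The regular expressions are inferred only from the two sample lines above, and `find_place_mismatches` is a hypothetical helper name, not part of Paddle.

```python
import re

# Patterns inferred from the sample log lines only (an assumption, not a
# documented log format).
OP_RE = re.compile(r"operator\.cc:\d+\] (\w+Place(?:\(\d+\))?) Op\((\w+)\)")
KEY_RE = re.compile(r"expected_kernel_key:.*?place\[(\w+)\]")


def find_place_mismatches(lines):
    """Return (op, run_place, kernel_place) tuples where the places differ."""
    mismatches = []
    current = None  # (op name, place the op was launched on)
    for line in lines:
        op_match = OP_RE.search(line)
        if op_match:
            current = (op_match.group(2), op_match.group(1))
            continue
        key_match = KEY_RE.search(line)
        if key_match and current:
            op, run_place = current
            kernel_place = key_match.group(1)
            # Compare "CUDAPlace(0)" with "CPUPlace" by dropping the device id.
            if run_place.split("(")[0] != kernel_place:
                mismatches.append((op, run_place, kernel_place))
            current = None
    return mismatches
```

Feeding it the lines of a `GLOG_v=4` log (for example via `open(path)`) yields a short list of candidate ops to inspect before reading the full log.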

@Courtesy-Xs The format check did not pass. You need to:

- Install the format checking tool with `pip install pre-commit`
- Run `pre-commit install` in the benchmark directory

![image](https://user-images.githubusercontent.com/12538138/148510419-f2d9d5d5-e0db-4b60-b246-b4f3936dfe02.png)

Also, please sign the CLA as prompted.

### Program = feed + abs + fetch

- Profile data

```
-------------------------> Profiling Report CPU 10 65.3952 39.898246 (0.610110) 25.496945 (0.389890) 6.46307 6.75515 6.53952 0.434411 thread0::fetch 10 42.5865 37.686449 (0.884939)...
```

#### Profile results

```
-------------------------> Profiling Report GPU 18475 334.162 303.616585 (0.908592) 30.544991 (0.091408) 0.013014 0.120487 0.0180872 0.0309226 elementwise_mul_grad 6734 256.62 238.810646 (0.930602) 17.808921 (0.069398) 0.027794 7.31021 0.038108 0.023747 elementwise_mul 6864...
```
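The numbers in a report row can be sanity-checked against each other. The sketch below assumes the columns are, in order: calls, total time, CPU time (share), GPU time (share), min, max, average, and share of overall time. That column order is an assumption, since the header line is not part of the excerpt. `row_summary` is a hypothetical helper.

```python
def row_summary(calls, total, cpu_time, gpu_time, ratio):
    """Derive the dependent columns of one profiler row from the others."""
    return {
        "average": total / calls,        # should match the Ave column
        "cpu_share": cpu_time / total,   # should match the CPU (ratio) value
        "gpu_share": gpu_time / total,   # should match the GPU (ratio) value
        "overall_total": total / ratio,  # implied total time of all events
    }


# Values copied from the first two complete rows of the report above.
first = row_summary(18475, 334.162, 303.616585, 30.544991, 0.0309226)
second = row_summary(6734, 256.62, 238.810646, 17.808921, 0.023747)
```

Under that assumption, both rows imply nearly the same overall total, which suggests the last column is each event's share of the whole run.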

#### Timeline analysis

- Step overview:
  - Does the model use two StaticRNN structures?
  - There is clearly visible GPU idle time.

  ![image](https://user-images.githubusercontent.com/12538138/62753853-ec632e00-ba9f-11e9-859b-67b9dd65f962.png)
  ![image](https://user-images.githubusercontent.com/12538138/62753894-22081700-baa0-11e9-9f1e-70bd9c843e2b.png)
- When recurrent starts, it has to create Operators and prepare the ExecutorPrepareContext. Each step uses its own step_scope and has to create Variables.

  ![image](https://user-images.githubusercontent.com/12538138/62753934-49f77a80-baa0-11e9-88e9-d2b2546215c5.png)
- GPU utilization is low both in the preparation phase above and inside StaticRNN.

  ![image](https://user-images.githubusercontent.com/12538138/62754012-93e06080-baa0-11e9-823d-6e5a866eeaf2.png)
  ![image](https://user-images.githubusercontent.com/12538138/62754040-b3778900-baa0-11e9-9d54-2de2cea06d83.png)
- These are all elementwise computations and could be fused. A general method that supports this kind of fusion is under development.

  ![image](https://user-images.githubusercontent.com/12538138/62754083-f0dc1680-baa0-11e9-8b15-d7254b76f50f.png)
- rnn_memory_helper introduces many GPU-to-GPU memory copies; this should be optimized at the design level.

  ![image](https://user-images.githubusercontent.com/12538138/62754130-2c76e080-baa1-11e9-87d7-85b7d69a67fe.png)
- There is a synchronization in recurrent_grad that requires a long wait.

  ![image](https://user-images.githubusercontent.com/12538138/62754253-a4450b00-baa1-11e9-9425-1442ea2bdfeb.png)
- Gradient aggregation uses many sum_op instances (a PR is already optimizing this).

  ![image](https://user-images.githubusercontent.com/12538138/62754290-d22a4f80-baa1-11e9-8c74-2e5581eeb3e8.png)
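To make the elementwise-fusion idea concrete, here is a minimal pure-Python sketch. It is not Paddle's fusion pass, and the op chain (multiply, add, ReLU) is a hypothetical example chosen for illustration. The unfused version makes one pass per op and materialises an intermediate each time, which is analogous to one kernel launch per op. The fused version does the same arithmetic in a single pass.

```python
def unfused_elementwise(x, y, z):
    # One pass per op, each producing an intermediate list.
    t0 = [a * b for a, b in zip(x, y)]   # elementwise multiply
    t1 = [a + b for a, b in zip(t0, z)]  # elementwise add
    return [max(a, 0.0) for a in t1]     # ReLU


def fused_elementwise(x, y, z):
    # Single pass, no intermediates: the shape a fused kernel would take.
    return [max(a * b + c, 0.0) for a, b, c in zip(x, y, z)]
```

Both functions perform the same per-element operations in the same order, so they return identical results. The difference is the number of passes over memory and, on a GPU, the number of kernel launches.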

### Fusing two reduce_sum ops with the same configuration

- Program

```cpp
var_0, var_1 = reduce_sum(x, dim=[1], keep_dim=false)
var_2 = identity(x)
var_3 = elementwise_mul(x, var_2, axis=-1)
var_4, var_5 = reduce_sum(var_3, dim=[1], keep_dim=false)
```

- The code currently generated for X86 is as follows

```cpp
I1124...
```
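As a reference for what the program above computes, the following pure-Python sketch evaluates it for a 2-D input both op by op and with the two reductions fused into one loop over each row. It only illustrates the fusion opportunity and is not the generated code. The function names are hypothetical, and each `reduce_sum` is treated as producing a single row-sum result.

```python
def reduce_sum_dim1(x):
    # reduce_sum(x, dim=[1], keep_dim=false) for a list of rows.
    return [sum(row) for row in x]


def unfused_reduce(x):
    row_sums = reduce_sum_dim1(x)
    var_2 = [list(row) for row in x]             # identity(x)
    var_3 = [[a * b for a, b in zip(r1, r2)]     # elementwise_mul(x, var_2)
             for r1, r2 in zip(x, var_2)]
    return row_sums, reduce_sum_dim1(var_3)


def fused_reduce(x):
    # Both reductions share one loop nest, so x is read only once.
    row_sums, sq_sums = [], []
    for row in x:
        s = 0.0
        sq = 0.0
        for v in row:
            s += v
            sq += v * v
        row_sums.append(s)
        sq_sums.append(sq)
    return row_sums, sq_sums
```

Because both reductions use the same `dim` and `keep_dim`, they iterate over the same index space, which is what makes sharing a single loop nest possible.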

What kind of error are you seeing? We can build the project successfully; the only requirement is to put the Paddle library and models in the right location. We provide...

> After I read the project source code, I find that there is no resize process that deals with the difference in size between camera frame size and the model...