wangchaochaohu comments

Results: 8 comments of wangchaochaohu

MobilenetV1 profile for optimizing

CUDNN 7.5.1

```
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
==6990== NVPROF is profiling process 6990, command: python mobilenet/test_paddle.py
W0107 10:05:59.245507 6990 device_context.cc:236] Please NOTE: device:...
```

MobilenetV1 profile for optimizing

After optimization

```
==127662== Profiling result:
           Type  Time(%)      Time  Calls       Avg       Min       Max  Name
GPU activities:   56.42%  512.18ms     10  51.218ms  32.783ms  213.03ms  [CUDA memcpy DtoH]
                  38.71%  351.45ms     13  27.035ms  1.8240us  35.711ms...
```

Optimize the performance of Transformer-Big on 1 V100 GPU

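Judging from the log, the CPU-->GPU data transfer belongs to the data-reading stage. However, I tried the multi-process data-reading approach used in YOLOv3, and performance did not improve.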

Optimize the performance of Transformer-Big on 1 V100 GPU

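> Judging from the log, the CPU-->GPU data transfer belongs to the data-reading stage. However, I tried the multi-process data-reading approach used in YOLOv3, and performance did not improve.

On my local machine (CUDA10.0):

- With the original code, export FLAGS_reader_queue_speed_test_mode=True gives only a small improvement, roughly from 1.86 to 1.92.
- After switching to the YOLOv3-style multi-process approach:
  - with export FLAGS_reader_queue_speed_test_mode=True, the speed improves from about 1.86 to about 2.19;
  - but with export FLAGS_reader_queue_speed_test_mode=False, there is no improvement.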
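To put the figures quoted above on a common scale, the relative gain over the 1.86 baseline can be computed directly. This is only a small arithmetic sketch: the helper `relative_speedup` and the dictionary keys are illustrative names, not part of PaddlePaddle, and the numbers are the approximate ones reported in the comment.

```python
def relative_speedup(before: float, after: float) -> float:
    """Return the relative throughput gain as a percentage."""
    return (after - before) / before * 100.0


baseline = 1.86  # approximate baseline speed reported in the comment

# Approximate speeds reported with FLAGS_reader_queue_speed_test_mode=True.
measurements = {
    "original reader": 1.92,
    "multi-process reader": 2.19,
}

gains = {name: relative_speedup(baseline, speed) for name, speed in measurements.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.1f}% over baseline")
```

Under these numbers the original reader gains roughly 3% and the multi-process reader roughly 18%, but only when the speed-test flag is on.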

Optimize the performance of Transformer-Big on 1 V100 GPU

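> > Judging from the log, the CPU-->GPU data transfer belongs to the data-reading stage. However, I tried the multi-process data-reading approach used in YOLOv3, and performance did not improve.
>
> On my local machine (CUDA10.0):
>
> * With the original code, export FLAGS_reader_queue_speed_test_mode=True gives only a small improvement, roughly from 1.86 to 1.92.
> * After switching to the YOLOv3-style multi-process approach:
>   * with export FLAGS_reader_queue_speed_test_mode=True, the speed improves from about 1.86 to about 2.19;
>   * but with export FLAGS_reader_queue_speed_test_mode=False, there is no improvement.

As for how to write the multi-process version, we need @邓凯鹏...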

Optimize the performance of Transformer-Big on 1 V100 GPU

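#### Optimizing the dropout implementation

##### 1. Implement dropout_cudnn_op on top of the dropout API provided by cuDNN, PaddlePaddle/Paddle#18954

- Problems encountered:
  - Mask shape mismatch: to save GPU memory, cuDNN stores the mask as bits.
  - Cache problem: the forward tests in our OP Test are not isolated, so creating a Cache Var with the same name ends up sharing a single Var.
- Speedup on the transformer-big model: performance improves by about 10%
  - Test environment: V100 + CUDA10.0
  - Single-GPU training speed: 1.852 step/s -> 2.040 step/s...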
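The mask shape mismatch mentioned above can be illustrated with plain NumPy. This is a sketch under assumptions: it uses `np.packbits` to mimic a bit-stored mask, and it does not reproduce cuDNN's actual reserve-space layout, which is opaque to the caller. The tensor shape and `keep_prob` value are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.ones((4, 16), dtype=np.float32)  # arbitrary input shape for illustration
keep_prob = 0.9                         # arbitrary keep probability

# Element-wise mask: one uint8 per input element, same shape as x.
mask = (rng.random(x.shape) < keep_prob).astype(np.uint8)

# Bit-packed mask: 8 elements per byte, so its shape no longer matches x.
packed = np.packbits(mask.ravel())

# Unpacking recovers the element-wise mask (extra padding bits are dropped).
unpacked = np.unpackbits(packed)[: mask.size].reshape(x.shape)

print("element-wise mask shape:", mask.shape, "bytes:", mask.nbytes)
print("bit-packed mask shape:  ", packed.shape, "bytes:", packed.nbytes)
```

A test that compares the mask output against an element-wise reference of the input's shape would therefore need an unpacking step like the one above, or a different way of checking the result.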

Optimize the performance of Transformer-Big on 1 V100 GPU

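Label Smooth optimization, PaddlePaddle/Paddle#19175. Test on the transformer-big model: no end-to-end performance improvement. Measured with PaddlePaddle's profile tool inside the transformer-big model, the average time of the single OP went from 3.51607 to 2.39707 (ms).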
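For readers unfamiliar with the operation being optimized, label smoothing with a uniform prior can be sketched in NumPy as follows. This is a reference formulation only, not the kernel from the PR; the function name `label_smooth_reference` and the `epsilon` value are illustrative assumptions.

```python
import numpy as np


def label_smooth_reference(one_hot: np.ndarray, epsilon: float = 0.1) -> np.ndarray:
    """Uniform-prior label smoothing: (1 - eps) * y + eps / num_classes."""
    num_classes = one_hot.shape[-1]
    return (1.0 - epsilon) * one_hot + epsilon / num_classes


# Two one-hot labels over 4 classes (classes 0 and 2).
labels = np.eye(4, dtype=np.float32)[[0, 2]]
smoothed = label_smooth_reference(labels, epsilon=0.1)
print(smoothed)
```

The operation is a cheap element-wise transform, which is consistent with a faster kernel showing up in the per-OP profile without changing the overall training speed.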

Optimize the performance of Transformer-Big on 1 V100 GPU

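The cast OP and the increment OP run on the CPU kernel because our code chooses between the CPU and GPU implementation of these two OPs according to whether the input data resides on the CPU or on the GPU. After changing the code so that both OPs use the GPU kernel type and running transformer-big training, the training speed changed as follows: 1.852 -> 1.844 (step/s). Essentially, the data transform cannot be avoided; the only question is in which OP it takes place.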
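The point that the transfer only moves from one OP to another can be shown with a toy model. Everything here is hypothetical: `Tensor`, `run_op`, `run_pipeline`, and the downstream `consumer` op are illustrative names, not PaddlePaddle APIs, and the model only counts where a host/device copy would occur.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Tensor:
    name: str
    place: str  # "cpu" or "gpu"


def run_op(op_name: str, inp: Tensor, kernel_place: str,
           transfers: List[Tuple[str, str, str]]) -> Tensor:
    """Run an op on kernel_place, recording a transfer if the input lives elsewhere."""
    if inp.place != kernel_place:
        transfers.append((op_name, inp.place, kernel_place))
    return Tensor(f"{op_name}_out", kernel_place)


def run_pipeline(force_gpu: bool) -> List[Tuple[str, str, str]]:
    transfers: List[Tuple[str, str, str]] = []
    step = Tensor("step", "cpu")  # the input starts on the CPU
    # Default policy: pick the kernel matching the input's placement.
    place = "gpu" if force_gpu else step.place
    out = run_op("increment", step, place, transfers)
    # Hypothetical downstream op that always runs on the GPU.
    run_op("consumer", out, "gpu", transfers)
    return transfers


default_transfers = run_pipeline(force_gpu=False)
forced_transfers = run_pipeline(force_gpu=True)
print("input-driven kernel choice:", default_transfers)
print("forced GPU kernel:         ", forced_transfers)
```

In both configurations exactly one CPU-to-GPU copy is recorded; forcing the GPU kernel only shifts it from the consumer to the increment op, which matches the essentially unchanged training speed reported above.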