
Optimize XLnet performance

Open zhaoyuchen2018 opened this issue 5 years ago • 7 comments

Test report provided by the model team: image

Self-measured values on a single V100 (one machine, one GPU): Paddle (develop branch): 0.961218 steps/s; TF 1.15: 1.61 steps/s

zhaoyuchen2018 avatar Dec 19 '19 03:12 zhaoyuchen2018

image

The profile results show that the stack and stack_grad ops spend too much time on the CPU.

zhaoyuchen2018 avatar Dec 19 '19 04:12 zhaoyuchen2018

image

The tracing file shows that stack_grad takes a long time, most likely because it is waiting for GPU operations to finish.

zhaoyuchen2018 avatar Dec 19 '19 12:12 zhaoyuchen2018

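| Paddle op | Calls | Time cost (ms) | TF op | Calls | Time cost (ms) |
|---|---|---|---|---|---|
| stack_grad | 12 | 719.9 | | | |
| stack | 15 | 347.6 | | | |
| elementwise_mul | 305 | 96 | Mul | 1203 | 132.4 |
| matmul_grad | 186 | 191.4 | BatchMatMulV2 | 370 | 315.8 |
| transpose2_grad | 383 | 179.8 | | | |
| transpose2 | 406 | 197.3 | Transpose | 1236 | 260.6 |
| matmul | 193 | 74 | MatMul | 373 | 154.8 |
| Total | | 1806 | | | 863.6 |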

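The op comparison is shown above. The full op list is not included; the ops that account for the largest share of time will be optimized first.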
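To make the prioritization concrete, here is a minimal sketch that ranks the listed ops by their share of the listed time. The numbers are copied from the table above; the dictionary and function names are only illustrative, and since the table is partial, the shares are relative to the listed ops rather than to a full training step.

```python
# Per-op time cost (ms) copied from the comparison table in this thread.
# Only the ops listed there are included, so shares are relative to this subset.
paddle_ms = {
    "stack_grad": 719.9,
    "stack": 347.6,
    "elementwise_mul": 96.0,
    "matmul_grad": 191.4,
    "transpose2_grad": 179.8,
    "transpose2": 197.3,
    "matmul": 74.0,
}
tf_ms = {
    "Mul": 132.4,
    "BatchMatMulV2": 315.8,
    "Transpose": 260.6,
    "MatMul": 154.8,
}


def shares(times):
    """Return (op, fraction of total time) pairs, largest first."""
    total = sum(times.values())
    ranked = ((op, t / total) for op, t in times.items())
    return sorted(ranked, key=lambda item: item[1], reverse=True)


paddle_total = sum(paddle_ms.values())
tf_total = sum(tf_ms.values())
print(f"Paddle listed total: {paddle_total:.1f} ms, TF listed total: {tf_total:.1f} ms")

for op, share in shares(paddle_ms):
    print(f"{op:16s} {share:6.1%}")

# Combined share of the two stack ops, which have no TF counterpart in the table.
stack_share = (paddle_ms["stack"] + paddle_ms["stack_grad"]) / paddle_total
print(f"stack + stack_grad: {stack_share:.1%} of listed Paddle time")
```

On these numbers, stack and stack_grad together make up roughly 59% of the listed Paddle time, and the two transpose ops about 21%, which matches the order in which they are tackled below.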

zhaoyuchen2018 avatar Dec 26 '19 02:12 zhaoyuchen2018

Optimized the stack op: https://github.com/PaddlePaddle/Paddle/pull/21940
After the optimization, xlnet-ernie reaches 1.005, an improvement of ~4%.

image

zhaoyuchen2018 avatar Dec 26 '19 02:12 zhaoyuchen2018

After optimizing transpose: 1.337516 steps/s

image

image
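For reference, the throughput figures reported so far in this thread can be turned into relative gains with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, and the 1.005 figure is assumed to be in steps/s like the others, since the comment that reports it gives no unit.

```python
# Throughput figures quoted in this thread (steps/s).
baseline = 0.961218         # Paddle develop, single V100
after_stack = 1.005         # after the stack op optimization (unit assumed to be steps/s)
after_transpose = 1.337516  # after the transpose optimization
tf_115 = 1.61               # TF 1.15 on the same setup


def gain(new, old):
    """Relative improvement of `new` over `old`, e.g. 0.05 means 5% faster."""
    return new / old - 1.0


stack_gain = gain(after_stack, baseline)
transpose_gain = gain(after_transpose, after_stack)
cumulative_gain = gain(after_transpose, baseline)
fraction_of_tf = after_transpose / tf_115

print(f"stack optimization:     {stack_gain:+.1%}")
print(f"transpose optimization: {transpose_gain:+.1%}")
print(f"cumulative vs baseline: {cumulative_gain:+.1%}")
print(f"fraction of TF 1.15:    {fraction_of_tf:.1%}")
```

Under that assumption, the transpose change contributes far more than the stack change, and Paddle moves from about 60% to about 83% of the TF 1.15 throughput.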

zhaoyuchen2018 avatar Jan 02 '20 08:01 zhaoyuchen2018

image Before the element_wise computation, a large amount of time is wasted on synchronization between the CPU and the GPU.

zhaoyuchen2018 avatar Jan 13 '20 03:01 zhaoyuchen2018

image After optimizing the data transform, performance improved by ~8%.

zhaoyuchen2018 avatar Jan 16 '20 04:01 zhaoyuchen2018