benchmark Optimize XLnet performance


Open zhaoyuchen2018 opened this issue 5 years ago • 7 comments

Test report provided with the model:

Self-measured on a single V100 (one machine, one GPU): Paddle (develop branch): 0.961218 steps/s; TF 1.15: 1.61 steps/s

Dec 19 '19 03:12 zhaoyuchen2018

The profile results show that the stack and stack_grad ops take too much CPU time.

Dec 19 '19 04:12 zhaoyuchen2018

The tracing file shows that stack_grad takes a long time, most likely because it is waiting on GPU operations.

Dec 19 '19 12:12 zhaoyuchen2018

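| Paddle op | Calls | Time cost (ms) | TF op | Calls | Time cost (ms) |
|---|---|---|---|---|---|
| stack_grad | 12 | 719.9 | | | |
| stack | 15 | 347.6 | | | |
| elementwise_mul | 305 | 96 | Mul | 1203 | 132.4 |
| matmul_grad | 186 | 191.4 | BatchMatMulV2 | 370 | 315.8 |
| transpose2_grad | 383 | 179.8 | | | |
| transpose2 | 406 | 197.3 | Transpose | 1236 | 260.6 |
| matmul | 193 | 74 | MatMul | 373 | 154.8 |
| Total | | 1806 | | | 863.6 |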
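
To make the comparison above concrete, here is a minimal sketch that recomputes the totals from the listed per-op times and shows how much of the Paddle time comes from stack and stack_grad. The variable names are illustrative only. The numbers are copied from the table and cover only the ops listed there, so the result is a rough indication rather than a prediction of end-to-end speed.

```python
# Per-op time cost in ms, copied from the table in this thread.
# Only the ops listed there are included; this is not a full profile.
paddle_ms = {
    "stack_grad": 719.9,
    "stack": 347.6,
    "elementwise_mul": 96.0,
    "matmul_grad": 191.4,
    "transpose2_grad": 179.8,
    "transpose2": 197.3,
    "matmul": 74.0,
}
tf_ms = {
    "Mul": 132.4,
    "BatchMatMulV2": 315.8,
    "Transpose": 260.6,
    "MatMul": 154.8,
}

paddle_total = sum(paddle_ms.values())
tf_total = sum(tf_ms.values())

# Combined cost of the two ops flagged in the profile.
stack_ms = paddle_ms["stack"] + paddle_ms["stack_grad"]
stack_share = stack_ms / paddle_total

# Hypothetical remainder if stack and stack_grad cost nothing at all.
paddle_without_stack = paddle_total - stack_ms

print(f"Paddle total: {paddle_total:.1f} ms, TF total: {tf_total:.1f} ms")
print(f"stack + stack_grad: {stack_ms:.1f} ms ({stack_share:.1%} of Paddle total)")
print(f"Paddle total without stack ops: {paddle_without_stack:.1f} ms")
```

By this count, stack and stack_grad make up roughly 59% of the listed Paddle op time, and the remaining listed Paddle ops together cost less than the listed TF ops, which is consistent with focusing the optimization on these two ops.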