
QuickGelu Fusion

centwang opened this issue 2 years ago · 2 comments

Some models use QuickGelu(x) = x * sigmoid(1.702 * x), which takes 3 ops for the forward pass and 5 ops for the backward pass. This PR fuses the pattern into a single op named QuickGelu, with a corresponding gradient op QuickGeluGrad.
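For reference, here is a minimal NumPy sketch of the fused forward and its gradient; the derivative form is derived from the formula above rather than taken from the PR.

```python
import numpy as np

ALPHA = 1.702  # constant in QuickGelu(x) = x * sigmoid(1.702 * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def quick_gelu(x):
    # Fused forward: replaces the Mul -> Sigmoid -> Mul subgraph.
    return x * sigmoid(ALPHA * x)

def quick_gelu_grad(dy, x):
    # d/dx [x * s(a*x)] = s(a*x) + a * x * s(a*x) * (1 - s(a*x)),
    # with s = sigmoid and a = 1.702; replaces the 5-op backward subgraph.
    s = sigmoid(ALPHA * x)
    return dy * (s + ALPHA * x * s * (1.0 - s))
```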

For CUDA, tested on a V100 with an input tensor of shape [64, 128, 2048] and float16 type. Before the fusion, FW takes 335 µs and BW takes 614 µs.

After the fusion, FW takes 115 µs and BW takes 139 µs, which is much faster.
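As a rough back-of-the-envelope check (my estimate, not from the PR): elementwise ops are memory-bound, so the ~3x forward speedup is close to what simply eliminating the intermediate tensor reads and writes would predict.

```python
# Back-of-the-envelope memory-traffic estimate for the float16 case above.
elems = 64 * 128 * 2048   # ~16.8M elements
bytes_per_elem = 2        # float16

# Unfused forward: Mul, Sigmoid, Mul each read and write one full tensor
# (the final Mul's second read of x is ignored for simplicity).
unfused_bytes = 3 * 2 * elems * bytes_per_elem

# Fused forward: one read of x, one write of the output.
fused_bytes = 2 * elems * bytes_per_elem

print(unfused_bytes / fused_bytes)  # 3.0, close to the measured 335/115 ≈ 2.9
```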

For the CPU kernel, using the same shape and float32 type: before the fusion, FW takes ~10 ms and BW takes ~49 ms. Per-op profile:

Forward: Mul: 3480 µs, Sigmoid: 1996 µs, Mul: 4789 µs
Backward: Mul: 4642 µs, Mul: 4195 µs, SigmoidGrad: 18328 µs, Mul: 2988 µs, Sum: 18576 µs

After the fusion, FW takes ~4 ms and BW takes ~5 ms, which is also much faster: QuickGelu: 3939 µs, QuickGeluGrad: 5089 µs
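To reproduce the flavor of this CPU result outside ONNX Runtime, one option is numexpr, which genuinely fuses an elementwise expression into a single pass over memory (a sketch assuming numexpr is installed; absolute timings will not match the profile above).

```python
import numpy as np
import numexpr as ne

x = np.random.rand(64, 128, 2048).astype(np.float32)

def unfused(x):
    # Mirrors the original graph: each op materializes a full tensor.
    t = 1.702 * x                 # Mul
    s = 1.0 / (1.0 + np.exp(-t))  # Sigmoid
    return x * s                  # Mul

def fused(x):
    # x * sigmoid(1.702 * x) == x / (1 + exp(-1.702 * x)),
    # evaluated by numexpr in one fused pass over the input.
    return ne.evaluate("x / (1 + exp(-1.702 * x))")

assert np.allclose(unfused(x), fused(x), atol=1e-5)
```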

centwang avatar Aug 02 '22 04:08 centwang

Do you know which model has this pattern?

ytaous avatar Aug 02 '22 04:08 ytaous

Do you know which model has this pattern?

MoE (Mixture-of-Experts) models.

centwang avatar Aug 02 '22 05:08 centwang

The speedup looks good. I didn't expect this much speedup for elementwise ops. Would something like automatic fusion of elementwise ops handle this scenario as well, @pengwa?

Also, does QuickGelu consistently outperform even a simple Sigmoid?

Do we see a meaningful end-to-end speedup on any model with this change?

baijumeswani avatar Oct 25 '22 23:10 baijumeswani