onnxruntime
QuickGelu Fusion
Some models use QuickGelu(x) = x * sigmoid(1.702 * x), which requires 3 Ops for the forward pass and 5 Ops for the backward pass. This PR fuses these into a single Op named QuickGelu and its gradient QuickGeluGrad.
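For reference, here is a minimal NumPy sketch of the math being fused. The op names QuickGelu and QuickGeluGrad and the 1.702 constant come from the PR description; the gradient formula below is the analytic derivative of x * sigmoid(1.702 * x) and is not necessarily how the actual kernels are organized.

```python
import numpy as np

ALPHA = 1.702  # scaling constant from QuickGelu(x) = x * sigmoid(1.702 * x)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def quick_gelu_unfused(x):
    """Forward pass as 3 separate elementwise ops: Mul, Sigmoid, Mul."""
    scaled = ALPHA * x      # Mul
    s = sigmoid(scaled)     # Sigmoid
    return x * s            # Mul

def quick_gelu(x):
    """What a single fused QuickGelu op computes."""
    return x * sigmoid(ALPHA * x)

def quick_gelu_grad(dy, x):
    """Analytic gradient of x * sigmoid(ALPHA * x):
    dL/dx = dY * (s + ALPHA * x * s * (1 - s)), with s = sigmoid(ALPHA * x).
    The real QuickGeluGrad kernel may arrange these terms differently."""
    s = sigmoid(ALPHA * x)
    return dy * (s + ALPHA * x * s * (1.0 - s))

if __name__ == "__main__":
    x = np.random.randn(64, 128, 2048).astype(np.float32)
    np.testing.assert_allclose(quick_gelu(x), quick_gelu_unfused(x), rtol=1e-6)
    print(quick_gelu_grad(np.ones_like(x), x).shape)  # (64, 128, 2048)
```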
For CUDA, tested on a V100 with an input tensor of shape [64, 128, 2048] and float16 type:
Before, FW takes 335us, BW takes 614us
After, FW takes 115us, BW takes 139us, which is much faster.
For the CPU kernel, using the same shape and float type:
Before, FW takes 10us, BW takes 49us.
Mul: 3480[µs]
Sigmoid: 1996[µs]
Mul: 4789[µs]
Mul: 4642[µs]
Mul: 4195[µs]
SigmoidGrad: 18328[µs]
Mul: 2988[µs]
Sum: 18576[µs]
After, FW takes 4us, BW takes 5us, which is also much faster.
QuickGelu: 3939[µs]
QuickGeluGrad: 5089[µs]
Do you know which model has this pattern?
MoE
The speed-up looks good. I didn't expect such a speed-up for elementwise ops. Would something like automatic fusion of elementwise ops handle this scenario as well, @pengwa?
Also, does QuickGelu consistently outperform even a simple Sigmoid?
Do we see any good speed-up on any model with this change?