[GPU] Swiglu OPT kernel implementation
Details:
- Implement Swiglu OPT kernel
- Use a fast erf approximation instead of the slow built-in OpenCL erf (see the sketch below)
Tickets:
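
The PR text doesn't show the `fast_erf` implementation itself, so as an illustration of the technique only, here is a minimal sketch based on the Abramowitz & Stegun 7.1.26 polynomial approximation (max absolute error around 1.5e-7). The constants and structure of the actual `fast_erf` in this PR may differ:

```c
// Sketch only: polynomial fast erf (Abramowitz & Stegun 7.1.26).
// The actual fast_erf in this PR may use different constants/structure.
inline float fast_erf(float x) {
    const float p  = 0.3275911f;
    const float a1 = 0.254829592f, a2 = -0.284496736f, a3 = 1.421413741f;
    const float a4 = -1.453152027f, a5 = 1.061405429f;
    float ax = fabs(x);
    float t  = 1.0f / (1.0f + p * ax);
    // Horner evaluation of the degree-5 polynomial in t
    float poly = ((((a5 * t + a4) * t + a3) * t + a2) * t + a1) * t;
    float y = 1.0f - poly * native_exp(-ax * ax);
    return copysign(y, x);   // erf is odd: erf(-x) = -erf(x)
}
```

Since a GELU gate is `0.5 * x * (1 + erf(x / sqrt(2)))`, swapping in a cheap erf directly speeds up any GELU-gated path of the kernel.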
Is there a reason to add a new opt kernel without applying fast_erf to the ref kernel?
Well, while fast_erf is the most important part of this performance-wise, this opt kernel has a simpler layout, and later on it can be modified to support block reads. I did that, and it gave a further performance boost as well, but it's more work that can be done later, so it's not included in this PR (a rough sketch of the idea follows below).
So it's mostly convenience in my opinion, plus it's slightly faster.
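
For context, a hedged sketch of what the block-read variant mentioned above could look like on Intel GPUs via `cl_intel_subgroups`. The buffer layout, the `split_offset` parameter, the SiLU gate, and the assumption that the local work size equals the sub-group size are all illustrative, not the actual follow-up code:

```c
// Hypothetical sketch, not the PR's kernel. Assumes:
//  - gate and up halves are contiguous, split at split_offset
//  - 1D NDRange with local size == sub-group size
#pragma OPENCL EXTENSION cl_intel_subgroups : enable

__kernel void swiglu_block_read_sketch(const __global float* input,
                                       __global float* output,
                                       int split_offset) {
    const int base = get_group_id(0) * get_sub_group_size();
    // One block read loads a contiguous chunk for the whole sub-group
    // instead of per-work-item scattered loads.
    float gate = as_float(intel_sub_group_block_read(
        (const __global uint*)(input + base)));
    float up   = as_float(intel_sub_group_block_read(
        (const __global uint*)(input + split_offset + base)));
    float silu = gate / (1.0f + native_exp(-gate));   // SiLU gate
    output[base + get_sub_group_local_id()] = silu * up;
}
```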
Also, a faster implementation of tanh will be needed in the future, but that's out of scope for this PR (something along the lines of the sketch below).
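
For illustration only: one common way to get a faster tanh on GPU is to route it through the hardware-accelerated `native_exp` using the identity `tanh(x) = 1 - 2 / (exp(2x) + 1)`; whether the eventual implementation does this is an open question:

```c
// Sketch of a possible fast tanh via the native (lower-precision,
// hardware-accelerated) exponential; not part of this PR.
inline float fast_tanh(float x) {
    // tanh(x) = (e^{2x} - 1) / (e^{2x} + 1) = 1 - 2 / (e^{2x} + 1)
    return 1.0f - 2.0f / (native_exp(2.0f * x) + 1.0f);
}
```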
Also, since this kernel has grown to support more than just swiglu, wouldn't it make sense to rename this operation to something like "gated activation" in a later PR? The current name is a bit misleading.
Yes, I agree with you. "gated activation" or "GLU activation" would be better than the current name.
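
To make the rename concrete: GLU variants all compute `activation(gate) * up` and differ only in the gate activation, which is why a generic name fits. A hypothetical compile-time dispatch (all macro names invented for illustration, `fast_erf` from the sketch above; the kernel's real interface may look different):

```c
// Hypothetical dispatch macros; illustrative only.
#define GLU_SIGMOID 0
#define GLU_SWISH   1
#define GLU_GELU    2
#ifndef GLU_TYPE
#define GLU_TYPE GLU_SWISH
#endif

#if GLU_TYPE == GLU_SWISH    // SwiGLU: SiLU gate
    #define GATE_FN(g) ((g) / (1.0f + native_exp(-(g))))
#elif GLU_TYPE == GLU_GELU   // GeGLU: GELU gate, 0.70710678f = 1/sqrt(2)
    #define GATE_FN(g) (0.5f * (g) * (1.0f + fast_erf((g) * 0.70710678f)))
#else                        // plain GLU: sigmoid gate
    #define GATE_FN(g) (1.0f / (1.0f + native_exp(-(g))))
#endif
// out[i] = GATE_FN(gate[i]) * up[i]
```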