grok-1
Hyperbolic tangent attention squashing, attention kernels
The attention formula has an unusual line that puts the attention logits through tanh (see the sketch after this list).
- What is the reason for this? Normalization/entropy control? Is there a paper about it?
- Would a fast attention kernel that can support this operation be out of scope for this repo?
Maybe rewrite the title of this issue to be more descriptive.
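For reference, here is a minimal sketch of the kind of tanh soft-capping line being asked about — not copied from the repo; the cap value of 30.0, the function name, and the shapes are assumptions for illustration:

```python
import jax
import jax.numpy as jnp

def soft_cap_attn_logits(attn_logits, cap=30.0):
    """Squash raw attention logits into (-cap, cap) with a scaled tanh.

    Near zero this is roughly the identity, so ordinary attention is
    unchanged; very large logits saturate smoothly at +/- cap.
    """
    return cap * jnp.tanh(attn_logits / cap)

# Hypothetical single-head example: scores for 4 queries over 6 keys.
q = jax.random.normal(jax.random.PRNGKey(0), (4, 8))   # [queries, head_dim]
k = jax.random.normal(jax.random.PRNGKey(1), (6, 8))   # [keys, head_dim]
logits = (q @ k.T) / jnp.sqrt(8.0)
weights = jax.nn.softmax(soft_cap_attn_logits(logits), axis=-1)
```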
I guess it's there to avoid exploding (or vanishing) gradients. Squashing the attention logits with tanh keeps them in a bounded range, which helps control the gradients during backpropagation and avoids the extreme values that hurt numerical stability.
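A rough illustration of that claim (my own sketch, assuming a cap of 30.0): the derivative of the capped logit is 1 - tanh(x/cap)^2, so it stays close to 1 for typical logits and falls toward 0 for extreme ones, which is what damps their gradient contribution:

```python
import jax
import jax.numpy as jnp

cap = 30.0
soft_cap = lambda x: cap * jnp.tanh(x / cap)

# d/dx [cap * tanh(x / cap)] = 1 - tanh(x / cap) ** 2
dsoft_cap = jax.vmap(jax.grad(soft_cap))

xs = jnp.array([0.0, 10.0, 30.0, 100.0, 300.0])
print(soft_cap(xs))    # capped values stay inside (-30, 30)
print(dsoft_cap(xs))   # ~1.0 near zero, -> 0 for extreme logits
```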