grok-1
Hyperbolic tangent attention squashing, attention kernels
The attention formula has an unusual line that puts the attention logits through tanh (see the sketch after this list).
- What is the reason for this? Normalization/entropy control? Is there a paper about it?
- Would a fast attention kernel that can support this operation be out of scope for this repo?
Maybe rewrite the title of this issue to be more descriptive.
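For reference, here is a minimal sketch of the kind of tanh soft-capping line being asked about — not copied from the repo; the cap value of 30.0, the function name, and the shapes are assumptions for illustration:

```python
import jax
import jax.numpy as jnp

def soft_cap_attn_logits(attn_logits, cap=30.0):
    """Squash raw attention logits into (-cap, cap) with a scaled tanh.

    Near zero this is roughly the identity, so ordinary attention is
    unchanged; very large logits saturate smoothly at +/- cap.
    """
    return cap * jnp.tanh(attn_logits / cap)

# Hypothetical single-head example: scores for 4 queries over 6 keys.
q = jax.random.normal(jax.random.PRNGKey(0), (4, 8))   # [queries, head_dim]
k = jax.random.normal(jax.random.PRNGKey(1), (6, 8))   # [keys, head_dim]
logits = (q @ k.T) / jnp.sqrt(8.0)
weights = jax.nn.softmax(soft_cap_attn_logits(logits), axis=-1)
```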
I guess it's there to avoid exploding (or vanishing) gradients. Squashing the attention logits with tanh keeps them in a bounded range, which helps control the gradients during backpropagation and avoids the extreme values that hurt numerical stability.
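A rough illustration of that claim (my own sketch, assuming a cap of 30.0): the derivative of the capped logit is 1 - tanh(x/cap)^2, so it stays close to 1 for typical logits and falls toward 0 for extreme ones, which is what damps their gradient contribution:

```python
import jax
import jax.numpy as jnp

cap = 30.0
soft_cap = lambda x: cap * jnp.tanh(x / cap)

# d/dx [cap * tanh(x / cap)] = 1 - tanh(x / cap) ** 2
dsoft_cap = jax.vmap(jax.grad(soft_cap))

xs = jnp.array([0.0, 10.0, 30.0, 100.0, 300.0])
print(soft_cap(xs))    # capped values stay inside (-30, 30)
print(dsoft_cap(xs))   # ~1.0 near zero, -> 0 for extreme logits
```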