
Hyperbolic tangent attention squashing, attention kernels

Open neverix opened this issue 11 months ago • 2 comments

The attention formula has an unusual line that puts the attention logits through tanh.
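For reference, the operation looks roughly like the sketch below; the function name and the cap value of 30 are illustrative assumptions, not taken from the repo. It rescales the logits so they saturate smoothly instead of growing without bound, and softmax is applied afterwards as usual.

```python
import jax.numpy as jnp

def soft_cap_logits(attn_logits: jnp.ndarray, cap: float = 30.0) -> jnp.ndarray:
    """Squash raw attention logits into (-cap, cap) with a scaled tanh.

    Near zero the mapping is approximately the identity, so typical logits
    pass through almost unchanged, while very large logits saturate at ±cap.
    """
    cap_val = jnp.asarray(cap, dtype=attn_logits.dtype)
    return cap_val * jnp.tanh(attn_logits / cap_val)
```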

  1. What is the reason for this? Normalization/entropy control? Is there a paper about it?
  2. Would a fast attention kernel that can support this operation be out of scope for this repo?

neverix avatar Mar 19 '24 03:03 neverix

Maybe rewrite the title of this issue to be more descriptive.

EwoutH avatar Mar 19 '24 07:03 EwoutH

I guess it's there to avoid exploding or vanishing gradients. Squashing the attention logits with tanh helps control the gradients during backpropagation, since it avoids extreme values and improves numerical stability.
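A quick way to see the gradient-control point, sketched with the same illustrative cap value as above (the cap of 30 is an assumption for the example): the derivative of `cap * tanh(x / cap)` is `1 - tanh(x / cap)**2`, which lies in (0, 1], so extreme logits get their gradient pushed toward zero rather than amplified.

```python
import jax
import jax.numpy as jnp

def soft_cap(x, cap=30.0):
    # Same squashing as above: identity-like near 0, saturating at ±cap.
    return cap * jnp.tanh(x / cap)

# d/dx [cap * tanh(x / cap)] = 1 - tanh(x / cap)**2, bounded in (0, 1].
logits = jnp.array([0.0, 10.0, 50.0, 500.0])
grads = jax.vmap(jax.grad(soft_cap))(logits)
# grads ≈ [1.00, 0.90, 0.13, 0.00]: extreme logits contribute almost no
# gradient, so a single outlier score cannot blow up the backward pass.
```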

sebdg avatar Mar 19 '24 15:03 sebdg