rotary-embedding-torch
LieRE: Generalizing Rotary Position Encodings. Beats RoPE-mixed by large margin and is much faster (compute-wise)
Hi @lucidrains!
A promising piece of research was published this month (vs. RoPE-mixed (#25) in March): the so-called LieRE positional encodings generalize the key/query-vector rotation to any number of dimensions (1D, 2D, 3D, etc.) and are much simpler than RoPE in formulation. More than that, they result in much better model accuracy and 25%+ faster training than either axial RoPE or RoPE-mixed. I think their paper was really underappreciated, and this approach could be revolutionary.
LieRE leads to marked improvements in performance (up to 6%), training efficiency (3.5x reduction), and data efficiency (30%) compared to the baselines of RoFormer, DeiT III, RoPE-Mixed and Vision-Llama.
The paper is here: https://arxiv.org/abs/2406.10322. The LieRE authors have only provided pseudocode so far; however, it looks extremely simple.
It looks easy, but I'm a bit confused about how to implement the block-diagonal skew-symmetric matrix with a minimal set of learnable components while preserving its structure (a stack of n × 1D parameters + tril_indices + a block matrix?). Integrating block-sparse optimizations for fast rotations would also be nice to have.
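Here is a rough sketch of how I imagine the construction, based on my reading of the paper rather than any reference code from the authors. The class name, `block_size` default, and init scale are my own choices: per spatial dimension, learn only the strictly lower-triangular entries of each block, antisymmetrize to get skew-symmetric generators, and map a position to a rotation via `matrix_exp` of the position-weighted sum of generators.

```python
import torch
from torch import nn

class LieRERotation(nn.Module):
    """Sketch of LieRE-style learned rotations (my reading of the paper,
    not the authors' reference implementation)."""

    def __init__(self, head_dim: int, pos_dim: int, block_size: int = 8):
        super().__init__()
        assert head_dim % block_size == 0
        self.block_size = block_size
        self.num_blocks = head_dim // block_size
        # indices of the strictly lower triangle of one block
        self.register_buffer(
            'tril', torch.tril_indices(block_size, block_size, offset=-1),
            persistent=False,
        )
        n_params = self.tril.shape[1]
        # learnable generator parameters: (pos_dim, num_blocks, n_params)
        self.theta = nn.Parameter(torch.randn(pos_dim, self.num_blocks, n_params) * 0.02)

    def generators(self) -> torch.Tensor:
        # assemble skew-symmetric blocks: (pos_dim, num_blocks, b, b)
        b = self.block_size
        A = self.theta.new_zeros(self.theta.shape[0], self.num_blocks, b, b)
        A[..., self.tril[0], self.tril[1]] = self.theta
        return A - A.transpose(-1, -2)

    def rotations(self, positions: torch.Tensor) -> torch.Tensor:
        # positions: (..., pos_dim) -> block rotation matrices (..., num_blocks, b, b)
        A = self.generators()
        skew = torch.einsum('...p,pnij->...nij', positions, A)  # linear in position
        return torch.linalg.matrix_exp(skew)  # exp of skew-symmetric = rotation

    def rotate(self, x: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
        # x: (..., seq, head_dim), positions: (..., seq, pos_dim)
        R = self.rotations(positions)                             # (..., seq, nb, b, b)
        xb = x.unflatten(-1, (self.num_blocks, self.block_size))  # (..., seq, nb, b)
        return torch.einsum('...nij,...nj->...ni', R, xb).flatten(-2)

# usage sketch: rotate queries and keys of a ViT-style attention with 2D positions
rot = LieRERotation(head_dim=64, pos_dim=2)
q = k = torch.randn(1, 196, 64)
pos = torch.rand(1, 196, 2)
q, k = rot.rotate(q, pos), rot.rotate(k, pos)
```

If I understand it correctly, the block-diagonal structure means we only ever exponentiate small b × b matrices (via batched `matrix_exp`), so the cost stays modest, and a block-sparse / batched matmul for the final rotation would be the natural place to optimize. Does this match how you would approach it?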