
Question: rotary embeddings and bad length extrapolation

pfeatherstone opened this issue on Feb 19 '24 · 1 comment

In my tests, I've found that rotary embeddings don't length-extrapolate well. To be fair, you do mention this in your README. You suggest setting rotary_xpos = True, which should fix this, but then attention becomes local.

Is this the best way to get good length extrapolation in a transformer network? Or is there a better positional embedding that doesn't suffer from this, yet still works with flash attention and key-value mems?

I'll try using rotary_xpos, but I don't like the idea of shrinking the effective context length from something potentially very large to something small.
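
For reference, this is roughly the configuration I'm testing, a minimal sketch following the README examples. The attn_flash flag and the commented-out rotary_xpos_scale_base knob reflect my reading of the library and the exact names may differ across versions:

```python
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        attn_flash = True,   # I'd like to keep flash attention
        rotary_xpos = True   # xpos-scaled rotary, suggested in the README for extrapolation
        # rotary_xpos_scale_base = 512   # (assumed name) controls how quickly the xpos decay kicks in, i.e. how "local" attention becomes
    )
)

tokens = torch.randint(0, 20000, (1, 1024))
logits = model(tokens)   # (1, 1024, 20000)
```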

Thank you

pfeatherstone · Feb 19 '24 10:02

Other candidates are ALiBi or no positional embedding at all. For the latter, does the model need to be trained on a range of sequence lengths in order to learn to length-extrapolate well?
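
For the ALiBi option, I believe the usage is along these lines (sketch adapted from the README; the alibi_num_heads setting lets some heads stay unbiased so they can still attend at long range, and the comment about disabling absolute positions is my assumption):

```python
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        alibi_pos_bias = True,   # linear distance penalty on attention scores instead of rotary
        alibi_num_heads = 4      # only bias half the heads, so the other heads can still attend far back
    )
)

# For the "no positional embedding at all" candidate, the idea would be to turn off
# every positional option and rely on the causal mask alone, presumably via something
# like use_abs_pos_emb = False (name assumed), while training on a mix of sequence lengths.
```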

pfeatherstone · Feb 19 '24 10:02