Mega-pytorch
Implementation of Mega, the Single-head Attention with Multi-headed EMA architecture that currently holds SOTA on Long Range Arena
Mega-pytorch issues
1. For https://arxiv.org/pdf/2209.10655.pdf#page=21, why use `x = sqrt(2)` specifically? Wouldn't it be easier to just use `x = 1`? [image]
2. In https://arxiv.org/pdf/2109.08668.pdf#page=5, I do...