3 comments by gaganbahga

@simonaxelrod I observed something similar. One more observation: in the code that you linked, the error decreases by a lot (to ~1.5%) if, in calculating the standard attention,...

> @Mazgis47 the only big gotcha is that with a Performer-like architecture, the masking will be different. Instead of deriving the N x N mask then setting the dot...

I don't believe that's possible, because the order of computation is `(Q' (K'^T V))`: the N x N attention matrix is never materialized, so there is nothing to apply an elementwise mask to. Would be interesting to know if someone has a different idea/workaround.
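To illustrate the point about the order of computation, here is a minimal NumPy sketch (toy shapes and variable names are mine, not from the linked code): the two association orders give the same result, but only the quadratic order ever forms the N x N matrix that a mask could be applied to.

```python
import numpy as np

# Toy sizes: N tokens, m random features, value dim d (illustrative only).
N, m, d = 6, 4, 3
rng = np.random.default_rng(0)
Qp = rng.random((N, m))  # Q' = phi(Q), non-negative feature map
Kp = rng.random((N, m))  # K' = phi(K)
V = rng.random((N, d))

# Quadratic order: (Q' K'^T) V materializes an N x N matrix,
# which could in principle be masked elementwise.
A = Qp @ Kp.T            # shape (N, N)
out_quadratic = A @ V

# Performer order: Q' (K'^T V). The intermediate K'^T V is only
# m x d, so the N x N matrix never exists and cannot be masked.
out_linear = Qp @ (Kp.T @ V)

assert np.allclose(out_quadratic, out_linear)
```

Both orders agree numerically by associativity; the linear-cost order simply skips the only place where an arbitrary elementwise mask could be inserted.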