gaganbahga
@simonaxelrod I observed something similar. One further observation: in the code you linked, the error decreases substantially (to ~1.5%) if, in calculating the standard attention,...
> @Mazgis47 the only big gotcha is that with a Performer-like architecture, the masking will be different. Instead of deriving the N x N mask and then setting the dot...
I don't believe that's possible, because the order of computation is `(Q' (K'^T V))`. It would be interesting to know if someone has a different idea/workaround.
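To illustrate the point: a minimal NumPy sketch (shapes and names are mine, purely for illustration) showing why the `(Q' (K'^T V))` association never materializes the N x N attention matrix that a pointwise mask would be applied to.

```python
import numpy as np

# Hypothetical shapes for illustration: N tokens, feature dimension d
N, d = 6, 4
rng = np.random.default_rng(0)
Q_p = rng.random((N, d))  # Q' = feature map applied to queries
K_p = rng.random((N, d))  # K' = feature map applied to keys
V = rng.random((N, d))

# Standard order materializes an N x N matrix -- a mask could be applied here.
A = Q_p @ K_p.T            # (N, N)
out_standard = A @ V

# Linear-attention order: K'^T V is only (d, d), so the N x N matrix
# (and any pointwise mask on it) never exists in this computation order.
KV = K_p.T @ V             # (d, d)
out_linear = Q_p @ KV

# Matrix multiplication is associative, so both orders agree numerically;
# only the first one exposes the N x N matrix.
print(np.allclose(out_standard, out_linear))
```

This is just the associativity argument made concrete: the efficiency of the Performer-style order comes precisely from skipping the N x N intermediate, which is also why an N x N mask has nowhere to attach.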