gaganbahga
@simonaxelrod I observed something similar. One further observation: in the code you linked, the error decreases substantially (to ~1.5%) if, in calculating the standard attention,...
> @Mazgis47 the only big gotcha is that with a Performer-like architecture, the masking will be different. Instead of deriving the N x N mask and then setting the dot...
I don't believe that's possible, because the order of computation is `(Q' (K'^T V))`. It would be interesting to know if someone has a different idea/workaround.
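To illustrate the point: a minimal NumPy sketch (shapes and names are mine, purely for illustration) showing why the `(Q' (K'^T V))` association never materializes the N x N attention matrix that a pointwise mask would be applied to.

```python
import numpy as np

# Hypothetical shapes for illustration: N tokens, feature dimension d
N, d = 6, 4
rng = np.random.default_rng(0)
Q_p = rng.random((N, d))  # Q' = feature map applied to queries
K_p = rng.random((N, d))  # K' = feature map applied to keys
V = rng.random((N, d))

# Standard order materializes an N x N matrix -- a mask could be applied here.
A = Q_p @ K_p.T            # (N, N)
out_standard = A @ V

# Linear-attention order: K'^T V is only (d, d), so the N x N matrix
# (and any pointwise mask on it) never exists in this computation order.
KV = K_p.T @ V             # (d, d)
out_linear = Q_p @ KV

# Matrix multiplication is associative, so both orders agree numerically;
# only the first one exposes the N x N matrix.
print(np.allclose(out_standard, out_linear))
```

This is just the associativity argument made concrete: the efficiency of the Performer-style order comes precisely from skipping the N x N intermediate, which is also why an N x N mask has nowhere to attach.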