Ahmedd-Wahdan
Yeah, apparently some implementations apply this optimization: they feed the MLP layer only the attention output for the last token, rather than the outputs for every token in the sequence.
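For anyone wondering why that shortcut is safe, here is a minimal sketch, assuming a toy position-wise MLP with a ReLU activation and made-up weights (none of these names or numbers come from a real implementation). Because the MLP processes each token independently, running it on the last token alone should give the same vector as the last row of the full computation. This presumably only helps where just the last position is needed afterwards, such as the final block when predicting the next token.

```python
def linear(vec, weight, bias):
    # weight is a list of output rows; each row has len(vec) entries
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def mlp(vec, w1, b1, w2, b2):
    # Position-wise feed-forward: expand, ReLU, project back.
    # ReLU is an assumption here; real models often use GELU or similar.
    hidden = [max(0.0, h) for h in linear(vec, w1, b1)]
    return linear(hidden, w2, b2)


# Hypothetical attention output: 3 tokens, model width 2.
attn_out = [[0.1, -0.2], [0.5, 0.3], [-0.4, 0.9]]

# Hypothetical weights: 2 -> 4 -> 2.
w1 = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.6], [0.1, 0.1]]
b1 = [0.0, 0.1, -0.1, 0.0]
w2 = [[0.3, -0.2, 0.1, 0.5], [-0.4, 0.2, 0.6, -0.1]]
b2 = [0.05, -0.05]

# Baseline: run the MLP on every token, then keep the last row.
full = [mlp(tok, w1, b1, w2, b2) for tok in attn_out]

# Optimization: run the MLP on the last token only.
last_only = mlp(attn_out[-1], w1, b1, w2, b2)

print(last_only == full[-1])
```

With 3 tokens the saving is trivial, but the MLP cost of the full version grows with sequence length while the last-token version stays constant.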