Ahmedd-Wahdan


Yeah, apparently some implementations do this optimization: they feed the MLP layer only the attention output for the last token.
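Here is a rough sketch of why that works, not taken from any particular implementation. It assumes only the last position's output is needed afterwards (for example in the final block when predicting the next token). Since the MLP is applied to each token independently, running it on the last token alone gives the same vector as running it on the whole sequence and keeping the last row. All names and sizes (`mlp`, `attn_out`, `d_model`, `d_ff`) are made up for the example, and the random `attn_out` just stands in for a real attention output.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mlp(x, W1, W2):
    """Two-layer MLP with ReLU, applied to a single token vector."""
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)

def rand_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1]."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
seq_len, d_model, d_ff = 4, 3, 8  # hypothetical toy sizes

# Stand-in for the attention output: one vector per token.
attn_out = rand_matrix(seq_len, d_model, rng)
W1 = rand_matrix(d_ff, d_model, rng)
W2 = rand_matrix(d_model, d_ff, rng)

# Baseline: run the MLP on every token, then keep the last row.
full = [mlp(tok, W1, W2) for tok in attn_out]
last_from_full = full[-1]

# Optimization: run the MLP on the last token only.
last_only = mlp(attn_out[-1], W1, W2)

# Both paths perform the same arithmetic on the same vector.
same = all(math.isclose(a, b) for a, b in zip(last_from_full, last_only))
print(same)
```

In this toy setup the shortcut does one MLP call instead of `seq_len` of them. It would not be valid for a block whose full per-token output is still consumed by a later attention layer.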