Ahmedd-Wahdan comments

Ahmedd-Wahdan

Yeah, apparently some implementations do this optimization: they feed the MLP layer only the attention output for the last token.
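Here is a rough sketch of why that can work, not taken from any particular implementation. The MLP in a transformer block is applied to each position independently, so if only the last token's output is used downstream (my assumption here: for example in the final block when predicting the next token), running the MLP on just that row gives the same result as running it on every row and keeping the last one. All names below (`mlp_all_tokens`, `mlp_last_token`, `make_params`) and the sizes are hypothetical, and the attention output is just random numbers standing in for a real one.

```python
import random

def linear(x, w, b):
    # x: vector of length d_in, w: d_in x d_out matrix, b: vector of length d_out
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]

def mlp(x, params):
    # Two-layer position-wise MLP with ReLU, applied to a single token vector.
    w1, b1, w2, b2 = params
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)

def mlp_all_tokens(attn_out, params):
    # Baseline: run the MLP on every position (seq_len MLP calls).
    return [mlp(row, params) for row in attn_out]

def mlp_last_token(attn_out, params):
    # Optimization: run the MLP only on the last position (one MLP call).
    return mlp(attn_out[-1], params)

def make_params(d_model, d_hidden, rng):
    # Hypothetical random weights, only for illustration.
    w1 = [[rng.uniform(-1, 1) for _ in range(d_hidden)] for _ in range(d_model)]
    b1 = [rng.uniform(-1, 1) for _ in range(d_hidden)]
    w2 = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
    b2 = [rng.uniform(-1, 1) for _ in range(d_model)]
    return w1, b1, w2, b2

rng = random.Random(0)
seq_len, d_model, d_hidden = 5, 4, 8

# Stand-in for the attention output: one d_model vector per token.
attn_out = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(seq_len)]
params = make_params(d_model, d_hidden, rng)

full = mlp_all_tokens(attn_out, params)
last_only = mlp_last_token(attn_out, params)

# The last row of the full result matches the last-token-only result.
print(full[-1] == last_only)
```

Note that the attention step itself still has to see all the tokens as keys and values; the saving here is only in the MLP, which goes from `seq_len` calls to one.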