nanoGPT
Why does a transformer compute outputs for all tokens but use only the last token for prediction?
The input to the transformer is (B, T) and the output after the final MLP is also (B, T, ...), yet we only use the embedding at the last position to predict the next token. Why can't we do something with the embeddings of the other positions? It's my first time learning transformers.
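Roughly what I mean, as a minimal sketch (the shapes assume a nanoGPT-style model whose forward pass returns logits of shape (B, T, vocab_size); the random tensors here just stand in for real data and a real model):

```python
import torch

B, T, vocab_size = 4, 8, 50304
idx = torch.randint(vocab_size, (B, T))       # input token ids, shape (B, T)
logits = torch.randn(B, T, vocab_size)        # stand-in for model(idx) output
next_token_logits = logits[:, -1, :]          # only the last position is used
next_token = torch.argmax(next_token_logits, dim=-1)  # shape (B,)
```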
try it to find out
I was wondering the same thing. It is useful during training to calculate the loss and adjust weights, but during prediction it seems the other token predictions are a waste.
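Something like this is what I mean by "useful during training" (a minimal sketch, assuming logits of shape (B, T, vocab_size) and next-token targets of shape (B, T), which is how nanoGPT computes its loss):

```python
import torch
import torch.nn.functional as F

B, T, vocab_size = 4, 8, 50304
logits = torch.randn(B, T, vocab_size)        # model output for every position
targets = torch.randint(vocab_size, (B, T))   # next-token target at every position
# every one of the B*T positions contributes to the loss, not just the last one
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
```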
Yeah, apparently some implementations do this optimization and feed only the last token's output into the final projection at inference time.
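A rough sketch of that kind of optimization (not the exact nanoGPT code; here `x` stands for the (B, T, C) hidden states from the last transformer block and `lm_head` for the vocabulary projection):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, C, vocab_size = 4, 8, 64, 50304
x = torch.randn(B, T, C)          # hidden states from the last transformer block
targets = None                    # None at inference time, (B, T) token ids during training
lm_head = nn.Linear(C, vocab_size, bias=False)

if targets is not None:
    # training: project every position and score all of them against the targets
    logits = lm_head(x)                                   # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
else:
    # inference: only the last position is needed to pick the next token,
    # so skip the projection for the other T-1 positions
    logits = lm_head(x[:, [-1], :])                       # (B, 1, vocab_size)
    loss = None
```

Note that the attention layers themselves still process all T tokens (the last position attends to every earlier one); the saving is only in the final projection to the vocabulary.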