
Why does the transformer compute outputs for all tokens but then use only the last token for prediction?

Open Ahmedd-Wahdan opened this issue 1 year ago • 3 comments

The input to the transformer is (B, T) and the output is also (B, T) per position, yet we only use the embedding of the last column to predict the next token. Why can't we do something with the embeddings of the other tokens? It's my first time learning transformers.

Ahmedd-Wahdan avatar Jan 09 '25 13:01 Ahmedd-Wahdan
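For context: because of the causal attention mask, the output at position t can only attend to tokens 0..t, so the logits at position t are the model's prediction for the token at position t+1. During training every one of those predictions is used, because the targets are just the inputs shifted one position to the left. A minimal shape sketch (PyTorch, illustrative names, not nanoGPT's exact API):

```python
import torch

B, T, vocab_size = 2, 8, 50304                 # batch, sequence length, vocab (illustrative)
idx = torch.randint(vocab_size, (B, T + 1))    # a chunk of T+1 token ids

x = idx[:, :-1]        # inputs,  shape (B, T)
y = idx[:, 1:]         # targets, shape (B, T) -- inputs shifted one to the left

# logits = model(x) would have shape (B, T, vocab_size).
# Thanks to the causal mask, logits[:, t, :] only depends on x[:, :t+1],
# so it is trained to predict y[:, t], i.e. the next token at every position.
# Only at generation time do we need just logits[:, -1, :].
print(x.shape, y.shape)  # torch.Size([2, 8]) torch.Size([2, 8])
```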

try it to find out

bdytx5 avatar Mar 16 '25 15:03 bdytx5

I was wondering the same thing. It is useful during training to calculate the loss and adjust weights, but during prediction it seems the other token predictions are a waste.

captaindeadpool53 avatar May 09 '25 19:05 captaindeadpool53
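Right: during training the predictions at all T positions are used at once, which is what makes transformer training so parallel. nanoGPT computes the cross-entropy over every position in one call; a sketch of that pattern, assuming `logits` of shape (B, T, vocab_size) and integer `targets` of shape (B, T):

```python
import torch
import torch.nn.functional as F

B, T, vocab_size = 2, 8, 50304
logits = torch.randn(B, T, vocab_size)         # stand-in for the model output
targets = torch.randint(vocab_size, (B, T))    # next-token ids at every position

# Flatten the (B, T) positions into B*T independent classification problems,
# so every position's prediction contributes to the loss, not just the last one.
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
print(loss.item())
```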

Yeah, apparently some implementations do this optimization at inference time: they feed only the last token's attention output into the final projection.

Ahmedd-Wahdan avatar May 09 '25 21:05 Ahmedd-Wahdan
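nanoGPT itself does something along these lines in `GPT.forward`: when no targets are given, it applies the `lm_head` only to the final position. A small sketch of that pattern (stand-in tensors, not the full module):

```python
import torch
import torch.nn as nn

B, T, n_embd, vocab_size = 2, 8, 768, 50304
x = torch.randn(B, T, n_embd)                  # transformer block output, stand-in values
lm_head = nn.Linear(n_embd, vocab_size, bias=False)

# Training: project every position, logits shape (B, T, vocab_size).
logits_train = lm_head(x)

# Inference: only the last position is needed to sample the next token,
# so project just that one; the list index [-1] keeps the time dimension.
logits_infer = lm_head(x[:, [-1], :])          # shape (B, 1, vocab_size)
print(logits_train.shape, logits_infer.shape)
```

Note the saving is only in the final projection; the attention layers still run over all T positions of the prompt (avoiding that recomputation across generation steps is what a KV cache is for).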