vit-pytorch
cls_token
All samples in a batch share the same cls_token (in the code, the cls_token is repeated batch_size times), so how do they become different during the backward pass? Since the cls_token is used as the classifier input, won't all samples in a batch be classified with the same label?
The CLS token is passed through the attention layers and aggregates information from the rest of the tokens as it makes its way up, so its output is different for every sample even though its initial value is shared.
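A minimal sketch of that point (the sizes and the single attention layer are hypothetical, not vit-pytorch's actual configuration): the CLS input is identical for both samples in the batch, but after one attention layer its output differs per sample, because attention mixes in each sample's own patch tokens.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch, num_patches, dim = 2, 4, 8  # hypothetical sizes for illustration

# cls_token is a learned parameter, identical for every sample
cls_token = nn.Parameter(torch.randn(1, 1, dim))
patch_tokens = torch.randn(batch, num_patches, dim)  # per-sample features

# prepend the (shared) CLS token to each sample's patch tokens
tokens = torch.cat([cls_token.expand(batch, -1, -1), patch_tokens], dim=1)

attn = nn.MultiheadAttention(dim, num_heads=2, batch_first=True)
out, _ = attn(tokens, tokens, tokens)

cls_out = out[:, 0]  # CLS position after attention
# identical CLS inputs, different CLS outputs per sample
print(torch.allclose(cls_out[0], cls_out[1]))
```

The printed result is False: the shared parameter does not force a shared classifier input, because by the time the CLS vector reaches the head it has already attended to sample-specific tokens.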
I had the same question, and here is my illustration of it. Remember that cls_token is a parameter, not a feature of the input. We can think of it as the starting point for producing the final label, with self-attention and the MLP acting as information-aggregating procedures. By comparing y_hat = f(cls_token, params | input) with the true label y, the cls_token and the other parameters are updated so that the model learns an effective way of aggregating information from the input.
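The learning part can be sketched the same way (again with made-up sizes and a toy one-layer model, not the actual vit-pytorch code): because cls_token is an nn.Parameter, the loss backward pass gives it a gradient just like any other weight, which is how the shared token is "updated to learn the effective way of aggregating" information.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, num_patches, dim, num_classes = 2, 4, 8, 3  # hypothetical sizes

cls_token = nn.Parameter(torch.zeros(1, 1, dim))  # a learned parameter
attn = nn.MultiheadAttention(dim, num_heads=2, batch_first=True)
head = nn.Linear(dim, num_classes)  # classifier on top of the CLS output

patches = torch.randn(batch, num_patches, dim)
tokens = torch.cat([cls_token.expand(batch, -1, -1), patches], dim=1)
out, _ = attn(tokens, tokens, tokens)
logits = head(out[:, 0])  # per-sample logits from the CLS position

labels = torch.tensor([0, 2])
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()

# the shared cls_token receives a gradient and is trained like any weight
print(cls_token.grad is not None)
```

Note that the gradient on cls_token is summed over the batch (expand broadcasts it), so the single shared token is shaped by every training sample at once.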