Question about clip.encode_text
To whom it may concern,
I'm reading the code of CLIP. It is simple but wonderful. However, I found that line 350 in clip/model.py:

x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection

is a little confusing to me. May I ask why this position (text.argmax(dim=-1)) is used out of the 77 positions of the tokenized text? Why not use the average over all positions? Thanks a lot.

P.S. My understanding of text.argmax(dim=-1): it indicates the location of the end of the input text.
Best
You're correct that the argmax operation takes the representation at the EOT position. There's nothing inherently wrong with taking the average along the sequence dimension, but taking the representation at the position of a special token (e.g. the CLS token in ViT and BERT) is empirically known to work better.
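For concreteness, here is a minimal sketch (toy shapes and made-up EOT positions; only the indexing expression from clip/model.py is taken from the source) contrasting the EOT pooling CLIP uses with the sequence-mean pooling raised in the question:

```python
import torch

# Hypothetical toy shapes: batch of 4 prompts, context length 77, width 512.
B, L, D = 4, 77, 512
x = torch.randn(B, L, D)                       # stand-in for the transformer output
text = torch.randint(1, 49406, (B, L))         # fake token ids, all below the EOT id
eot_pos = torch.tensor([5, 9, 12, 7])          # pretend EOT locations per prompt
text[torch.arange(B), eot_pos] = 49407         # EOT (49407) is the largest id in each row

# EOT pooling, as in clip/model.py: pick the feature at each prompt's EOT position.
eot_features = x[torch.arange(B), text.argmax(dim=-1)]   # shape [B, D]

# The alternative raised in the question: average over the sequence dimension.
mean_features = x.mean(dim=1)                            # shape [B, D]

print(eot_features.shape, mean_features.shape)
```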
Hi, I also have a question. Does that make representations in other locations meaningless? Since they are not supervised by any loss, the network can output arbitrary values for these representations.
The other representations are still used, since in each attention layer the [EOT] token attends to every other position.
The position of the [EOT] token is different for texts with different lengths; doesn't this confuse the learning of the position embedding?
@ygfrancois
The position of the [EOT] token is different for texts with different lengths; doesn't this confuse the learning of the position embedding?
The [EOT] token is 49407 in this situation, which is the largest number in the tokenized_prompts (i.e. text), so we can use text.argmax(dim=-1) to determine the position of [EOT].
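As a small illustration, assuming the openai/CLIP package is installed (clip.tokenize pads every prompt to 77 tokens, with 49406 as the start token and 49407 as the end token; the exact positions printed depend on the prompts):

```python
import clip

text = clip.tokenize(["a photo of a cat", "a photo of a very fluffy dog"])  # shape [2, 77]
print((text == 49407).nonzero())   # positions of the EOT token in each row
print(text.argmax(dim=-1))         # same positions, since 49407 is the largest id per row
```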
@ygfrancois The position of [EOT] token is different for text with diff length, this dose not confuse the learning of position embedding? the [ETO] token is 49407 in this situation, which is the largest number in the tokenized_prompts (i.e text), so we can use text.argmax(dim=-1) to determine the position of [EOT]
Thank you @LikeGiver, I understand this point now. argmax is used to locate the index (i_eot) of the [EOT] token in the tokenized prompts. Once we have located it, we use the [EOT] feature, x[batch_index, i_eot], to represent the whole prompt.
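To spell the indexing out (a toy sketch with hypothetical shapes and EOT positions): x[torch.arange(B), i_eot] gathers, for each sample in the batch, the feature vector at that sample's EOT position, which is the same as a per-sample loop:

```python
import torch

B, L, D = 3, 77, 512
x = torch.randn(B, L, D)               # stand-in for the per-token text features
i_eot = torch.tensor([4, 10, 6])       # hypothetical EOT positions for each prompt

pooled = x[torch.arange(B), i_eot]                            # shape [3, 512]
looped = torch.stack([x[b, i_eot[b]] for b in range(B)])      # equivalent per-sample loop
assert torch.equal(pooled, looped)
```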
I have the same question. In my prompts, the token 49407 marks the end of each prompt and does not carry the meaning of the text itself. For example, for [49406, 518, 34606, 771, 4267, 7863, 6898, 518, 4960, 2445, 537, 791, 1025, 33811, 538, 26878, 49407, 0, 0, 0, 0, ..., 0], I think I should pool over only the meaningful tokens, excluding 49406 and 49407. Would that work?
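If you want to try that, one option is masked mean pooling over only the content tokens. This is just a sketch of that idea, not something CLIP does; since the pretrained text encoder was trained with EOT pooling, swapping the pooling without fine-tuning may hurt the text/image alignment:

```python
import torch

# Toy tensors; `tokens` follows the prompt in the comment above, padded to length 77.
tokens = torch.tensor([[49406, 518, 34606, 771, 4267, 7863, 6898, 518, 4960, 2445,
                        537, 791, 1025, 33811, 538, 26878, 49407] + [0] * 60])
B, L = tokens.shape
D = 512
x = torch.randn(B, L, D)               # stand-in for the transformer output

# Keep only "meaningful" tokens: drop padding (0), SOT (49406), and EOT (49407).
mask = (tokens != 0) & (tokens != 49406) & (tokens != 49407)          # shape [B, L]
pooled = (x * mask.unsqueeze(-1)).sum(dim=1) / mask.sum(dim=1, keepdim=True)
print(pooled.shape)                     # [1, 512]
```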