Question about clip.encode_text
To whom it may concern,
I'm reading the code of CLIP. It is simple but wonderful. However, I found that line 350 in clip/model.py:

x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection

is a little confusing to me. May I ask why this position (text.argmax(dim=-1)) is used out of the 77 positions of the tokenized text? Why not use the average over all positions? Thanks a lot.

P.S. My understanding of text.argmax(dim=-1): it indicates the location of the end of the input text.
Best
You're correct that the argmax operation takes the representation at the EOT position. There's nothing inherently wrong with taking the average along the sequence dimension, but taking the representation at the position of a special token (e.g. the CLS token in ViT and BERT) is empirically known to work better.
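For concreteness, here is a minimal sketch (toy shapes and made-up EOT positions; only the indexing expression from clip/model.py is taken from the source) contrasting the EOT pooling CLIP uses with the sequence-mean pooling raised in the question:

```python
import torch

# Hypothetical toy shapes: batch of 4 prompts, context length 77, width 512.
B, L, D = 4, 77, 512
x = torch.randn(B, L, D)                       # stand-in for the transformer output
text = torch.randint(1, 49406, (B, L))         # fake token ids, all below the EOT id
eot_pos = torch.tensor([5, 9, 12, 7])          # pretend EOT locations per prompt
text[torch.arange(B), eot_pos] = 49407         # EOT (49407) is the largest id in each row

# EOT pooling, as in clip/model.py: pick the feature at each prompt's EOT position.
eot_features = x[torch.arange(B), text.argmax(dim=-1)]   # shape [B, D]

# The alternative raised in the question: average over the sequence dimension.
mean_features = x.mean(dim=1)                            # shape [B, D]

print(eot_features.shape, mean_features.shape)
```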
Hi, I also have a question. Does that make representations in other locations meaningless? Since they are not supervised by any loss, the network can output arbitrary values for these representations.
The other representations are still used, since in each attention layer the [EOT] token attends to every other position.
The position of the [EOT] token is different for texts with different lengths; doesn't this confuse the learning of the position embedding?
@ygfrancois
The position of the [EOT] token is different for texts with different lengths; doesn't this confuse the learning of the position embedding?
The [EOT] token is 49407 in this situation, which is the largest number in the tokenized_prompts (i.e. text), so we can use text.argmax(dim=-1) to determine the position of [EOT].
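As a small illustration, assuming the openai/CLIP package is installed (clip.tokenize pads every prompt to 77 tokens, with 49406 as the start token and 49407 as the end token; the exact positions printed depend on the prompts):

```python
import clip

text = clip.tokenize(["a photo of a cat", "a photo of a very fluffy dog"])  # shape [2, 77]
print((text == 49407).nonzero())   # positions of the EOT token in each row
print(text.argmax(dim=-1))         # same positions, since 49407 is the largest id per row
```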
@ygfrancois The position of [EOT] token is different for text with diff length, this dose not confuse the learning of position embedding? the [ETO] token is 49407 in this situation, which is the largest number in the tokenized_prompts (i.e text), so we can use text.argmax(dim=-1) to determine the position of [EOT]
Thank you @LikeGiver, I understand this point now. argmax is used to locate the index (i_eot) of the [EOT] token in the tokenized prompts. Once we have located it, we use the [EOT] feature, x[batch_index, i_eot], to represent the whole prompt.
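To spell the indexing out (a toy sketch with hypothetical shapes and EOT positions): x[torch.arange(B), i_eot] gathers, for each sample in the batch, the feature vector at that sample's EOT position, which is the same as a per-sample loop:

```python
import torch

B, L, D = 3, 77, 512
x = torch.randn(B, L, D)               # stand-in for the per-token text features
i_eot = torch.tensor([4, 10, 6])       # hypothetical EOT positions for each prompt

pooled = x[torch.arange(B), i_eot]                            # shape [3, 512]
looped = torch.stack([x[b, i_eot[b]] for b in range(B)])      # equivalent per-sample loop
assert torch.equal(pooled, looped)
```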
I have the same question. In my prompts, the token 49407 marks the end of each prompt and does not carry the meaning of the text itself. For example, for [49406, 518, 34606, 771, 4267, 7863, 6898, 518, 4960, 2445, 537, 791, 1025, 33811, 538, 26878, 49407, 0, 0, 0, 0, ..., 0], I think I should pool over only the meaningful tokens, excluding 49406 and 49407. Would that work?
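If you want to try that, one option is masked mean pooling over only the content tokens. This is just a sketch of that idea, not something CLIP does; since the pretrained text encoder was trained with EOT pooling, swapping the pooling without fine-tuning may hurt the text/image alignment:

```python
import torch

# Toy tensors; `tokens` follows the prompt in the comment above, padded to length 77.
tokens = torch.tensor([[49406, 518, 34606, 771, 4267, 7863, 6898, 518, 4960, 2445,
                        537, 791, 1025, 33811, 538, 26878, 49407] + [0] * 60])
B, L = tokens.shape
D = 512
x = torch.randn(B, L, D)               # stand-in for the transformer output

# Keep only "meaningful" tokens: drop padding (0), SOT (49406), and EOT (49407).
mask = (tokens != 0) & (tokens != 49406) & (tokens != 49407)          # shape [B, L]
pooled = (x * mask.unsqueeze(-1)).sum(dim=1) / mask.sum(dim=1, keepdim=True)
print(pooled.shape)                     # [1, 512]
```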