
Issue with number of tokens for CoCa

vturrisi opened this issue 2 years ago • 17 comments

Hey, when calling _encode_image from CoCa, it should return two tensors: the image-level features (cls token / global average) and the individual token features, which should be (image_size / 14) ** 2 = 256 tokens, right? However, it's only returning 255 tokens, so it seems like there's a token missing. I've attached a minimal example below.

import open_clip
import timm
from PIL import Image

(
    model,
    _,
    val_transform,
) = open_clip.create_model_and_transforms(
    "coca_ViT-L-14",
    pretrained="laion2B-s13B-b90k",
    precision="amp",
)
model.to("cuda")
model.eval()

img = Image.new("RGB", (224, 224))
img = val_transform(img).unsqueeze(0).to("cuda")

latent, tokens = model._encode_image(img)
print(latent.shape)  # torch.Size([1, 768])
print(tokens.shape)  # torch.Size([1, 255, 768])

model = timm.create_model("vit_large_patch14_224_clip_laion2b", pretrained=True)
model.to("cuda")
model.eval()
tokens = model.forward_features(img)
print(
    tokens.shape
)  # torch.Size([1, 257, 1024]) cls token + 256 tokens (224/14 * 224/14)

vturrisi avatar Mar 06 '23 23:03 vturrisi

Hi @vturrisi, so there are two things. The output from _encode_image in CoCa has already been passed through attentional pooling, so in principle it can differ from the corresponding output in CLIP; the pooling is defined in the CoCa config.

On the other hand, it might be that, following the CoCa paper, this should be 256 and I made a small mistake in the config. I have to check, though that would be a coincidence.

gpucce avatar Mar 07 '23 00:03 gpucce

@gpucce @vturrisi the seq len of the output of the attention pooler is determined by the size of the latent query, which is 256 for CoCa in the current config. This is technically no longer a 'class' token plus 'spatial' tokens, but the output is still split like that, so one of the 256 tokens is treated as the pooled embedding and the remaining 255 as the embeds. Not sure if this was the intention, but you no longer have the original sequence length after the pooler.
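
To make the split concrete, here is a rough sketch of what a single attentional pooler with 256 learned queries does to the sequence length; the names and dimensions are illustrative, not the actual open_clip internals:

import torch
import torch.nn as nn

class AttentionalPooler(nn.Module):
    def __init__(self, dim=768, n_queries=256, n_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(n_queries, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x):  # x: ViT output tokens, e.g. (B, 257, dim)
        q = self.query.unsqueeze(0).expand(x.shape[0], -1, -1)
        out, _ = self.attn(q, x, x)  # output length = n_queries, not the input length
        return out

pooler = AttentionalPooler()
x = torch.randn(1, 257, 768)            # cls + 256 patch tokens from the ViT
out = pooler(x)                          # (1, 256, 768), length set by the queries
pooled, tokens = out[:, 0], out[:, 1:]   # 1 "pooled" token + 255 remaining
print(pooled.shape, tokens.shape)        # torch.Size([1, 768]) torch.Size([1, 255, 768])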

rwightman avatar Mar 07 '23 00:03 rwightman

@rwightman @vturrisi this was the intention, because the embedding that makes sense to use for contrastive downstream tasks with CoCa is the one output by the pooler.

The only detail I am not 100% sure about is whether that should have been 257 to match the CoCa paper exactly.

gpucce avatar Mar 07 '23 00:03 gpucce

@gpucce to match the paper, looking at it, it's not a matter of making it 257; it's a matter of needing two separate poolers, one for the contrastive token with n=1 and the other for captioning with n=256. Regardless of what length it is, the behaviour is different if you use one pooler vs two (the paper uses two). And in any case, there is still a disconnect between the original sequence length as determined by the image tokens and the output length after the pooler….
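
Reusing the illustrative AttentionalPooler sketch above, the paper's variant described here would look roughly like this (again just a sketch, not a drop-in patch for open_clip):

contrastive_pooler = AttentionalPooler(dim=768, n_queries=1)  # n=1 for the contrastive embedding
caption_pooler = AttentionalPooler(dim=768, n_queries=256)    # n=256 for the captioning branch

x = torch.randn(1, 257, 768)          # cls + 256 patch tokens from the ViT
pooled = contrastive_pooler(x)[:, 0]  # (1, 768)
tokens = caption_pooler(x)            # (1, 256, 768), full length kept for captioning
print(pooled.shape, tokens.shape)

Each pooler here has its own queries, attention weights, and projections, which is the structural difference from the current single shared pooler.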

rwightman avatar Mar 07 '23 00:03 rwightman

@rwightman ah, now I see. I made a mistake reading the paper; I thought it worked the way I wrote it.

gpucce avatar Mar 07 '23 00:03 gpucce

@gpucce hmm, yeah, that's a pickle. It obviously still works fairly well (not abnormal in the wonderful world of DL / AI). We should add support for two poolers (n=1 and n=256), keeping the current single one as an option for weight compat for now, and do a comparison. I think if anything it'd improve the contrastive behaviour more than the captioning behaviour….

rwightman avatar Mar 07 '23 01:03 rwightman

@rwightman ok, indeed I was thinking about it. I believe that two poolers and one pooler with one extra query are the same, except for the shared linear layer inside MultiheadAttention.

gpucce avatar Mar 07 '23 01:03 gpucce

I don't see how they'd be equivalent with the softmax there...


rwightman avatar Mar 07 '23 01:03 rwightman

I don't see how they'd be equivalent with the softmax there...

@rwightman maybe I am just in denial, however, each row of the attention is the dot product of one query with all keys, the softmax is then taken over each row, and each output vector is the weighted sum of all values based on one of the attention rows.

Since keys and values would be the same in both poolers, this should make one pooler with one extra query equivalent to two poolers. However, the linear layer after the attention, which is applied to all the outputs, could be messing things up, so I need to do as you said anyway.
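
A quick numerical check of this point, using plain scaled dot-product attention with made-up shapes (so it leaves out the shared q/k/v and output projections inside MultiheadAttention, which is exactly the part that might differ):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
k = torch.randn(1, 257, 64)      # keys from the image tokens
v = torch.randn(1, 257, 64)      # values from the image tokens
q_cls = torch.randn(1, 1, 64)    # hypothetical contrastive query
q_cap = torch.randn(1, 256, 64)  # hypothetical caption queries

# attention over the stacked queries ...
joint = F.scaled_dot_product_attention(torch.cat([q_cls, q_cap], dim=1), k, v)

# ... equals running the two query sets separately, because the softmax
# is taken independently for each query row
split = torch.cat([
    F.scaled_dot_product_attention(q_cls, k, v),
    F.scaled_dot_product_attention(q_cap, k, v),
], dim=1)

print(torch.allclose(joint, split, atol=1e-6))  # True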

gpucce avatar Mar 07 '23 06:03 gpucce

@gpucce @rwightman thanks for the fast response. I didn't realize that in the paper they did this attention pooling, and I wanted to play around with the individual visual tokens. Even though the image tokens are not used directly, I think they can still be useful, no? When the fix comes, is there also a way to expose these tokens to the user?

vturrisi avatar Mar 07 '23 07:03 vturrisi

@gpucce right yeah, chunking q is fine and done in other situations so this should be as well … mixed up my dim for the softmax. But yeah, the projections being shared is definitely different btw this and paper, question is how much it matters, might not be that significant…

rwightman avatar Mar 07 '23 08:03 rwightman

@vturrisi I am planning a PR that should improve the huggingface integration along with a few other changes; I will add this as soon as I start working on it and tag you.

gpucce avatar Mar 08 '23 14:03 gpucce

Sounds good!

vturrisi avatar Mar 08 '23 14:03 vturrisi

I don't see how they'd be equivalent with the softmax there...

@rwightman maybe I am just in denial, however, each row of the attention is the dot product of one query with all keys, the softmax is then taken over each row, and each output vector is the weighted sum of all values based on one of the attention rows.

Since keys and values would be the same in both poolers, this should make one pooler with one extra query equivalent to two poolers. However, the linear layer after the attention, which is applied to all the outputs, could be messing things up, so I need to do as you said anyway.

Hi gpucce,

I think there might be a misunderstanding regarding the dual poolers. When there are two poolers involved, it implies that both poolers contain a MultiheadAttention component, each with its own set of projection layers. Therefore, the parameters of these projection layers in the two poolers are different and serve distinct purposes. The Contrastive Pooler is designed with a learnable token to extract information from the keys and values for contrastive learning. On the other hand, the Caption Pooler is equipped with 256 learnable tokens to handle captioning tasks. Given their entirely different objectives, their parameters are expected to vary significantly.

Currently, if the same pooler setup with 256 learnable tokens is being used, where one token is utilized for contrastive learning and the rest for captioning, this setup might lead to suboptimal results or perhaps no impact at all; it's hard to say for certain without further testing. This is my understanding of the paper. If you have time, you might want to experiment with this setup. Thank you for your contribution!

Warm regards,

MrPetrichor avatar May 24 '24 09:05 MrPetrichor

@gpucce right yeah, chunking q is fine and done in other situations so this should be as well … mixed up my dim for the softmax. But yeah, the projections being shared is definitely different btw this and paper, question is how much it matters, might not be that significant…

Hi rwightman,

I've been following your discussion and wanted to share my thoughts as well. I believe that having two poolers might result in a noticeable change in performance metrics. I'm looking forward to the new version of CoCa!

Best regards,

MrPetrichor avatar May 24 '24 09:05 MrPetrichor