open_clip
use `TextEncoder` in coca `encode_image`
This PR adds the possibility to use a CLS token as a model parameter in TextEncoder to simplify the logic of CoCa
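A minimal sketch of the idea, assuming a learnable parameter (called cls_emb here; the name and shapes are assumptions, not the final API) appended to the token embeddings inside the text tower:

```python
import torch
from torch import nn

class TextTowerSketch(nn.Module):
    """Simplified text tower with the CLS token as a model parameter (illustrative only)."""

    def __init__(self, vocab_size: int = 49408, width: int = 512):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, width)
        self.cls_emb = nn.Parameter(torch.empty(width))  # learned CLS token
        nn.init.normal_(self.cls_emb, std=0.01)
        self.ln_final = nn.LayerNorm(width)

    def forward(self, text: torch.Tensor):
        x = self.token_embedding(text)                # [batch, seq, width]
        cls = self.cls_emb.expand(x.shape[0], 1, -1)  # one CLS token per sample
        x = torch.cat([x, cls], dim=1)                # append CLS at the end of the sequence
        # ... transformer blocks would run here ...
        x = self.ln_final(x)
        pooled, tokens = x[:, -1], x[:, :-1]          # CLS output vs. per-token outputs
        return pooled, tokens
```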
@rom1504 please don't merge this yet
@rom1504 this should be ready. Essentially it addresses the points that were raised in the old PR and, as a consequence, adds the possibility for the text-only encoder to be a Hugging Face one; making the multimodal one also available as HF is more difficult and I would do that in a different PR. As usual, reviews, comments and anything else are more than welcome.
Would appreciate your review @rwightman if you think anything big needs to be done.
Made some code review comments, most important points:
- Make CoCa model dual tower (.visual, .text) from the start so builtin and HF text transformer models are the same, no bwd compat to be concerned about
- Revert changes to Attention / CustomResidualAttentionBlock, we should avoid using them with the CoCa model
- Move all models to output dicts instead of tuples; this makes it more flexible to pass outputs through optional losses / filters without brittle indexing and tuple-length checks that are bound to fail with one more addition. Could make the dict output optional at construction time to prevent possible breaks for other downstream users.
and then test test test.
Originally posted by @rwightman in https://github.com/mlfoundations/open_clip/issues/256#issuecomment-1356891072
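Regarding the third point above (output dicts), a minimal sketch of how the opt-in could look; the output_dict flag and key names here are assumptions, not the merged API:

```python
import torch
from torch import nn

class DictOutputSketch(nn.Module):
    """Dict output is opt-in at construction, so downstream users that unpack tuples don't break."""

    def __init__(self, output_dict: bool = False):
        super().__init__()
        self.output_dict = output_dict
        self.logit_scale = nn.Parameter(torch.ones([]) * 2.659)

    def forward(self, image_features: torch.Tensor, text_features: torch.Tensor):
        if self.output_dict:
            return {
                "image_features": image_features,
                "text_features": text_features,
                "logit_scale": self.logit_scale.exp(),
            }
        return image_features, text_features, self.logit_scale.exp()
```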
Hey, nice work! Left some very minor comments but I still need to look at the HF stuff in more detail. I'll do that later
Thanks for looking into it! For HF I didn't add a new Pooler but can do it if you think it is better
For HF stuff from my quick look I have 2 concerns:
- Do all HF models have resize_token_embeddings?
- What is the point of embed_cls? We already have a CLS pooler: https://github.com/gpucce/open_clip/blob/0e46a0199d6aaae69dc5c5089a62e8fe3b9d26ab/src/open_clip/hf_model.py#L65
Re the two concerns:
- resize_token_embeddings should be a method of every PreTrainedModel: https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.resize_token_embeddings
- The reason for embed_cls would be to let the model CLS token and the one used for the contrastive loss be different, and adding this through the HF tokenizer seems more difficult to make stable.
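For the first concern, resize_token_embeddings is defined on transformers.PreTrainedModel, so it should be available on any HF text model once extra special tokens are added; a quick sanity check (the model name and the new token below are just examples):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

# add a hypothetical extra special token, then grow the embedding matrix to match
tokenizer.add_special_tokens({"additional_special_tokens": ["<cls_cap>"]})
model.resize_token_embeddings(len(tokenizer))

assert model.get_input_embeddings().num_embeddings == len(tokenizer)
```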
@gpucce Re point 2: I think we actually want the CLS tokens to be the same. The idea would be to tune the CLS output embedding so that it's useful contrastively vs. just learning a completely new one. I realize this raises the issue of transformers that don't use CLS pooling, but I don't see a problem with diverging from CoCa a bit here, i.e. allowing for different pooling methods (like averaging) and using that as the text embedding for contrastive learning. To stay consistent with the paper we can make CLS pooling the pooling method for the default configuration.
Let me know what you think and also if you're busy. I can get to it myself actually in case you are
Hi, I am getting to it now and I agree with you, I will remove the embed_cls option. Another issue might then be that what is fed to the generative side might not have any <end_of_text> token.
@iejMac would it be bad to add the output_tokens option to the poolers themselves?
Why do we want to add it to the poolers? Can't we just use the poolers to get the pooled embedding but also return the tokens on the side?
I was thinking that with CLS pooling the CLS embedding should not be in the returned tokens, while with mean/max pooling maybe it should? We could also pick from the second token on for all of them.
Ah, do you mean that for CLS pooling the tokens are x[1:] whereas for mean/max it's just x? If so I still think maybe just keep poolers doing only pooling, and then we can add some condition like

```python
if isinstance(pooler, ClsPooler):
    toks = hidden_state[:, 1:]
```

or something like that, I forgot the exact syntax
Hmm it would be nice to test this ClsPooler... do you know a single text encoder on huggingface that uses CLS pooling? Maybe BERT?
I am trying with Roberta, setting pooler_type to cls_pooler in the config, and it seems to work. However, the issue I am having now is that this way the tokens are 1 shorter than the labels, so the captioning loss errors; I'm trying to find a simple way to adjust for it.
EDIT: the test I am doing is `(x.last_hidden_state[:, self.cls_token_position+1, :] == tokens[:, self.cls_token_position, :]).all()`
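For reference, a hedged example of the kind of model config being described here; the layout follows open_clip's HF text configs, but the exact field values are assumptions:

```json
{
  "embed_dim": 512,
  "vision_cfg": {
    "image_size": 224,
    "layers": 12,
    "width": 768,
    "patch_size": 32
  },
  "text_cfg": {
    "hf_model_name": "roberta-base",
    "hf_tokenizer_name": "roberta-base",
    "proj": "linear",
    "pooler_type": "cls_pooler"
  }
}
```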
yeah I guess with CLS pooling you just need to adjust labels to also be shorter
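A minimal sketch of what adjusting the labels could look like in the captioning loss; the tensor names and the exact slicing are assumptions:

```python
import torch.nn.functional as F

def caption_loss_sketch(token_logits, text, pad_id=0):
    # token_logits: [batch, seq_len - 1, vocab], one position shorter because the CLS slot was dropped
    # text:         [batch, seq_len], the input ids used as labels
    labels = text[:, 1:]  # trim labels so they line up with the shorter token sequence
    return F.cross_entropy(
        token_logits.permute(0, 2, 1),  # cross_entropy expects [batch, vocab, seq]
        labels,
        ignore_index=pad_id,            # don't penalize padding positions
    )
```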
Ok so let's summarize this PR a bit. The tasks (from the TODO list in the coca -> main PR) that this PR resolves are:
- Make CoCa fully use the visual encoder and text encoder of open_clip
- Revert changes to Attention / CustomResidualAttentionBlock, we should avoid using them with the CoCa model
- Move all models to output dicts instead of tuples; this makes it more flexible to pass outputs through optional losses / filters without brittle indexing and tuple-length checks that are bound to fail with one more addition. Could make the dict output optional at construction time to prevent possible breaks for other downstream users.
Some notes:
- Cleans up the inference code copied from the coca branch and also adds compatibility with our HFTextEncoder class. We allow building CoCa architectures that differ from the CLS pooling suggested in the original paper simply by using text encoders with different pooling methods.
- Adds the option to use CLS pooling for the TextTransformer class
@gpucce any notes from you?
Nothing more, writing down what I think I missed:
- make the model dual tower: I think I misunderstood the point, and the idea would be to remove `self.multimodal_decoder` and move it within `self.text` (maybe)
- log clip/caption losses separately
Oh, and one last thing: this breaks `.generate`. It should be easy to fix, but since there is also another PR working on it I would keep it as is for now.
https://wandb.ai/iejmac/open-clip/reports/CoCa-ViT-B-32-v2--VmlldzozMjM0Nzg1 running this for a few epochs to see if the loss roughly matches up
@gpucce something looks wrong, new loss is much higher
@gpucce do you think we could log the losses separately in this PR? It would be much nicer for testing
@iejMac sure, I will add it here shortly. I have an idea of what the issue is too; later I will try it a bit and then write here what I find.
@iejMac this should log them separately
@gpucce we should probably also log the total loss since it's weighted, so that might contain important info
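A minimal sketch of returning the pieces separately so that both the individual losses and the weighted total can be logged; the key names and weights here are assumptions:

```python
def coca_losses_sketch(clip_loss, caption_loss, clip_loss_weight=1.0, caption_loss_weight=2.0):
    losses = {
        "contrastive_loss": clip_loss_weight * clip_loss,
        "caption_loss": caption_loss_weight * caption_loss,
    }
    losses["total_loss"] = sum(losses.values())  # weighted total, logged alongside the parts
    return losses
```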
@iejMac my idea about the issue seems to be wrong. Another thing could be that, since pad tokens are now ignored in the loss, it looks higher. Tomorrow I will try to check for other things. I saw your run seems to be doing ok; if you get any evaluation on a checkpoint please share, and sorry for taking a lot of time today.
it's quite possible the loss will go down a little later
@gpucce if you look at the Wandb it looks like this PR consistently achieves a much higher loss and slightly lower zero-shot ImageNet performance than our previous version (in the coca branch). Looking over the code again, do you have any idea if this is expected and, if not, what could be wrong? It looks like the captioning loss is pretty high; too bad we can't compare to the previous run.
@iejMac sorry I didn't have time to work on this sooner and also for wasting some compute:
- the higher loss is due to pad tokens now being ignored; performance is indeed lower but close to the same
- the small decrease in performance is my mistake, I added a projection that doesn't make sense and I am confident a small commit should fix it
Before fixing this (and making the model not loadable again), I added a PR to CLIP_benchmark (https://github.com/LAION-AI/CLIP_benchmark/pull/63) that works with both the coca branch and this one at this stage. I tested the old checkpoint and I get scores like the following:
Bleu_1: 0.225, Bleu_2: 0.086, Bleu_3: 0.037, Bleu_4: 0.017, METEOR: 0.064, ROUGE_L: 0.169, CIDEr: 0.089, SPICE: 0.028
This is not really a good validation, but it should be enough to check whether the generative part of the model is training; running it on an untrained model gets all zeros. If you have time, could you try running this with the old checkpoint as I did, and with the one from the shorter run that finished a couple of days ago if you have one, so we can know that generation is still working?
EDIT: Or of course share the checkpoint and I do it