
use `TextEncoder` in coca `encode_image`

gpucce opened this pull request 2 years ago

This PR adds the option of using a CLS token as a model parameter in `TextEncoder`, to simplify the logic of CoCa.
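
For readers skimming the thread, here is a minimal sketch of what "a CLS token as a model parameter" means, assuming a plain PyTorch text tower (the class name, sizes, and the append-at-the-end choice are illustrative, not the actual open_clip implementation):

import torch
import torch.nn as nn

class TextEncoderWithCLS(nn.Module):
    """Sketch: a learnable CLS embedding owned by the model rather than added as an extra id in the tokenizer vocabulary."""

    def __init__(self, vocab_size=49408, width=512, heads=8, layers=12):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, width)
        self.cls_emb = nn.Parameter(torch.empty(width).normal_(std=0.01))
        layer = nn.TransformerEncoderLayer(width, heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, layers)

    def forward(self, text_ids):
        x = self.token_embedding(text_ids)            # [B, L, D]
        cls = self.cls_emb.expand(x.shape[0], 1, -1)  # [B, 1, D], shared learned CLS
        x = torch.cat([x, cls], dim=1)                # append the CLS position
        x = self.transformer(x)
        pooled, tokens = x[:, -1], x[:, :-1]          # CLS output vs. per-token outputs
        return pooled, tokens

In CoCa, the pooled CLS output feeds the contrastive loss while the per-token outputs feed the multimodal decoder.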

gpucce commented on Dec 24 '22

@rom1504 please don't merge this yet

gpucce commented on Dec 24 '22

@rom1504 this should be ready. Essentially it addresses the points that were raised in the old PR and, as a consequence, adds the possibility for the text-only encoder to be the Hugging Face one; making the multimodal one also available as HF is more difficult, and I would do that in a separate PR. As usual, reviews, comments and anything else are more than welcome.

Would appreciate your review @rwightman if you think anything big needs to be done.

Made some code review comments, most important points:

  • Make the CoCa model dual tower (.visual, .text) from the start so built-in and HF text transformer models are handled the same; there is no backward compatibility to be concerned about
  • Revert changes to Attention / CustomResidualAttentionBlock; we should avoid using them with the CoCa model
  • Move all models to output dicts instead of tuples; this makes it more flexible to pass outputs through optional losses / filters without brittle indexing and tuple-length checks that are bound to fail with one more addition. Could make the dict output optional at construction time to prevent possible breaks for other downstream users.

and then test test test.

Originally posted by @rwightman in https://github.com/mlfoundations/open_clip/issues/256#issuecomment-1356891072
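
As an aside, a minimal sketch of the dict-output idea from the third point above (a simplified stand-in, not the actual open_clip classes):

import torch.nn as nn

class DualTowerModel(nn.Module):
    def __init__(self, visual: nn.Module, text: nn.Module, output_dict: bool = True):
        super().__init__()
        self.visual = visual          # image tower
        self.text = text              # text tower
        self.output_dict = output_dict

    def forward(self, image, text):
        image_features = self.visual(image)
        text_features = self.text(text)
        if self.output_dict:
            # named fields: losses and filters pick what they need, and new
            # fields can be added later without breaking positional unpacking
            return {"image_features": image_features, "text_features": text_features}
        # legacy tuple output, kept behind a constructor flag for downstream users
        return image_features, text_features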

gpucce commented on Dec 27 '22

Hey, nice work! Left some very minor comments but I still need to look at the HF stuff in more detail. I'll do that later

iejMac commented on Dec 28 '22

Hey, nice work! Left some very minor comments but I still need to look at the HF stuff in more detail. I'll do that later

Thanks for looking into it! For HF I didn't add a new Pooler but can do it if you think it is better

gpucce commented on Dec 28 '22

For HF stuff from my quick look I have 2 concerns:

  1. Do all HF models have resize_token_embeddings?
  2. What is the point of embed_cls? We already have a CLS pooler: https://github.com/gpucce/open_clip/blob/0e46a0199d6aaae69dc5c5089a62e8fe3b9d26ab/src/open_clip/hf_model.py#L65

iejMac commented on Dec 28 '22

For HF stuff from my quick look I have 2 concerns:

  1. Do all HF models have resize_token_embeddings?

  2. What is the point of embed_cls? We already have a CLS pooler: https://github.com/gpucce/open_clip/blob/0e46a0199d6aaae69dc5c5089a62e8fe3b9d26ab/src/open_clip/hf_model.py#L65

  1. Should be a method of every PreTrainedModel: https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.resize_token_embeddings

  2. The reason would be to let the model's CLS token and the one used for the contrastive loss be different. And adding this through the HF tokenizer seems more difficult to make stable.
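
On point 1, the usual Hugging Face pattern is to grow the embedding matrix after adding tokens; a hedged example (the checkpoint and the added token name are placeholders):

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

# add a hypothetical special token, then resize the input embeddings to match
num_added = tokenizer.add_special_tokens({"additional_special_tokens": ["<cls_contrastive>"]})
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))  # inherited from transformers.PreTrainedModel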

gpucce commented on Dec 28 '22

@gpucce Re point 2: I think we actually want the CLS tokens to be the same. The idea would be to tune the CLS output embedding so that it's useful contrastively vs. just learning a completely new one. I realize this raises the issue of transformers that don't use CLS pooling, but I don't see a problem with diverging from CoCa a bit here, i.e. allowing for different pooling methods (like averaging) and using that as the text embedding for contrastive learning. To stay consistent with the paper, we can make CLS pooling the pooling method for the default configuration.
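
For context, a rough sketch of what "allowing for different pooling methods" can look like (simplified relative to the poolers in open_clip's hf_model.py):

import torch.nn as nn

class ClsPooler(nn.Module):
    """Take the hidden state at the CLS position (assumed index 0 here)."""
    def __init__(self, cls_token_position: int = 0):
        super().__init__()
        self.cls_token_position = cls_token_position

    def forward(self, last_hidden_state, attention_mask):
        return last_hidden_state[:, self.cls_token_position, :]

class MeanPooler(nn.Module):
    """Average hidden states over non-padding positions."""
    def forward(self, last_hidden_state, attention_mask):
        mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
        return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)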

iejMac commented on Dec 29 '22

Let me know what you think and also if you're busy. I can get to it myself actually in case you are

iejMac commented on Dec 29 '22

Let me know what you think and also if you're busy. I can get to it myself actually in case you are

Hi, I am getting to it now and I agree with you; I will remove the embed_cls option. Another issue might then be that what is fed to the generative side might not have any <end_of_text> token.

gpucce commented on Dec 29 '22

@iejMac would it be bad to add the output_tokens option to the poolers themselves?

gpucce commented on Dec 29 '22

Why do we want to add it to the poolers? Can't we just use the poolers to get the pooled embedding but also return the tokens on the side?

iejMac commented on Dec 29 '22

Why do we want to add it to the poolers? Can't we just use the poolers to get the pooled embedding but also return the tokens on the side?

I was thinking that with CLS pooling the CLS embedding should not be in the returned tokens, while with mean/max pooling maybe it should? We could also just pick from the second token on for all of them.

gpucce commented on Dec 29 '22

Ah, do you mean that for CLS pooling the tokens are x[1:] whereas for mean/max it's just x? If so, I still think maybe just keep the poolers doing only pooling, and then we can add some condition like

if isinstance(self.pooler, ClsPooler):
    toks = out.last_hidden_state[:, 1:, :]  # skip the CLS position

or something like that, I forgot the exact syntax

iejMac commented on Dec 29 '22

Hmm it would be nice to test this ClsPooler... do you know a single text encoder on huggingface that uses CLS pooling? Maybe BERT?

iejMac commented on Dec 29 '22

Hmm it would be nice to test this ClsPooler... do you know a single text encoder on huggingface that uses CLS pooling? Maybe BERT?

I am trying with RoBERTa, setting pooler_type to cls_pooler in the config, and it seems to work. However, the issue I am having now is that this way the tokens are 1 shorter than the labels and so the captioning loss errors; I am trying to find a simple way to adjust for it.

EDIT: the test I am doing is `(x.last_hidden_state[:, self.cls_token_position + 1, :] == tokens[:, self.cls_token_position, :]).all()`

gpucce commented on Dec 29 '22

Yeah, I guess with CLS pooling you just need to adjust the labels to also be shorter.
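
A sketch of that adjustment (which end of the labels to drop depends on where the CLS token sits, so treat the slicing as illustrative):

import torch.nn.functional as F

def caption_loss(token_logits, labels):
    # token_logits: [B, L-1, V] when CLS pooling removes one position
    # labels:       [B, L] token ids, one longer than the logits
    labels = labels[:, :token_logits.shape[1]]  # truncate the labels to match
    return F.cross_entropy(
        token_logits.permute(0, 2, 1),          # [B, V, L-1], the layout cross_entropy expects
        labels,
    )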

iejMac commented on Dec 29 '22

Ok so let's summarize this PR a bit. The tasks (from the TODO list in the coca -> main PR) that this PR resolves are:

  1. Make CoCa use the visual encoder and text encoder of open_clip fully
  2. Revert changes to Attention / CustomResidualAttentionBlock; we should avoid using them with the CoCa model
  3. Move all models to output dicts instead of tuples; this makes it more flexible to pass outputs through optional losses / filters without brittle indexing and tuple-length checks that are bound to fail with one more addition. Could make the dict output optional at construction time to prevent possible breaks for other downstream users.

Some notes:

  1. Cleans up the copied inference code from the coca branch and also adds compatibility with our HFTextEncoder class. We allow building CoCa architectures that differ from the originally suggested CLS pooling simply by using text encoders with different pooling methods.
  2. Adds the option to use CLS pooling for the TextTransformer class

iejMac commented on Dec 29 '22

@gpucce any notes from you?

iejMac commented on Dec 29 '22

@gpucce any notes from you?

Nothing more, writing down what I think I missed:

  • make the model dual tower: I think I misunderstood the point, and the idea would be to remove self.multimodal_decoder and move it within self.text (maybe)
  • log clip/caption losses separately

gpucce commented on Dec 29 '22

Oh, and one last thing: this breaks .generate. It should be easy to fix, but since there is also another PR working on it I would keep it as is for now.

gpucce commented on Dec 29 '22

https://wandb.ai/iejmac/open-clip/reports/CoCa-ViT-B-32-v2--VmlldzozMjM0Nzg1 running this for a few epochs to see if the loss roughly matches up

iejMac commented on Dec 29 '22

@gpucce something looks wrong, the new loss is much higher. [screenshot: loss-curve comparison]

iejMac commented on Dec 29 '22

@gpucce do you think we could log the losses separately in this PR? It would be much nicer for testing

iejMac commented on Dec 29 '22

@iejMac sure, I will add it here shortly. I have an idea of what the issue is too; later I will try it a bit and then write here what I find.

gpucce commented on Dec 29 '22

@iejMac this should log them separately

gpucce commented on Dec 29 '22

@gpucce we should probably also log the total loss; since it's weighted, it might contain important info.
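
A minimal sketch of what that could look like (the weight values and key names are placeholders, not necessarily what the PR uses):

import torch

def combine_losses(contrastive_loss: torch.Tensor,
                   caption_loss: torch.Tensor,
                   clip_loss_weight: float = 1.0,
                   caption_loss_weight: float = 2.0) -> dict:
    """Return every component so the train loop can log each one to wandb."""
    total = clip_loss_weight * contrastive_loss + caption_loss_weight * caption_loss
    return {
        "contrastive_loss": contrastive_loss,
        "caption_loss": caption_loss,
        "loss": total,  # the weighted sum that is actually backpropagated
    }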

iejMac commented on Dec 29 '22

@iejMac my idea about the issue seems to be wrong; another possibility is that, since pad tokens are now ignored in the loss, it looks higher. Tomorrow I will try and check for other things. I saw your run seems to be doing OK; if you get any evaluation on a checkpoint please share, and sorry for taking a lot of time today.
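
To illustrate the pad-token point: once padding is excluded from the mean, the easy pad-position predictions no longer dilute the average, so the reported caption loss sits higher even if the model is no worse (pad_id below is an assumed placeholder):

import torch.nn as nn

pad_id = 0  # whatever id the tokenizer actually uses for padding
loss_over_all_positions = nn.CrossEntropyLoss()                   # averages over every position, padding included
loss_over_real_tokens = nn.CrossEntropyLoss(ignore_index=pad_id)  # pad positions excluded from the mean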

gpucce commented on Dec 29 '22

it's quite possible the loss will go down a little later

rom1504 commented on Dec 29 '22

@gpucce if you look at the Wandb it looks like this PR consistently achieves a much higher loss and slightly lower zero-shot ImageNet performance than our previous version (in the coca branch). Looking over the code again, do you have any idea if this is expected and, if not, what could be wrong? It looks like the captioning loss is pretty high; too bad we can't compare to the previous run.

iejMac commented on Jan 02 '23

@iejMac sorry I didn't have time to work on this sooner and also for wasting some compute:

  • the higher loss is due to now ignoring pad tokens, and indeed performance is lower but close to the same
  • the small decrease in performance is my mistake: I added a projection that doesn't make sense, and I am confident a small commit should fix it

Before fixing this and making the model not loadable again, I added a PR to CLIP_benchmark (https://github.com/LAION-AI/CLIP_benchmark/pull/63) that works with both the coca branch and this one at this stage. I tested the old checkpoint and I get scores like the following:

Bleu_1: 0.225, Bleu_2: 0.086, Bleu_3: 0.037, Bleu_4: 0.017, METEOR: 0.064, ROUGE_L: 0.169, CIDEr: 0.089, SPICE: 0.028

This is not really a good validation, but it should be enough to check whether the model's generative part is training; running on an untrained model gets all zeros. If you have time, could you try running this with the old checkpoint as I did, and with the one from the shorter run that finished a couple of days ago if you have one, so we know that generation is still working?

EDIT: Or of course share the checkpoint and I'll do it.

gpucce commented on Jan 04 '23