open_clip
use `TextEncoder` in coca `encode_image`
This PR adds the possibility to use a CLS token as a model parameter in TextEncoder to simplify the logic of CoCa
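A minimal sketch of the idea, assuming a learnable parameter (called cls_emb here; the name and shapes are assumptions, not the final API) appended to the token embeddings inside the text tower:

```python
import torch
from torch import nn

class TextTowerSketch(nn.Module):
    """Simplified text tower with the CLS token as a model parameter (illustrative only)."""

    def __init__(self, vocab_size: int = 49408, width: int = 512):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, width)
        self.cls_emb = nn.Parameter(torch.empty(width))  # learned CLS token
        nn.init.normal_(self.cls_emb, std=0.01)
        self.ln_final = nn.LayerNorm(width)

    def forward(self, text: torch.Tensor):
        x = self.token_embedding(text)                # [batch, seq, width]
        cls = self.cls_emb.expand(x.shape[0], 1, -1)  # one CLS token per sample
        x = torch.cat([x, cls], dim=1)                # append CLS at the end of the sequence
        # ... transformer blocks would run here ...
        x = self.ln_final(x)
        pooled, tokens = x[:, -1], x[:, :-1]          # CLS output vs. per-token outputs
        return pooled, tokens
```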
@rom1504 please don't merge this yet
@rom1504 this should be ready. Essentially it addresses the points that were raised in the old PR and, as a consequence, adds the possibility for the text-only encoder to be a Hugging Face one; making the multimodal one also available as HF is more difficult and I would do that in a different PR. As usual, reviews, comments and anything else are more than welcome.
Would appreciate your review @rwightman if you think anything big needs to be done.
Made some code review comments, most important points:
- Make CoCa model dual tower (.visual, .text) from the start so builtin and HF text transformer models are the same, no bwd compat to be concerned about
- Revert changes to Attention / CustomResidualAttentionBlock, we should avoid using them with the CoCa model
- Move all models to output dicts instead of tuples; this makes it more flexible to pass outputs through optional losses / filters without brittle indexing and tuple-length checks that are bound to fail with one more addition. Could make the dict output optional at construction time to prevent possible breaks for other downstream users.
and then test test test.
Originally posted by @rwightman in https://github.com/mlfoundations/open_clip/issues/256#issuecomment-1356891072
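Regarding the third point above (output dicts), a minimal sketch of how the opt-in could look; the output_dict flag and key names here are assumptions, not the merged API:

```python
import torch
from torch import nn

class DictOutputSketch(nn.Module):
    """Dict output is opt-in at construction, so downstream users that unpack tuples don't break."""

    def __init__(self, output_dict: bool = False):
        super().__init__()
        self.output_dict = output_dict
        self.logit_scale = nn.Parameter(torch.ones([]) * 2.659)

    def forward(self, image_features: torch.Tensor, text_features: torch.Tensor):
        if self.output_dict:
            return {
                "image_features": image_features,
                "text_features": text_features,
                "logit_scale": self.logit_scale.exp(),
            }
        return image_features, text_features, self.logit_scale.exp()
```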
Hey, nice work! Left some very minor comments but I still need to look at the HF stuff in more detail. I'll do that later
Thanks for looking into it! For HF I didn't add a new Pooler but can do it if you think it is better
For HF stuff from my quick look I have 2 concerns:
- Do all HF models have resize_token_embeddings?
- What is the point of embed_cls? We already have a CLS pooler: https://github.com/gpucce/open_clip/blob/0e46a0199d6aaae69dc5c5089a62e8fe3b9d26ab/src/open_clip/hf_model.py#L65
Re the two concerns:
- resize_token_embeddings should be a method of every PreTrainedModel: https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.resize_token_embeddings
- The reason for embed_cls would be to let the model CLS token and the one used for the contrastive loss be different, and adding this through the HF tokenizer seems more difficult to make stable.
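For the first concern, resize_token_embeddings is defined on transformers.PreTrainedModel, so it should be available on any HF text model once extra special tokens are added; a quick sanity check (the model name and the new token below are just examples):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

# add a hypothetical extra special token, then grow the embedding matrix to match
tokenizer.add_special_tokens({"additional_special_tokens": ["<cls_cap>"]})
model.resize_token_embeddings(len(tokenizer))

assert model.get_input_embeddings().num_embeddings == len(tokenizer)
```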
@gpucce Re point 2: I think we actually want the CLS tokens to be the same. The idea would be to tune the CLS output embedding so that it's useful contrastively vs. just learning a completely new one. I realize this raises the issue of transformers that don't use CLS pooling, but I don't see a problem with diverging from CoCa a bit here, i.e. allowing for different pooling methods (like averaging) and using that as the text embedding for contrastive learning. To stay consistent with the paper we can make CLS pooling the pooling method for the default configuration.
Let me know what you think and also if you're busy. I can get to it myself actually in case you are
Hi, I am getting to it now and I agree with you, I will remove the embed_cls option. Another issue might then be that what is fed to the generative side might not have any <end_of_text> token.
@iejMac would it be bad to add the output_tokens option to the poolers themselves?
Why do we want to add it to the poolers? Can't we just use the poolers to get the pooled embedding but also return the tokens on the side?
I was thinking that with CLS pooling the CLS embedding should not be in the returned tokens, while with mean/max pooling maybe it should? We could also pick from the second token on for all of them.
Ah, do you mean that for CLS pooling the tokens are x[1:] whereas for mean/max it's just x? If so I still think maybe just keep poolers doing only pooling, and then we can add some condition like

```python
if isinstance(pooler, ClsPooler):
    toks = hidden_state[:, 1:]
```

or something like that, I forgot the exact syntax
Hmm it would be nice to test this ClsPooler... do you know a single text encoder on huggingface that uses CLS pooling? Maybe BERT?
I am trying with Roberta, setting pooler_type to cls_pooler in the config, and it seems to work. However, the issue I am having now is that this way the tokens are 1 shorter than the labels, so the captioning loss errors; I'm trying to find a simple way to adjust for it.
EDIT: the test I am doing is `(x.last_hidden_state[:, self.cls_token_position+1, :] == tokens[:, self.cls_token_position, :]).all()`
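For reference, a hedged example of the kind of model config being described here; the layout follows open_clip's HF text configs, but the exact field values are assumptions:

```json
{
  "embed_dim": 512,
  "vision_cfg": {
    "image_size": 224,
    "layers": 12,
    "width": 768,
    "patch_size": 32
  },
  "text_cfg": {
    "hf_model_name": "roberta-base",
    "hf_tokenizer_name": "roberta-base",
    "proj": "linear",
    "pooler_type": "cls_pooler"
  }
}
```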
yeah I guess with CLS pooling you just need to adjust labels to also be shorter
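A minimal sketch of what adjusting the labels could look like in the captioning loss; the tensor names and the exact slicing are assumptions:

```python
import torch.nn.functional as F

def caption_loss_sketch(token_logits, text, pad_id=0):
    # token_logits: [batch, seq_len - 1, vocab], one position shorter because the CLS slot was dropped
    # text:         [batch, seq_len], the input ids used as labels
    labels = text[:, 1:]  # trim labels so they line up with the shorter token sequence
    return F.cross_entropy(
        token_logits.permute(0, 2, 1),  # cross_entropy expects [batch, vocab, seq]
        labels,
        ignore_index=pad_id,            # don't penalize padding positions
    )
```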
Ok so let's summarize this PR a bit. The tasks (from the TODO list in the coca -> main PR) that this PR resolves are:
- Make CoCa fully use the visual encoder and text encoder of open_clip
- Revert changes to Attention / CustomResidualAttentionBlock, we should avoid using them with the CoCa model
- Move all models to output dicts instead of tuples; this makes it more flexible to pass outputs through optional losses / filters without brittle indexing and tuple-length checks that are bound to fail with one more addition. Could make the dict output optional at construction time to prevent possible breaks for other downstream users.
Some notes:
- Cleans up the inference code copied from the coca branch and also adds compatibility with our HFTextEncoder class. We allow building CoCa architectures that differ from the CLS pooling suggested in the original paper simply by using text encoders with different pooling methods.
- Adds the option to use CLS pooling for the TextTransformer class
@gpucce any notes from you?
Nothing more, writing down what I think I missed:
- make the model dual tower: I think I misunderstood the point, and the idea would be to remove `self.multimodal_decoder` and move it within `self.text` (maybe)
- log clip/caption losses separately
Oh, and one last thing: this breaks `.generate`. It should be easy to fix, but since there is also another PR working on it I would keep it as is for now.
https://wandb.ai/iejmac/open-clip/reports/CoCa-ViT-B-32-v2--VmlldzozMjM0Nzg1 running this for a few epochs to see if the loss roughly matches up
@gpucce something looks wrong, new loss is much higher
@gpucce do you think we could log the losses separately in this PR? It would be much nicer for testing
@iejMac sure, I will add it here shortly. I have an idea of what the issue is too; later I will try it a bit and then write here what I find.
@iejMac this should log them separately
@gpucce we should probably also log the total loss since it's weighted, so that might contain important info
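A minimal sketch of returning the pieces separately so that both the individual losses and the weighted total can be logged; the key names and weights here are assumptions:

```python
def coca_losses_sketch(clip_loss, caption_loss, clip_loss_weight=1.0, caption_loss_weight=2.0):
    losses = {
        "contrastive_loss": clip_loss_weight * clip_loss,
        "caption_loss": caption_loss_weight * caption_loss,
    }
    losses["total_loss"] = sum(losses.values())  # weighted total, logged alongside the parts
    return losses
```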
@iejMac my idea about the issue seems to be wrong. Another thing could be that, since pad tokens are now ignored in the loss, it looks higher. Tomorrow I will try to check for other things. I saw your run seems to be doing ok; if you get any evaluation on a checkpoint please share, and sorry for taking a lot of time today.
it's quite possible the loss will go down a little later
@gpucce if you look at the Wandb it looks like this PR consistently achieves a much higher loss and slightly lower zero-shot ImageNet performance than our previous version (in the coca branch). Looking over the code again, do you have any idea if this is expected and, if not, what could be wrong? It looks like the captioning loss is pretty high; too bad we can't compare to the previous run.
@iejMac sorry I didn't have time to work on this sooner and also for wasting some compute:
- the higher loss is due to pad tokens now being ignored; performance is indeed lower but close to the same
- the small decrease in performance is my mistake, I added a projection that doesn't make sense and I am confident a small commit should fix it
Before fixing this (and making the model not loadable again), I added a PR to CLIP_benchmark (https://github.com/LAION-AI/CLIP_benchmark/pull/63) that works with both the coca branch and this one at this stage. I tested the old checkpoint and I get scores like the following:
Bleu_1: 0.225, Bleu_2: 0.086, Bleu_3: 0.037, Bleu_4: 0.017, METEOR: 0.064, ROUGE_L: 0.169, CIDEr: 0.089, SPICE: 0.028
This is not really a good validation, but it should be enough to check whether the generative part of the model is training; running it on an untrained model gets all zeros. If you have time, could you try running this with the old checkpoint as I did, and with the one from the shorter run that finished a couple of days ago if you have one, so we can know that generation is still working?
EDIT: Or of course share the checkpoint and I do it