adding CoCa
The PR idea is to add the CoCa model as implemented in https://github.com/lucidrains/CoCa-pytorch, using existing parts as much as possible.
Ideally, it would also add the possibility to choose between the custom and non-custom Attention implementations, as is done for CLIP.
It would be best to see if it's possible to unify both the tower support and the losses with the existing code, so we can train with a variety of text towers and benefit from the existing efficient loss implementations without too much duplicated code.
Sure, I will try to reuse as much as I can. For now it is mostly copied from the coca-pytorch repo; I will probably ask for some help as I move on :)
@rom1504 I will reuse the visual_model from open_clip. However, in coca-pytorch the transformer layers for the text model are different from the regular ones: the feed-forward and attention are run in parallel. Would you prefer it like that, or regular layers? I have no idea how much difference it makes.
Even if I use the regular attention, I think the current implementation doesn't allow cross-attention. Would you prefer a separate CrossAttention layer, or adding the cross-attention capability to the regular attention with kwargs to the forward?
I think let's bring options into the current text model so it supports CoCa:
- Parallel feed-forward and self-attention can be added as an option. I think it won't make a huge difference for these relatively small models; GPT-J and PaLM used this mechanism to improve speed at large scale.
- Cross-attention is indeed the important feature. I think bringing it to the existing model would be great; that should be possible for our text attention implementation. For HF encoders there's a chance it's already implemented and we can just use theirs. (A rough sketch of both options is below.)
Thanks for working on this!
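Roughly what I mean, as an illustrative sketch only (names and structure are assumptions, not the existing open_clip classes):

```python
# Illustrative sketch of (a) parallel attention + feed-forward and (b) an
# optional cross-attention step in a residual block. Not the actual open_clip
# code; names are placeholders.
import torch
import torch.nn as nn


class ResidualBlockSketch(nn.Module):
    def __init__(self, width, heads, parallel=False, cross_attn=False):
        super().__init__()
        self.parallel = parallel
        self.ln_1 = nn.LayerNorm(width)
        self.attn = nn.MultiheadAttention(width, heads, batch_first=True)
        self.ln_2 = nn.LayerNorm(width)
        self.mlp = nn.Sequential(
            nn.Linear(width, width * 4), nn.GELU(), nn.Linear(width * 4, width)
        )
        # optional cross-attention to another modality (e.g. image tokens)
        self.ln_x = nn.LayerNorm(width) if cross_attn else None
        self.cross_attn = (
            nn.MultiheadAttention(width, heads, batch_first=True) if cross_attn else None
        )

    def forward(self, x, context=None, attn_mask=None):
        y = self.ln_1(x)
        if self.parallel:
            # GPT-J / PaLM style: attention and feed-forward see the same input
            x = x + self.attn(y, y, y, attn_mask=attn_mask, need_weights=False)[0] + self.mlp(y)
        else:
            x = x + self.attn(y, y, y, attn_mask=attn_mask, need_weights=False)[0]
            x = x + self.mlp(self.ln_2(x))
        if self.cross_attn is not None and context is not None:
            q = self.ln_x(x)
            x = x + self.cross_attn(q, context, context, need_weights=False)[0]
        return x
```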
@rom1504 I am moving forward. If you have time, could you have a look at how the cross-attention and decoder are added to the existing models, to see if the integration is going in a reasonable direction?
Hey, posting a summary of what I see in the current PR
- coca cfg: new kind of model
- allows using txt cfg and img cfg on top of coca cfg
- coca class that uses visual and text encoders / similar to clip class
- coca loss: uses clip loss + introduces caption loss
- adds optional cross-attention in existing text transformer
- introduces MultimodalTransformerDecoder
- CoCaVisionTransformer: why do we need this one?
- CoCaTextTransformer: why do we need this one?
I think the direction looks ok.
I would try to focus on:
- not breaking the current use cases: normal clip training and inference should still work exactly the same
- trying to minimize changes and new code as much as possible
- continuing on the integration path you took, much better than duplicating everything
thanks for the work.
@iejMac may help review as well
OK, before I look at it, I'm putting down some of my unbiased thoughts as to how I think we should go about this:
- The only things we should need to add in transformer.py are an attentional pooler and a text decoder that can be either unimodal or multimodal based on what you pass into the forward function (rough pooler sketch after this list)
- A CoCa class that initializes all of its transformers and implements the correct forward passes through each of them
- The encoder models should work with the current code - hf_model, timm_model (but this can be left for the end, once everything else is working in default conditions)
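For context, a minimal sketch of what I mean by an attentional pooler (learnable queries cross-attending to the encoder tokens); the names here are just illustrative, not the open_clip API:

```python
# Minimal attentional-pooler sketch: n_queries learnable vectors cross-attend
# to the encoder output. Illustrative only.
import torch
import torch.nn as nn


class AttentionalPoolerSketch(nn.Module):
    def __init__(self, d_model, n_head=8, n_queries=256):
        super().__init__()
        self.query = nn.Parameter(torch.randn(n_queries, d_model))
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model) tokens from the image (or text) encoder
        k = self.ln(x)
        q = self.query.unsqueeze(0).expand(x.size(0), -1, -1)
        pooled, _ = self.attn(q, k, k, need_weights=False)
        return pooled  # (batch, n_queries, d_model)
```

The contrastive head could use a pooler with a single query while the captioning path keeps more query tokens, which is how CoCa gets task-specific image embeddings.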
Now as for the loss I see two ways of going about this:
- A specific training script and loss function for each algorithm we do (CoCa, FLAVA, EVA)
- We try to generalize a single "task" function which is parameterized by a list of the loss functions you want to use, so for CoCa you'd do something like loss="clip,caption" and for FLAVA loss="clip,MLM,MIM,..." (a rough sketch of this idea follows below)
Not sure which one we try out first for this. Would love to hear some more thoughts on this from everyone.
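Something like this hypothetical sketch is what I have in mind for the second option (ClipLoss exists in open_clip; CaptionLoss and the registry are made up for illustration):

```python
# Hypothetical sketch of a loss parameterized by a comma-separated spec,
# e.g. "clip,caption". The registry contents are made up for illustration.
import torch.nn as nn


class CombinedTaskLoss(nn.Module):
    def __init__(self, loss_spec, registry):
        super().__init__()
        names = [n.strip() for n in loss_spec.split(",")]
        self.losses = nn.ModuleList([registry[n]() for n in names])

    def forward(self, **model_outputs):
        # each sub-loss picks the outputs it needs and returns a scalar
        return sum(loss(**model_outputs) for loss in self.losses)


# e.g. CombinedTaskLoss("clip,caption", {"clip": ClipLoss, "caption": CaptionLoss})
```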
@gpucce I think the current state of the PR is good; however, I see you do a lot of things to make it exactly CoCa, which is good, but it creates some potentially unnecessary additions to open_clip where we could've just used something we already have in the first place. What I recommend doing is going over all of your classes and thinking about which could just be a special case of something that already exists, then taking note of what needs to be added so it actually fits the CoCa paper, and then we can discuss how to best make those additions. So for example I think the AttentionPooler is a must-add, but the CoCaVisionTransformer can probably be made a special case of our VisionTransformer class in transformer.py
@iejMac thanks for the feedback! The reason for those CoCa(Text/Vision)Transformer models was that I think the forward of (Text/Vision)Transformer needs to be changed a bit. However, after thinking about it, it should be better to redefine embed_text and embed_image for CoCa (in CLIP the same is done for embed_text, I think), and doing so I will remove those two classes.
Instead, the AttentionPooler, the MultimodalTransformerDecoder and the CoCaMultimodalTransformer should all three be needed and make sense, maybe renaming the last two to just TransformerDecoder and MultimodalTransformer. An alternative is making only one model out of TransformerDecoder and MultimodalTransformer, but to me that would look a bit different from the rest of the logic of transformer.py; of course I'll make it however you prefer.
About point 3 in the comment above, I am trying to make it like that; I will see if I manage to make it work nicely with all combinations of HF and custom.
And about the training, just to add what I thought about it: I think multiple scripts, even if they will probably look very similar, are better at first, although this may be because I don't really see the level of overlap; I don't know either EVA or FLAVA very well.
I will try to move on in the next couple of days and see where I get!
EDIT:
I didn't see the new comments while I was writing. The reason I didn't add an argument to the model init to make this an option is that I thought it was too much of a special case, but I can also do it that way with an output_all flag.
@gpucce so about the TransformerDecoder and MultimodalTransformer: I agree you need the TransformerDecoder, since currently there are no decoding transformers in open_clip, but can't you write the TransformerDecoder such that the MultimodalTransformerDecoder is just a special case of it - one that is the result of passing in the correct initialization parameters and also the right inputs into forward? Maybe I'm wrong about this, but it seems there are no major architectural differences between the two and it's only about what you use for qkv.
I've thought a bit more about the training situation and I think for now (as you say) we should do a separate script; then we can maybe do another method (EVA or FLAVA), and at that point we can have a look at what all three look like and abstract the commonalities if it makes sense.
Hi @iejMac, I should have done most of it; if you can have a look, it would be highly appreciated!
The one thing I did differently from what you said: in the end, for now, I didn't write a different train script, because I was able to integrate it with very few changes, so I thought I would try that. If you think it is too messy anyway, I will make a new one!
It also looks like it runs on a single machine with 8 A100 GPUs. However, if you could confirm it somehow, that would be really nice; that is the largest setup I can try on.
I will need to revise everything, but an opinion on what needs changing would be nice while I work on that!
Also, I don't know why, but it looks like pytest is not in the environment used by the GitHub CI.
Writing a summary as I review for future reviewers (context window too long for GPTReview 😄):
Most important files: coca_model.py, transformer.py, loss.py - that's where the CoCa stuff is implemented; the rest is just integration with our existing code.
CoCa model
transformer.py:
- adapts Attention and ResidualAttentionBlock to support cross-attention (base functionality should be the same)
- AttentionalPooler: cross-attention with n_query learnable query vectors, used for task-specific attentional pooling (you want a different image embedding for the contrastive task and for the captioning task, so you apply different poolers on top of your image encoder)
- TransformerDecoder: this is currently implemented as a multimodal transformer decoder; the regular (unimodal) transformer decoder is just a standard Transformer, which we already have implemented - we simply don't pool the output. (Rough sketch of the multimodal version below.)
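For readers, a rough sketch of what such a multimodal text decoder does, built on torch's stock decoder layer purely for brevity (the PR uses open_clip's own ResidualAttentionBlock-style blocks; this is illustrative only):

```python
# Rough sketch of a multimodal text decoder: causal self-attention over text
# tokens plus cross-attention to image tokens. Illustrative only.
import torch
import torch.nn as nn


class MultimodalDecoderSketch(nn.Module):
    def __init__(self, width=512, heads=8, layers=6, vocab_size=49408):
        super().__init__()
        block = nn.TransformerDecoderLayer(
            d_model=width, nhead=heads, dim_feedforward=width * 4, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(block, num_layers=layers)
        self.ln_final = nn.LayerNorm(width)
        self.to_logits = nn.Linear(width, vocab_size)

    def forward(self, text_embs, image_embs):
        # text_embs: (batch, text_len, width); image_embs: (batch, n_img_tokens, width)
        seq_len = text_embs.size(1)
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf")), diagonal=1
        )
        x = self.decoder(tgt=text_embs, memory=image_embs, tgt_mask=causal_mask)
        return self.to_logits(self.ln_final(x))  # (batch, text_len, vocab_size)
```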
coca_model.py:
- adds TextDecoderCfg since our MultimodalTransformer (currently TransformerDecoder) has some different params
- creates the CoCa model, which can encode_image and encode_text separately; however, in its forward call it merges the two through multimodal cross-attention and outputs everything needed for training (rough sketch of this composition below)
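The forward composition, very roughly (the tower interfaces, key names, and label handling here are assumptions for illustration, not necessarily what the PR does):

```python
# Very rough sketch: encode each modality, run the multimodal decoder over text
# tokens attending to image tokens, and return everything the losses need.
import torch
import torch.nn as nn


class CoCaForwardSketch(nn.Module):
    def __init__(self, visual, text, multimodal):
        super().__init__()
        self.visual = visual          # image tower -> (pooled latent, image tokens)
        self.text = text              # text tower  -> (pooled latent, token embeddings)
        self.multimodal = multimodal  # multimodal decoder -> caption logits
        self.logit_scale = nn.Parameter(torch.ones([]) * 2.659)  # ~log(1/0.07), as in CLIP

    def forward(self, image, text):
        image_latent, image_tokens = self.visual(image)
        text_latent, token_embs = self.text(text)
        logits = self.multimodal(token_embs, image_tokens)
        return {
            "image_features": image_latent,      # contrastive branch
            "text_features": text_latent,
            "logits": logits,                    # captioning branch
            "labels": text,                      # captioning targets (shifting/masking elided)
            "logit_scale": self.logit_scale.exp(),
        }
```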
loss.py:
- adds a CoCaLoss class which inherits from ClipLoss for the contrastive loss computation (sketch below)
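Roughly like this (sketch only; ClipLoss is the existing open_clip class, but the weight names and exact signatures here are assumptions):

```python
# Sketch of the composition: contrastive (CLIP) loss plus a captioning
# cross-entropy term. Weight names and the exact ClipLoss forward signature
# are assumptions; illustrative only.
import torch.nn.functional as F
from open_clip.loss import ClipLoss


class CoCaLossSketch(ClipLoss):
    def __init__(self, caption_loss_weight=1.0, clip_loss_weight=1.0, **kwargs):
        super().__init__(**kwargs)
        self.caption_loss_weight = caption_loss_weight
        self.clip_loss_weight = clip_loss_weight

    def forward(self, image_features, text_features, logits, labels, logit_scale):
        contrastive = super().forward(image_features, text_features, logit_scale)
        # logits: (batch, seq_len, vocab); cross_entropy wants (batch, vocab, seq_len)
        caption = F.cross_entropy(logits.permute(0, 2, 1), labels)
        return self.clip_loss_weight * contrastive + self.caption_loss_weight * caption
```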
Training
- adds create_loss func to factory.py which selects the loss you want to use during training
- in main.py you now also create your loss function and pass this into train_one_epoch (you do this every epoch)
- in train.py we store all of the model outputs in a single variable and unpack that into the loss class to calculate the loss, without having to know which exact loss function we're using (see the sketch below)
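I.e., something along these lines (names are illustrative, not the exact train.py code):

```python
# Illustrative training step: the model returns everything its loss needs in
# one structure, so this code doesn't depend on which objective is being used.
def train_step(model, loss_fn, images, texts, optimizer):
    optimizer.zero_grad()
    model_out = model(images, texts)   # e.g. dict with features, logits, labels, logit_scale
    total_loss = loss_fn(**model_out)  # loss_fn built by a create_loss-style factory
    total_loss.backward()
    optimizer.step()
    return total_loss.detach()
```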
My thoughts (@gpucce, @rom1504): it's good you implemented the TransformerDecoder stuff, because now we can see that we don't really need a TransformerDecoder. Instead, I say we just rename your TransformerDecoder to MultimodalTransformer, and that will correctly encompass what it does: it's a Transformer that can take two different modalities. That way we can indeed just keep the self.cross_attn in it, because that's what you need for multimodal attention.
As to training, I like this approach: the training loop should work regardless of what loss function you're using. I think we can try this out and maybe add one more method like FLAVA to see if all three fit well in this framework - parameterize your model/loss and the training code just goes with whatever they input/output.
Also, maybe we should rename the model.py file to clip_model.py; it would be less ambiguous if we plan to merge this.
In general this is coming together very nicely, great job!
@iejMac One thing, is the idea that either both tokeniser and model are HF or neither of them are? Or is mixed allowed?
> One thing, is the idea that either both tokeniser and model are HF or neither of them are? Or is mixed allowed?
indeed it should be both together
Could you please add a section in the README on how to train it? I figure adding a few configurations (for, say, B/32 and L/14 at least) would be nice too.
Oh, I see you already added a config. It would be nice to call it something like coca_b_32_roberta_base.json so it's explicit.
OK, looking at the config further, it seems very different from the CLIP configs. Could you add a config looking as much as possible like the CLIP ViT-B/32 one, so we could compare CoCa vs CLIP easily?
And a last comment: I just tried to run it and am getting
TypeError: __init__() got an unexpected keyword argument 'decoder_cfg'
I will do all of the above. About it not running: I was changing some things just now; it should run now, though it is not finished yet.
> OK, looking at the config further, it seems very different from the CLIP configs. Could you add a config looking as much as possible like the CLIP ViT-B/32 one, so we could compare CoCa vs CLIP easily?
Sure. Is it OK if I keep coca_base.json as the reference for the smallest model they train in the CoCa paper, and make a coca_vit_32.json?
I added a test, so we will see directly in GitHub whether it runs or not.
Let's check whether gradient checkpointing is working properly here; the memory usage seems very high for a quick run I did.
OK, yeah, it's not about gradient checkpointing: with a local batch size of 1, it OOMs on an 80GB A100.
> OK, yeah, it's not about gradient checkpointing: with a local batch size of 1, it OOMs on an 80GB A100.
I see it running with the same config as the test (https://wandb.ai/gpucce/open-clip/runs/e0zy499e); I must be doing something wrong, but I can't understand what at the moment.
Ah, I ran it in multi-node mode; could there be something wrong with the distribution?
> Ah, I ran it in multi-node mode; could there be something wrong with the distribution?
Yeah, that could indeed be it. I am not 100% confident about that part and I can't test it; let me try to have a look. Could you try a run with local_loss=True?
EDIT: let me check first anyway
@rom1504 and also in single-node mode I see it running: https://wandb.ai/gpucce/open-clip/runs/1s3zlgri
Is local loss still propagated via CLI args? If yes, I was already using that.
https://github.com/mlfoundations/open_clip/pull/256/files#diff-1a5488b81fc901e492e5cdbb6c1f99fd9618d19909adf226b748674dbafab318R200 looks like yes, so indeed there's an issue with memory
The failing GitHub Actions test seems to be showing the same memory problem.