BLIP
Larger model
Awesome work, thanks for releasing!
Are there any plans to release larger models in the future, such as BLIP-large or BLIP-xxlarge?
Hi, we have released a larger model which uses ViT-L as the vision encoder (the text encoder is still BERT-base). Currently we do not have plans to train models larger than that.
Thanks!
In case you change your mind, we at LAION can provide compute and have 6B as-yet-unreleased image-text pairs, 2.3B of them English. ( https://laion.ai )
We are currently busy preparing the training of CLIP versions, but we could simply scale up the ViT & LM with the existing code and cooperate on pulling off the training.
Btw, here is a colab with pretty impressive captioning results I got with BLIP, by generating many candidate captions and filtering them with CLIP ViT-L & ResNet-50x64: https://colab.research.google.com/drive/1fKxiDMa-9uu1A6XiYjxTbYxSagvbZ8Fb?usp=sharing
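The filtering step in that colab boils down to re-ranking BLIP's candidate captions by their CLIP image-text similarity and keeping the best ones. A minimal sketch of the re-ranking logic (the embeddings and function names here are placeholders, not the notebook's actual code; in practice the vectors would come from CLIP's image and text encoders):

```python
import numpy as np

def rerank_captions(image_emb, caption_embs, captions):
    """Rank candidate captions by cosine similarity between a CLIP-style
    image embedding and the captions' text embeddings, best first."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sims = txt @ img                      # cosine similarity per candidate
    order = np.argsort(-sims)             # indices sorted best-first
    return [captions[i] for i in order], sims[order]

# toy example with hand-made 2-d "embeddings"
ranked, scores = rerank_captions(
    np.array([1.0, 0.0]),
    np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]),
    ["a cat", "a dog", "a pet"],
)
```

With an ensemble (e.g. ViT-L plus ResNet-50x64), one would average the similarity scores from both models before sorting.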
Hi @christophschuhmann, it would be great if we could cooperate to train larger BLIP models with our code and your data & compute. I am very interested in continuing this discussion.
Thanks for the colab, the captions do look nice!
Awesome! :)
We mostly use discord for correspondence. My handle is: spirit-from-germany#1488
Here is an invite link to the server we work on:
https://discord.gg/AAwcPAw894
For the Image captioning and VQA stuff, we use the channel #image-captioning.
Let's chat there :)
Btw, here are some VQA results we recently got with a frozen CLIP ViT-L/14, a frozen GPT-J, and a trained mapping transformer in between:
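The idea behind that setup is to train only a small mapping network that turns a frozen CLIP image embedding into a sequence of "prefix" embeddings the frozen LM can condition on. A rough PyTorch sketch, assuming illustrative dimensions (the actual architecture and sizes used for those results are not specified here):

```python
import torch
import torch.nn as nn

class MappingTransformer(nn.Module):
    """Maps a frozen CLIP image embedding to a sequence of prefix
    embeddings for a frozen language model (e.g. GPT-J).
    All dimensions below are illustrative assumptions."""
    def __init__(self, clip_dim=768, lm_dim=4096, prefix_len=10,
                 n_layers=4, n_heads=8):
        super().__init__()
        self.prefix_len = prefix_len
        # learned query tokens the transformer refines into the prefix
        self.queries = nn.Parameter(torch.randn(prefix_len, lm_dim))
        # lift the CLIP embedding to the LM's hidden width
        self.proj = nn.Linear(clip_dim, lm_dim)
        layer = nn.TransformerEncoderLayer(d_model=lm_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, clip_embed):                           # (b, clip_dim)
        b = clip_embed.size(0)
        img = self.proj(clip_embed).unsqueeze(1)             # (b, 1, lm_dim)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)      # (b, prefix, lm_dim)
        out = self.encoder(torch.cat([img, q], dim=1))       # joint attention
        return out[:, 1:, :]                                 # (b, prefix_len, lm_dim)
```

At training time, only this module's parameters receive gradients; the prefix embeddings are concatenated in front of the question's token embeddings before the frozen LM decodes the answer.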