BLIP
Larger model
Awesome work, thanks for releasing!
Are there any plans to release larger models in the future, such as BLIP-large or BLIP-xxlarge?
Hi, we have released a larger model which uses ViT-L as the vision encoder (the text encoder is still BERT-base). Currently we do not have plans to train models larger than that.
Thanks!
In case you change your mind, we at LAION can provide compute and have 6B as-yet-unreleased image-text pairs, 2.3B of them English. ( https://laion.ai )
We are currently busy preparing the training of CLIP versions, but we could simply scale up the ViT & LM with the existing code and cooperate on pulling off the training.
Btw, here is a colab with pretty impressive captioning results I got with BLIP, by generating many candidate captions and filtering them with CLIP ViT-L & ResNet-50x64: https://colab.research.google.com/drive/1fKxiDMa-9uu1A6XiYjxTbYxSagvbZ8Fb?usp=sharing
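The filtering step in that colab boils down to re-ranking BLIP's candidate captions by their CLIP image-text similarity and keeping the best ones. A minimal sketch of the re-ranking logic (the embeddings and function names here are placeholders, not the notebook's actual code; in practice the vectors would come from CLIP's image and text encoders):

```python
import numpy as np

def rerank_captions(image_emb, caption_embs, captions):
    """Rank candidate captions by cosine similarity between a CLIP-style
    image embedding and the captions' text embeddings, best first."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sims = txt @ img                      # cosine similarity per candidate
    order = np.argsort(-sims)             # indices sorted best-first
    return [captions[i] for i in order], sims[order]

# toy example with hand-made 2-d "embeddings"
ranked, scores = rerank_captions(
    np.array([1.0, 0.0]),
    np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]),
    ["a cat", "a dog", "a pet"],
)
```

With an ensemble (e.g. ViT-L plus ResNet-50x64), one would average the similarity scores from both models before sorting.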
Hi @christophschuhmann, it would be great if we could cooperate to train larger BLIP models with our code and your data & compute. I am very interested in continuing this discussion.
Thanks for the colab, the captions do look nice!
Awesome! :)
We mostly use discord for correspondence. My handle is: spirit-from-germany#1488
Here is an invite link to the server we work on:
https://discord.gg/AAwcPAw894
For the Image captioning and VQA stuff, we use the channel #image-captioning.
Let's chat there :)
Btw, here are some VQA results we recently got with a frozen CLIP ViT-L/14, a frozen GPT-J, and a trained mapping transformer in between:
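The idea behind that setup is to train only a small mapping network that turns a frozen CLIP image embedding into a sequence of "prefix" embeddings the frozen LM can condition on. A rough PyTorch sketch, assuming illustrative dimensions (the actual architecture and sizes used for those results are not specified here):

```python
import torch
import torch.nn as nn

class MappingTransformer(nn.Module):
    """Maps a frozen CLIP image embedding to a sequence of prefix
    embeddings for a frozen language model (e.g. GPT-J).
    All dimensions below are illustrative assumptions."""
    def __init__(self, clip_dim=768, lm_dim=4096, prefix_len=10,
                 n_layers=4, n_heads=8):
        super().__init__()
        self.prefix_len = prefix_len
        # learned query tokens the transformer refines into the prefix
        self.queries = nn.Parameter(torch.randn(prefix_len, lm_dim))
        # lift the CLIP embedding to the LM's hidden width
        self.proj = nn.Linear(clip_dim, lm_dim)
        layer = nn.TransformerEncoderLayer(d_model=lm_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, clip_embed):                           # (b, clip_dim)
        b = clip_embed.size(0)
        img = self.proj(clip_embed).unsqueeze(1)             # (b, 1, lm_dim)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)      # (b, prefix, lm_dim)
        out = self.encoder(torch.cat([img, q], dim=1))       # joint attention
        return out[:, 1:, :]                                 # (b, prefix_len, lm_dim)
```

At training time, only this module's parameters receive gradients; the prefix embeddings are concatenated in front of the question's token embeddings before the frozen LM decodes the answer.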