Does a BLIP w/ ViT-L and CapFilt-L model for image captioning exist?

Open 4thfever opened this issue 3 years ago • 2 comments

Hi,

First, I would like to thank you for your great work, which has inspired me a lot.

I would like to know whether a BLIP w/ ViT-L + CapFilt-L model exists (one that uses ViT-Large as the encoder and CapFilt for data augmentation). I believe it should be stronger than both BLIP w/ ViT-B + CapFilt-L and BLIP w/ ViT-L.

Thanks

Dec 29 '22 08:12 4thfever

Thanks for your question. BLIP w/ ViT-L already uses the CapFilt-L model.

Dec 30 '22 07:12 LiJunnan1992

Thanks!

Dec 30 '22 07:12 4thfever