BLIP
Does a BLIP w/ ViT-L and CapFilt-L model for image captioning exist?
Hi,
First, I would like to thank you for your great work, which has inspired me a lot.
I would like to know whether a BLIP w/ ViT-L + CapFilt-L model exists (ViT-Large as the encoder, with CapFilt used for data augmentation). I believe it should be stronger than both BLIP w/ ViT-B + CapFilt-L and BLIP w/ ViT-L.
Thanks
Thanks for your question. BLIP w/ ViT-L already uses the CapFilt-L model.
Thanks!