LAVIS
BLIP and BLIP-2 model details
Great work! Thanks for releasing the models, pre-trained checkpoints, and examples.
I have a couple of questions about the BLIP and BLIP-2 models. I couldn't find this information in the papers or the repo, but apologies if I've missed it.
- For the 'base' BLIP feature extractor:
  a. How many images were used during pretraining (is it the 14M or the 129M version)?
  b. Is this the ViT-B/16 vision transformer version?
- For the 'pretrain' and 'coco' BLIP-2 feature extractors:
  a. Which image encoder was used? (I'm assuming ViT-L or ViT-G.)
  b. Roughly how many images were used during the COCO fine-tuning?
Thanks!