LAVIS
BLIP and BLIP-2 model details
Great work! Thanks for releasing the models, pre-trained checkpoints, and examples.
I have a couple of questions about the BLIP and BLIP-2 models. I couldn't find this information in the papers or the repo, but apologies if I've missed it.
- For the 'base' BLIP feature extractor:
  a. How many images were used during pretraining (is it the 14M or the 129M version)?
  b. Is this the ViT-B/16 vision transformer version?
- For the 'pretrain' and 'coco' BLIP-2 feature extractors:
  a. Which image encoder was used? (I'm assuming ViT-L or ViT-G.)
  b. Roughly how many images were used during the COCO fine-tuning?
Thanks!