[BLIP2] The number of layers in ViT-L is confusing.
Thank you for this amazing work and for releasing the code.
According to the description in your paper, the ViT-L used in BLIP2 should have 23 layers:
> For the frozen image encoder, we explore two state-of-the-art pre-trained vision transformer models: (1) ViT-L/14 from CLIP (Radford et al., 2021) and (2) ViT-G/14 from EVA-CLIP (Fang et al., 2022). We remove the last layer of the ViT and use the second last layer's output features, which leads to slightly better performance.
However, the number of layers is set to 22 in the code, where it should be 23:
https://github.com/salesforce/LAVIS/blob/36d1e988caa977c96d4fe237a61845ec5053b3df/lavis/models/clip_vit.py#L233-L240
What confuses me even more is that another layer is removed while building the model, so only 21 layers are actually in use:
https://github.com/salesforce/LAVIS/blob/36d1e988caa977c96d4fe237a61845ec5053b3df/lavis/models/clip_vit.py#L162-L167
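To make the discrepancy concrete, here is a minimal sketch of the arithmetic as I understand it (the variable names are mine, not from the LAVIS code). CLIP ViT-L/14 has 24 transformer blocks, so removing the last layer as the paper describes should leave 23:

```python
# Layer-count arithmetic for BLIP2's frozen ViT-L image encoder.
# Names here are illustrative, not taken from the LAVIS codebase.

TOTAL_VIT_L_LAYERS = 24  # CLIP ViT-L/14 has 24 transformer blocks

# What the paper describes: drop the final block, use the second-to-last
# layer's output features -> 23 blocks should remain.
paper_layers = TOTAL_VIT_L_LAYERS - 1  # 23

# What the config appears to do: hard-code 22 ...
config_layers = 22

# ... after which model construction drops one more block,
# leaving only 21 blocks actually in use.
built_layers = config_layers - 1  # 21

print(paper_layers, config_layers, built_layers)
```

If my reading of the two linked snippets is right, the model ends up two layers short of what the paper describes.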
Is this a bug?