
[BLIP2] The number of layers in ViT-L is confusing.

Open · Kamino666 opened this issue 1 year ago · 1 comment

Thank you for this amazing work and for releasing the code.

According to the description in your paper, the ViT-L used in BLIP-2 should have 23 layers.

For the frozen image encoder, we explore two state-of-the-art pre-trained vision transformer models: (1) ViT-L/14 from CLIP (Radford et al., 2021) and (2) ViT-G/14 from EVA-CLIP (Fang et al., 2022). We remove the last layer of the ViT and uses the second last layer’s output features, which leads to slightly better performance

However, it is set to 22 layers in the code, when it should be 23.

https://github.com/salesforce/LAVIS/blob/36d1e988caa977c96d4fe237a61845ec5053b3df/lavis/models/clip_vit.py#L233-L240

What confuses me even more is that another layer is removed while building the model, so there are actually only 21 layers in use.

https://github.com/salesforce/LAVIS/blob/36d1e988caa977c96d4fe237a61845ec5053b3df/lavis/models/clip_vit.py#L162-L167
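For reference, here is a minimal sketch of how the blocks that actually end up in the model could be counted. The `create_clip_vit_L` factory and the `transformer.resblocks` attribute are what I assume from the linked `clip_vit.py`; the exact names and signature may differ in this commit.

```python
# Minimal sketch: count the residual attention blocks that actually end up in
# the built ViT-L. Assumes LAVIS is installed and that lavis/models/clip_vit.py
# exposes a create_clip_vit_L factory whose VisionTransformer keeps its blocks
# in model.transformer.resblocks (both names are assumptions, not verified
# against this exact commit).
from lavis.models.clip_vit import create_clip_vit_L

model = create_clip_vit_L(img_size=224)

# ViT-L/14 normally has 24 blocks; the paper says the last one is dropped (23),
# but the config in clip_vit.py sets layers=22 and the build step seems to
# remove one more, so I would expect this to print 21.
print(len(model.transformer.resblocks))
```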

Is this a bug?

Kamino666 · Mar 24 '23 07:03