[BLIP2] The number of layers in ViT-L is confusing.
Thank you for this amazing work and for releasing the code.
According to the description in your paper, the ViT-L used in BLIP2 should have 23 layers:
> For the frozen image encoder, we explore two state-of-the-art pre-trained vision transformer models: (1) ViT-L/14 from CLIP (Radford et al., 2021) and (2) ViT-G/14 from EVA-CLIP (Fang et al., 2022). We remove the last layer of the ViT and use the second last layer's output features, which leads to slightly better performance.
However, the number of layers is set to 22 in the code, where it should be 23:
https://github.com/salesforce/LAVIS/blob/36d1e988caa977c96d4fe237a61845ec5053b3df/lavis/models/clip_vit.py#L233-L240
What confuses me even more is that another layer is removed while building the model, so only 21 layers are actually in use:
https://github.com/salesforce/LAVIS/blob/36d1e988caa977c96d4fe237a61845ec5053b3df/lavis/models/clip_vit.py#L162-L167
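To make the discrepancy concrete, here is a minimal sketch of the arithmetic as I understand it (the variable names are mine, not from the LAVIS code). CLIP ViT-L/14 has 24 transformer blocks, so removing the last layer as the paper describes should leave 23:

```python
# Layer-count arithmetic for BLIP2's frozen ViT-L image encoder.
# Names here are illustrative, not taken from the LAVIS codebase.

TOTAL_VIT_L_LAYERS = 24  # CLIP ViT-L/14 has 24 transformer blocks

# What the paper describes: drop the final block, use the second-to-last
# layer's output features -> 23 blocks should remain.
paper_layers = TOTAL_VIT_L_LAYERS - 1  # 23

# What the config appears to do: hard-code 22 ...
config_layers = 22

# ... after which model construction drops one more block,
# leaving only 21 blocks actually in use.
built_layers = config_layers - 1  # 21

print(paper_layers, config_layers, built_layers)
```

If my reading of the two linked snippets is right, the model ends up two layers short of what the paper describes.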
Is this a bug?