LAVIS

Question about the meaning of BLIP2 embeddings

Open Roberyan opened this issue 4 months ago • 0 comments

Hi, I want to know whether the image features and the multimodal features carry positional meaning with respect to the original image.

For example, blip2_feature_extractor produces a tensor of shape (1, 32, 768) for both the image features and the multimodal features. Does each of the 32 positions correspond to the same image patch in both outputs? And do positions 0 to 31 follow the order of the image patches, matching how the vision encoder splits the image?
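
For context, this is roughly how I obtain the two tensors. It is a minimal sketch based on the feature-extraction usage shown in the LAVIS documentation, assuming the "pretrain" model type. The helper name blip2_feature_shapes, the image path, and the caption are placeholders of mine, not part of LAVIS.

```python
def blip2_feature_shapes(image_path, caption, device="cpu"):
    """Return the shapes of the BLIP2 image and multimodal embeddings.

    Hypothetical helper for illustration; image_path and caption are
    placeholders supplied by the caller.
    """
    # Imported lazily so this file can be loaded without LAVIS installed.
    from PIL import Image
    from lavis.models import load_model_and_preprocess

    # Assumption: the "pretrain" checkpoint of blip2_feature_extractor.
    model, vis_processors, txt_processors = load_model_and_preprocess(
        name="blip2_feature_extractor",
        model_type="pretrain",
        is_eval=True,
        device=device,
    )

    # Preprocess one image and one caption into a batch of size 1.
    raw_image = Image.open(image_path).convert("RGB")
    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
    text = txt_processors["eval"](caption)
    sample = {"image": image, "text_input": [text]}

    # Image-only features, then features conditioned on image and text.
    image_out = model.extract_features(sample, mode="image")
    multimodal_out = model.extract_features(sample, mode="multimodal")

    return (
        tuple(image_out.image_embeds.shape),
        tuple(multimodal_out.multimodal_embeds.shape),
    )
```

Calling it on one image and caption, for example blip2_feature_shapes("example.jpg", "a photo of a cat"), gives me (1, 32, 768) for both outputs, which is what prompted the question about the 32 positions.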

Roberyan, Oct 16 '24 06:10