LAVIS
Use pretrained Q-Former with multiple image resolutions
In the BLIP-2 paper, it is stated that "[Q-Former] extracts a fixed number of output features from the image encoder, independent of input image resolution."
However, when using cross-attention, this doesn't seem possible, since the cross-attention layers are built with a fixed `encoder_width` (the hidden dimension of the frozen image encoder).
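To illustrate what I mean, here is a toy single-head cross-attention sketch (names and shapes are illustrative, not actual LAVIS code). The key/value projections are sized by `encoder_width`, i.e. the channel dimension of the image features, while the number of image tokens only shows up as a variable sequence length:

```python
import numpy as np

def cross_attention(queries, image_feats, w_q, w_k, w_v):
    """Toy single-head cross-attention.

    queries:     (num_query_tokens, hidden_dim)   -- learned Q-Former queries
    image_feats: (num_image_tokens, encoder_width) -- frozen encoder output
    w_q: (hidden_dim, d); w_k, w_v: (encoder_width, d)
    """
    q = queries @ w_q        # (num_query_tokens, d)
    k = image_feats @ w_k    # (num_image_tokens, d)
    v = image_feats @ w_v    # (num_image_tokens, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # softmax over the image tokens
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    # output shape depends only on the number of query tokens
    return attn @ v          # (num_query_tokens, d)
```

So a varying *number* of image tokens is fine, but the projections `w_k`/`w_v` break as soon as the *channel* dimension of the encoder output changes.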
I want to use Q-Former with a Pyramid Vision Transformer (PVT) as the frozen image encoder. PVT handles multiple image resolutions and does not produce a fixed feature resolution. Is there a way to use cross-attention in that case?