LAVIS
Use pretrained Q-Former with multiple image resolutions
In the BLIP-2 paper, it is stated that "[Q-Former] extracts a fixed number of output features from the image encoder, independent of input image resolution."
However, when using cross-attention, this doesn't seem possible, since the cross-attention layers are built with a fixed `encoder_width` (the hidden dimension of the frozen image encoder).
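To illustrate what I mean, here is a toy single-head cross-attention sketch (names and shapes are illustrative, not actual LAVIS code). The key/value projections are sized by `encoder_width`, i.e. the channel dimension of the image features, while the number of image tokens only shows up as a variable sequence length:

```python
import numpy as np

def cross_attention(queries, image_feats, w_q, w_k, w_v):
    """Toy single-head cross-attention.

    queries:     (num_query_tokens, hidden_dim)   -- learned Q-Former queries
    image_feats: (num_image_tokens, encoder_width) -- frozen encoder output
    w_q: (hidden_dim, d); w_k, w_v: (encoder_width, d)
    """
    q = queries @ w_q        # (num_query_tokens, d)
    k = image_feats @ w_k    # (num_image_tokens, d)
    v = image_feats @ w_v    # (num_image_tokens, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # softmax over the image tokens
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    # output shape depends only on the number of query tokens
    return attn @ v          # (num_query_tokens, d)
```

So a varying *number* of image tokens is fine, but the projections `w_k`/`w_v` break as soon as the *channel* dimension of the encoder output changes.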
I want to use Q-Former with a Pyramid Vision Transformer (PVT) as the frozen image encoder. PVT handles multiple image resolutions and does not produce a fixed feature resolution. Is there a way to use cross-attention in that case?