
Add interpolation of position encodings to BLIP-2

Open · akkikiki opened this issue · 4 comments

Feature request

ViT as implemented in Hugging Face Transformers supports fine-tuning on images of a different resolution via `interpolate_pos_encoding` (https://huggingface.co/docs/transformers/model_doc/vit#transformers.ViTModel.forward.interpolate_pos_encoding), but the newly implemented BLIP-2 model does not. I would like to add the same feature, following the ViT implementation.
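For context, here is a minimal sketch of what the ViT-style interpolation does. It is adapted from the ViT implementation in Transformers; writing it as a standalone function is my own framing (in the library it lives as a method on the embeddings module), and applying it to BLIP-2 is the assumption this issue proposes, since BLIP-2's vision embeddings use the same CLS-token-plus-patch-grid layout:

```python
import math

import torch
import torch.nn as nn


def interpolate_pos_encoding(
    position_embeddings: torch.Tensor,  # (1, num_positions + 1, dim), CLS token first
    height: int,
    width: int,
    patch_size: int,
) -> torch.Tensor:
    """Bicubically resize the patch position embeddings to a new patch grid."""
    num_positions = position_embeddings.shape[1] - 1
    dim = position_embeddings.shape[-1]

    class_pos_embed = position_embeddings[:, :1]  # CLS embedding is kept as-is
    patch_pos_embed = position_embeddings[:, 1:]  # (1, num_positions, dim)

    # Patch grid implied by the new input resolution.
    new_h = height // patch_size
    new_w = width // patch_size
    # Small offset to avoid floating-point rounding in scale_factor
    # (same trick as the ViT/DINO implementations).
    new_h, new_w = new_h + 0.1, new_w + 0.1

    # Reshape the flat embeddings back into their original 2D grid,
    # interpolate, then flatten again.
    orig_size = int(math.sqrt(num_positions))
    patch_pos_embed = patch_pos_embed.reshape(1, orig_size, orig_size, dim).permute(0, 3, 1, 2)
    patch_pos_embed = nn.functional.interpolate(
        patch_pos_embed,
        scale_factor=(new_h / orig_size, new_w / orig_size),
        mode="bicubic",
        align_corners=False,
    )
    patch_pos_embed = patch_pos_embed.permute(0, 2, 3, 1).view(1, -1, dim)
    return torch.cat((class_pos_embed, patch_pos_embed), dim=1)
```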

Motivation

I was experimenting with whether a different (mainly higher) input image resolution helps on downstream tasks.

(Curious to get feedback on whether this feature is worth adding, or better left out for the sake of keeping the code simple.)

Your contribution

It's mostly copying and pasting `interpolate_pos_encoding` from the ViT implementation. I have working code ready and can open a PR to get it reviewed (and address any bugs).
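To illustrate, the proposed API would mirror ViT's forward argument. Note the `interpolate_pos_encoding=True` flag on BLIP-2 below is hypothetical (it is exactly what this issue proposes, not something BLIP-2 accepts today), and the 448x448 size override is just an example:

```python
import requests
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Preprocess at a higher resolution than the 224x224 the checkpoint was trained with.
inputs = processor.image_processor(
    images=image, return_tensors="pt", size={"height": 448, "width": 448}
)

# Hypothetical flag, mirroring ViTModel.forward(interpolate_pos_encoding=True).
with torch.no_grad():
    out = model.generate(**inputs, interpolate_pos_encoding=True)
print(processor.batch_decode(out, skip_special_tokens=True))
```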

akkikiki commented on Mar 28 '23

It would be good if the CLIP-pretrained model had `interpolate_pos_encoding` like ViT.

Ucas-HaoranWei commented on Apr 22 '23

@amyeroberts Shall I open a pull request? Have one handy.

akkikiki commented on Apr 25 '23

Hi @akkikiki, thanks for opening this issue!

`interpolate_pos_encoding` was added to the ViT model to enable cross-loading of DINO weights into the architecture. In general, we try to keep the forward passes of the models as simple as possible (few if/else branches). As such, it's not something we'll be adding to the model at the moment. Let's keep this issue open; if there are many requests for it from the community (I'll measure with 👍 on your issue description), then we can revisit.

If you have your own fork with these changes, feel free to share it here so others can see and benefit from your work.

amyeroberts commented on Apr 26 '23

Sounds good! +1 for following the "keep it simple (and stupid)" principle.

akkikiki commented on Apr 26 '23