Add interpolation of position encodings to BLIP-2
Feature request
The ViT model in Hugging Face Transformers supports fine-tuning at a different image resolution via the interpolate_pos_encoding argument (https://huggingface.co/docs/transformers/model_doc/vit#transformers.ViTModel.forward.interpolate_pos_encoding), while the newly implemented BLIP-2 model does not. I would like to add the same feature, following the ViT implementation.
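For reference, a minimal usage sketch of the existing ViT feature (the checkpoint name is a standard one; the image path is illustrative):

```python
from PIL import Image
import torch
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTModel.from_pretrained("google/vit-base-patch16-224")

image = Image.open("example.jpg")  # hypothetical local image
# Preprocess at a resolution different from the 224x224 used in pretraining.
inputs = processor(images=image, size={"height": 384, "width": 384}, return_tensors="pt")

with torch.no_grad():
    # interpolate_pos_encoding=True resizes the learned position embeddings
    # to match the new number of patches instead of raising a shape mismatch.
    outputs = model(**inputs, interpolate_pos_encoding=True)

print(outputs.last_hidden_state.shape)  # (1, 1 + (384 // 16) ** 2, 768) = (1, 577, 768)
```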
Motivation
I was experimenting with the model to see whether a different (mainly higher) input image resolution helps downstream tasks.
(Curious to get feedback on whether this feature is needed, given the goal of keeping the code simple.)
Your contribution
It's mostly copying & pasting interpolate_pos_encoding from the ViT implementation. I have working code ready and can open a PR for review (and will address any bugs).
It would be good if the CLIP-pretrained model had interpolate_pos_encoding like ViT.
@amyeroberts Shall I open a pull request? Have one handy.
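For context, a minimal standalone sketch of the kind of helper involved, adapted from the ViT implementation (the function signature is illustrative; an actual PR would wire this into BLIP-2's vision embeddings, whose exact attribute names may differ):

```python
import math

import torch
import torch.nn as nn


def interpolate_pos_encoding(
    position_embeddings: torch.Tensor,  # (1, 1 + num_positions, dim), CLS position first
    height: int,
    width: int,
    patch_size: int,
) -> torch.Tensor:
    """Bicubically resize pretrained patch position embeddings to a new image resolution."""
    num_positions = position_embeddings.shape[1] - 1
    dim = position_embeddings.shape[-1]
    class_pos_embed = position_embeddings[:, :1]
    patch_pos_embed = position_embeddings[:, 1:]

    # Target patch grid for the new resolution.
    h0, w0 = height // patch_size, width // patch_size
    if h0 * w0 == num_positions and height == width:
        return position_embeddings  # nothing to interpolate

    # Reshape the flat patch embeddings back into a 2D grid and resize it.
    grid = int(math.sqrt(num_positions))
    patch_pos_embed = patch_pos_embed.reshape(1, grid, grid, dim).permute(0, 3, 1, 2)
    patch_pos_embed = nn.functional.interpolate(
        patch_pos_embed, size=(h0, w0), mode="bicubic", align_corners=False
    )
    patch_pos_embed = patch_pos_embed.permute(0, 2, 3, 1).reshape(1, h0 * w0, dim)

    return torch.cat((class_pos_embed, patch_pos_embed), dim=1)
```

The idea is the same as in ViT: keep the CLS position embedding as-is and bicubically resize the patch grid, so a checkpoint pretrained at one resolution can be fine-tuned or run at another.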
Hi @akkikiki, thanks for opening this issue!
interpolate_pos_encoding was added to the ViT model to enable cross-loading of DINO weights into the architecture. In general, we try to keep the forward passes of the models as simple as possible (few if/else branches). As such, it's not something that we'll be adding to the model at the moment. Let's keep this issue open; if there are many requests for it from the community (I'll measure with 👍 on your issue description), then we can revisit.
If you have your own fork with these changes, feel free to share here so others can see and benefit from your work.
Sounds good! +1 to following the "keep it simple (and stupid)" principle.