Community contribution: enable dynamic resolution input for more vision models.
Feature request
Some of our models interpolate their positional embeddings, enabling pretrained checkpoints to be used on different input resolutions. For example, here in ViT.
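As a rough illustration of what this enables (a minimal sketch; the checkpoint and input size below are just illustrative, not taken from this issue):

```python
import torch
from transformers import ViTForImageClassification

# Checkpoint pretrained on 224x224 images
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

# Feed a larger image than the pretraining resolution (480 is divisible by the 16px patch size)
pixel_values = torch.randn(1, 3, 480, 480)

# The position embeddings are interpolated on the fly to match the new number of patches
with torch.no_grad():
    outputs = model(pixel_values, interpolate_pos_encoding=True)

print(outputs.logits.shape)  # torch.Size([1, 1000])
```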
- [ ] beit & data2vec
- [ ] blip, blip_2 @zafstojano #30722
- [ ] clip and clip related models: altclip, bridgetower, chinese_clip, git, kosmos2
- [ ] deit - possible PR in #30649 (to be split up)
- [ ] detr, conditional_detr, deformable_detr, table_transformer
- [ ] grounding_dino
- [ ] owlvit, owlv2 @yMayanand
- [ ] perceiver
- [ ] siglip @davidgxue https://github.com/huggingface/transformers/pull/30719
- [ ] swin, donut, maskformer swin, swinv2 #30656 @the-neural-networker
- [ ] swin2sr
- [ ] tvp @bhuvanmdev
- [ ] vit_mae #30657 @bhuvanmdev
- [ ] vivit #30630 @jla524
Motivation
Let's add this to more models, to leverage existing checkpoints for new cases!
Your contribution
For anyone who would like to contribute, please comment on this issue to claim a model you'd like to work on, and share a link to your PR.
Each PR should:
- Add an `interpolate_pos_encoding` method (a sketch of what this typically looks like is shown below)
- Add a test showing the model can correctly interpolate an input image of a different size
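For reference, the core of such a method in models like ViT reshapes the learned patch position embeddings into a 2D grid, resizes it with bicubic interpolation, and flattens it back. The standalone sketch below is only illustrative; the function name, signature, and class-token handling are assumptions and will differ per model:

```python
import math
import torch
import torch.nn as nn


def interpolate_pos_encoding(
    position_embeddings: torch.Tensor, height: int, width: int, patch_size: int
) -> torch.Tensor:
    """Resize learned position embeddings to match a new input resolution.

    `position_embeddings` has shape (1, 1 + num_patches, dim): a class token
    embedding followed by one embedding per patch of the pretraining resolution.
    """
    class_pos_embed = position_embeddings[:, :1]
    patch_pos_embed = position_embeddings[:, 1:]
    dim = position_embeddings.shape[-1]

    # Number of patches along each axis for the original and new resolutions
    num_positions = patch_pos_embed.shape[1]
    orig_size = int(math.sqrt(num_positions))
    new_height = height // patch_size
    new_width = width // patch_size

    # Reshape to a 2D grid, interpolate bicubically, then flatten back
    patch_pos_embed = patch_pos_embed.reshape(1, orig_size, orig_size, dim).permute(0, 3, 1, 2)
    patch_pos_embed = nn.functional.interpolate(
        patch_pos_embed, size=(new_height, new_width), mode="bicubic", align_corners=False
    )
    patch_pos_embed = patch_pos_embed.permute(0, 2, 3, 1).view(1, -1, dim)

    return torch.cat((class_pos_embed, patch_pos_embed), dim=1)
```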
There was a PR opened to add this to CLIP models, which is now inactive, but useful for reference of the changes to make: https://github.com/huggingface/transformers/pull/27457
Once the PR is ready, you can ping me for review 🤗
I can take Clip and Blip2.
Some heads up here; people have complained about the fact that `interpolate_pos_encoding` cannot be passed when using the `Trainer` to train models on higher-resolution images. I'm also not that happy that I named it `interpolate_pos_encoding`; it should have been called `interpolate_position_embeddings`.
I can work on vit_mae and tvp.
Thanks for the heads up @NielsRogge!

> Some heads up here; people have complained about the fact that interpolate_pos_encoding cannot be passed when using the Trainer to train models on higher-resolution images

OK, that's good to know. If many models have this, it's a good reason to spend some time figuring out a solution! The most important thing is that it works with a standard forward/backward pass - if that's working, we should be able to find a way to integrate it if it's a wanted feature.

> I also am not that happy I named it interpolate_pos_encoding, should have been called interpolate_position_embeddings.

Agreed, `interpolate_position_embeddings` would have been better originally. Now that `interpolate_pos_encoding` is implemented across the library, I'd say it's better to stick with it to be consistent.
Yes, so the problem is that the `Trainer` does not allow passing any keyword arguments to the forward of a model. However, there's a workaround: https://discuss.huggingface.co/t/fine-tuning-vit-with-more-patches-higher-resolution/18731/4?u=nielsr
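One way to work around it (a sketch under the assumption that the model accepts `interpolate_pos_encoding` in its forward and returns a loss when labels are provided; this is not necessarily the exact approach from the linked thread, and the class name is made up) is to subclass `Trainer` so the flag is always passed:

```python
from transformers import Trainer


class InterpolatingTrainer(Trainer):
    """Hypothetical Trainer subclass that always forwards interpolate_pos_encoding=True."""

    def compute_loss(self, model, inputs, return_outputs=False):
        # Pass the flag explicitly, since Trainer does not forward arbitrary kwargs to the model
        outputs = model(**inputs, interpolate_pos_encoding=True)
        loss = outputs.loss
        return (loss, outputs) if return_outputs else loss
```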
I can work on deit!
I'd like to work on vivit
> I can take Clip and Blip2.

Hi ashavinni, I am new to open source. Can you help me a little to get started with this task?
I can work on chinese_clip. Will keep the team posted in the next few days. If I get more free time and there are remaining ones by then, happy to help out on additional tasks.
Working on detr, a bit tricky. Will explain in the PR.
Actually, I can also take bridgetower as well. They will come in as separate PRs though. Shouldn't be more complicated than chinese_clip. So recap: I will work on both bridgetower and chinese_clip.
@amyeroberts, how do you manage this with `make fix-copies`? Most of the models are copied from CLIP, so we eventually end up changing models that others have claimed. I changed GIT, but it is copied from CLIP, and that in turn triggers cascading changes. Or should we avoid `make fix-copies` altogether before sending a PR?
I will work on Swin, since DeiT is already implemented.
I will work on owlvit.
@nileshkokane01 This is a good point - I'll update the list of models to indicate models which are "grouped" together. In the case of e.g. the CLIP family, there should just be one PR opened for adding the feature to CLIP and the models which are copied from it. The steps would be:
- Make the changes for CLIP
- Run `make fix-copies` to propagate them to the models which copy from CLIP
- Update those models so the feature is properly applied to all of them
- Add tests for all the affected models
@nileshkokane01 @amyeroberts In that case, I will refrain from working on chinese_clip and bridgetower, since both have `# Copied from transformers.models.clip.modeling_clip.CLIPVisionEmbeddings with CLIP` in the comments. I think Kosmos-2 may also be copied from CLIP. Most likely a fair amount of the models on the list will be inheriting from CLIP (just a heads up to other folks).

Update: oh nice, thank you Amy for updating the description to group them.
I can take siglip. I think some functions are still copied from CLIP, but just skimming it, they don't seem related to the interpolate position encoding code.
@amyeroberts Doesn't idefics2 already handle this?
https://github.com/huggingface/transformers/blob/cf7bed98325a0be9d195cb6b66c6a0bef9fccbc8/src/transformers/models/idefics2/modeling_idefics2.py#L139-L149
For example, the following sample script:
```python
import torch
import requests
from PIL import Image
from transformers import Idefics2Processor, Idefics2ForConditionalGeneration

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

url = "https://upload.wikimedia.org/wikipedia/commons/c/cc/ESC_large_ISS022_ISS022-E-11387-edit_01.JPG"
images = [Image.open(requests.get(url, stream=True).raw)]

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What's on the image?"},
        {"type": "image"},
    ],
}]

processor = Idefics2Processor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)
# Instead of the default 980, allow the largest edge to be 1500
processor.image_processor.size["longest_edge"] = 1500

model = Idefics2ForConditionalGeneration.from_pretrained("HuggingFaceM4/idefics2-8b").to(device)

text = processor.apply_chat_template(messages)
inputs = processor(text=text, images=images, return_tensors="pt", padding=True)
for k, v in inputs.items():
    inputs[k] = v.to(device)

print("Input image shape:", inputs["pixel_values"].shape)

with torch.no_grad():
    out = model(**inputs)

print("Finished!")
```
Executes without errors and prints the following:
```
Loading checkpoint shards: 100%|████████████████████████| 7/7 [00:03<00:00, 2.26it/s]
Input image shape: torch.Size([1, 1, 3, 994, 1500])
Finished!
```
Since all CLIP-like models can just borrow the changes made to the CLIP model, I will take tvp instead of altclip.
@zafstojano Indeed! That's what I get for doing a quick grep and not double checking. Thanks for showing an example to verify. I'll take it off the list
Completed the PR #30719. I actually realized: Should I be referencing this issue directly in my PR? Because if any of our PRs merge then it may end up closing this issue. Should we make a child issue stemming from this instead?
@davidgxue Good question! If, instead of 'Fixes #xxx', the PR says something else like 'Addresses #xxx', or just mentions this issue, then it will be linked and the issue won't be closed upon merge. It's not a big deal if it's closed accidentally, other than the additional notifications, as I can just re-open it.
Opened a PR (#30722) addressing this issue for the BLIP family of models (BLIP, BLIP2, InstructBLIP).
@amyeroberts I would like to work on DETR. Is anyone working on it?
> @amyeroberts I would like to work on DETR. Is anyone working on it?

I'm almost done. Was busy with work in the past 2 weeks.
I'll be working on grounding_dino and hopefully I will have a PR soon.
@MightyStud Thanks for picking a model and working to add this feature! After reviewing #30921, I realised that this isn't something we can add for models with backbones, which includes grounding DINO and DETR related models. I've updated the list to reflect this.
@amyeroberts Aha, thanks for letting me know, I'd like to work on swin2sr then since I already allocated time this week.
Hi @amyeroberts Can I try out beit or data2vec?
@OmarManzoor Certainly!