Uniform kwargs for processors
Feature request
We want to standardize the logic flow through Processor classes. Since processors can have different kwargs depending on the model and modality, we are adding a TypedDict for each modality to keep track of which kwargs are accepted.
The initial design is merged, and an example model has been modified to follow the new uniform processor kwargs in https://github.com/huggingface/transformers/pull/31198. Also, https://github.com/huggingface/transformers/pull/31197 has two more examples with the standardized API.
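Roughly, the standardized pattern looks like the minimal sketch below. The model name, kwargs class, and default values here are placeholders; the real per-modality TypedDicts and the `_merge_kwargs` helper live in `processing_utils.py`, so check the PRs above for the exact API:

```python
from transformers import BatchFeature
from transformers.processing_utils import ProcessingKwargs, ProcessorMixin, Unpack


class MyModelProcessorKwargs(ProcessingKwargs, total=False):
    # Per-modality TypedDicts (text_kwargs, images_kwargs, ...) are inherited from
    # ProcessingKwargs; each processor only declares its model-specific defaults.
    _defaults = {
        "text_kwargs": {"padding": False},
        "images_kwargs": {},
    }


class MyModelProcessor(ProcessorMixin):
    attributes = ["image_processor", "tokenizer"]
    image_processor_class = "AutoImageProcessor"
    tokenizer_class = "AutoTokenizer"

    def __call__(
        self,
        images=None,
        text=None,
        audio=None,
        videos=None,
        **kwargs: Unpack[MyModelProcessorKwargs],
    ) -> BatchFeature:
        # Merge call-time kwargs with the TypedDict defaults and the tokenizer's
        # init kwargs, then route each group to the matching sub-processor.
        output_kwargs = self._merge_kwargs(
            MyModelProcessorKwargs,
            tokenizer_init_kwargs=self.tokenizer.init_kwargs,
            **kwargs,
        )
        text_inputs = self.tokenizer(text, **output_kwargs["text_kwargs"])
        image_inputs = self.image_processor(images, **output_kwargs["images_kwargs"])
        return BatchFeature(data={**text_inputs, **image_inputs})
```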
This design needs to be rolled out to all processors in Transformers, and we'd appreciate contributions. Below is an (incomplete) list of models that need standardization; feel free to add a model if it's missing:
- [x] Align #31368
- [x] AltClip #31368
- [x] BLIP #31368
- [x] BLIP-2 #31368
- [x] Bridgetower #31368
- [x] Chameleon -> https://github.com/huggingface/transformers/pull/32181
- [x] Chinese CLIP #31368
- [x] CLIP -> in progress by @davidgxue
- [ ] ClipSeg
- [x] Donut #31368
- [x] Fuyu -> https://github.com/huggingface/transformers/pull/32544
- [x] Grounding DINO #31964
- [x] Idefics -> https://github.com/huggingface/transformers/pull/32568
- [x] Idefics-2 -> https://github.com/huggingface/transformers/pull/32568
- [x] InstructBlip -> https://github.com/huggingface/transformers/pull/32544
- [ ] InstructBlipVideo
- [x] Kosmos-2 -> https://github.com/huggingface/transformers/pull/32544
- [x] LayoutLM (1, 2, 3) -> https://github.com/huggingface/transformers/pull/32180
- [x] LLaVa -> https://github.com/huggingface/transformers/pull/32858
- [x] LLaVa-NeXT -> https://github.com/huggingface/transformers/pull/32544
- [ ] LLaVa-NeXT-Video
- [x] Nougat -> https://github.com/huggingface/transformers/pull/32841
- [ ] Owlv2
- [x] SigLip -> https://github.com/huggingface/transformers/pull/32842
- [x] Paligemma -> https://github.com/huggingface/transformers/pull/32377
- [x] Pix2Struct -> https://github.com/huggingface/transformers/pull/32544
- [x] Udop -> https://github.com/huggingface/transformers/pull/32544
- [ ] VideoLLaVa
Note: for now we'll start with image-only and image+text processors. https://github.com/huggingface/transformers/pull/31368 is an ongoing PR that also includes audio processor standardization.
cc @molbap @NielsRogge, I added only the models I see most commonly, plus all VLMs, to the list to unblock the pipeline.
I can take CLIP and LLaVa
@davidgxue okay, feel free to open a PR when it's ready.
I would like to work on BLIP-2. Just to clarify, we only need to change BLIP-2, right, not BLIP? Because there is a comment which mentions
`# Copied from transformers.models.blip.processing_blip.BlipProcessor.__call__`
@OmarManzoor my bad, I forgot to add BLIP to the list. You can work on BLIP, and all changes from BLIP will be ported to BLIP-2 automatically :)
I'll add BLIP to the list and assign it to you then.
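(For context, the porting happens through the `# Copied from ...` markers: BLIP-2's processor `__call__` is marked as copied from BLIP's, so running `make fix-copies` regenerates it from the BLIP version. A rough illustration, with the body elided:)

```python
from transformers.processing_utils import ProcessorMixin


class Blip2Processor(ProcessorMixin):
    # Copied from transformers.models.blip.processing_blip.BlipProcessor.__call__
    def __call__(self, images=None, text=None, **kwargs):
        # The body stays in sync with BlipProcessor.__call__: edit the BLIP version,
        # then run `make fix-copies` to propagate the change here automatically.
        ...
```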
@OmarManzoor @zucchini-nlp missed it, I already started work on a few models. Please check the original PRs, https://github.com/huggingface/transformers/pull/31198 and https://github.com/huggingface/transformers/pull/31368; BLIP, BLIP-2, Donut, and a couple more are already handled.
Thank you for clarifying.
@leloykun this is one of the trackers we have to start with. There's another PR for standardizing VLMs from the generation perspective. And unfortunately other tasks will be blocked by these.
If you want to work on this task, or maybe on making a wrapper for a VLMTokenizer, let me know!
thanks @zucchini-nlp!
I can take LayoutLM (1, 2, 3) & Chameleon
I can take Owlv2 and ViT. But for Owlv2 there are multiple helper functions and classes that are copied from OwlViT, so does that mean I need to work on OwlViT?
`# Copied from transformers.models.owlvit.processing_owlvit.OwlViTProcessor.__call__ with OWLViT->OWLv2`
Can I take dino and Paligemma if no ones working on it?
@bhuvanmdev yes, if the OWLv2 processing code is identical to OWL-ViT's, it will simply be copied from OWL-ViT.
@MnCSSJ4x sure
@zucchini-nlp I started working on Paligemma and tried to follow the PRs mentioned here. In Paligemma there is no test file for the processor. Do I need to add those tests (if so, please point me to how I can do that), and can they be used directly to check that the changes are non-breaking? I can raise a temporary PR so that we can discuss it there.
@MnCSSJ4x yes, feel free to open a PR so that we can discuss it there. And yes, in that case we need to add the test file and ProcessorTesterMixin to it, so that new changes are all tested
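For example, the test file could look roughly like this (assuming it lives at `tests/models/paligemma/test_processor_paligemma.py`; the checkpoint name is just an example, and the exact mixin hooks may differ):

```python
import tempfile
import unittest

from transformers import PaliGemmaProcessor
from transformers.testing_utils import require_vision

from ...test_processing_common import ProcessorTesterMixin


@require_vision
class PaliGemmaProcessorTest(ProcessorTesterMixin, unittest.TestCase):
    processor_class = PaliGemmaProcessor

    def setUp(self):
        # The mixin's shared tests (text kwargs, image kwargs, defaults, ...) load the
        # processor from this temp dir, so save a small pretrained checkpoint to it.
        self.tmpdirname = tempfile.mkdtemp()
        processor = PaliGemmaProcessor.from_pretrained("google/paligemma-3b-pt-224")
        processor.save_pretrained(self.tmpdirname)
```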
Thanks, I have created PR #32377 and tagged you there. Please let me know how I can get started on testing.
I can work on the remaining image-text-to-text models (Fuyu, Idefics/2, InstructBlip, Kosmos-2, LLaVa-NeXT) as I have already been working on their processors for https://github.com/huggingface/transformers/pull/32471 .
@yonigozlan Thanks, that would be great!
Also added Udop to the list and to #32544 .
Hey, I can take ClipSeg.
Hi @davidgxue and @MnCSSJ4x, I was wondering if you've made any recent progress on uniformizing LLaVa and Paligemma. They're some of the last image-text-to-text models left to uniformize, and it would be great to get them done to unblock the image-text-to-text pipeline. If not, I can help get them finished, as they should be similar to other image-text-to-text models that are being uniformized. Thanks!
Hey. I had written a draft PR for Paligemma. I was blocked by some coursework, so I wasn't able to resolve the comments or write the test suite. You can start LLaVa if no one else has taken it. Also, feel free to ping me in that PR, as I also need some help with writing tests and understanding the reference codebase.
@yonigozlan I am so sorry about the delay. I made some progress locally a while ago, but then completely forgot about this github issue. Will try to get it out today or tomorrow. Will keep you posted if anything changes.
Hey @yonigozlan actually, feel free to work on LLAVA. I realized I previously made progress on CLIP not LLAVA. And I am also getting a lot of work lately, so probably won't be able to help out on that one. Sorry about that! CLIP will come around tho!
Since nobody has claimed them yet, dibs on Nougat and SigLip
They should now be ready for review + they already have backwards compatibility support
Btw, DepthAnything, Dino, Maskformer, & VIT don't have processors
@zucchini-nlp here's for the rest of the processors: https://github.com/huggingface/transformers/pull/32845
@leloykun WOW, thanks a lot! I can review those on Monday, today will be a bit busy
Thanks too!
For now, the PR for Chameleon, https://github.com/huggingface/transformers/pull/32181, is the safest to merge as (1) the processor doesn't expect special args (e.g. text_pair and such) and (2) the PR also already has tests
The PRs for Nougat https://github.com/huggingface/transformers/pull/32841 and the LayoutLM models https://github.com/huggingface/transformers/pull/32180 need more thought, as their processors expect special args (cc @yonigozlan, I think @molbap is right that our current implementation is kinda wonky), while the other PRs don't have tests yet.
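For concreteness, one way such special args could be expressed in the uniform design (purely illustrative, not necessarily what these PRs settle on) is to extend the per-modality TypedDicts with model-specific keys so they still flow through the shared `**kwargs` interface:

```python
from typing import List, Optional

from transformers.processing_utils import ProcessingKwargs, TextKwargs


class LayoutLMv3TextKwargs(TextKwargs, total=False):
    # Hypothetical extension: model-specific args declared next to the shared
    # text kwargs so the processor can accept them via the uniform **kwargs.
    boxes: Optional[List[List[int]]]
    word_labels: Optional[List[int]]


class LayoutLMv3ProcessorKwargs(ProcessingKwargs, total=False):
    text_kwargs: LayoutLMv3TextKwargs
    _defaults = {
        "text_kwargs": {"padding": False},
    }
```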
No problem @davidgxue! I will get started on LLaVa then
Summary of my progress:
| Model | Status | Has Tests? | Special Args | PR |
|---|---|---|---|---|
| Chameleon | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32181 |
| --- | --- | --- | --- | --- |
| AltCLIP | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| Flava | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| Git | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| InstructBlipVideo | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| LLaVa-NeXT-Video | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| MGP | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| Siglip | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| TVP | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| VideoLLaVa | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| VILT | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| X-CLIP | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| --- | --- | --- | --- | --- |
| LayoutLMv2 | Ready | Yes | ~~text_pair~~, boxes, word_labels | https://github.com/huggingface/transformers/pull/32180 |
| LayoutLMv3 | Ready | Yes | ~~text_pair~~, boxes, word_labels | https://github.com/huggingface/transformers/pull/32180 |
| LayoutXLM | Ready | Yes | ~~text_pair~~, boxes, word_labels | https://github.com/huggingface/transformers/pull/32180 |
| --- | --- | --- | --- | --- |
| ClipSeg | Ready | Yes | visual_prompt | https://github.com/huggingface/transformers/pull/32841 |
| Nougat | Ready | Yes | ~~text_pair, text_target, text_pair_target~~ (apparently these are in the tokenizer base class) | https://github.com/huggingface/transformers/pull/32841 |
| OwlV2 | Ready | Yes | query_images | https://github.com/huggingface/transformers/pull/32841 |
| OwlVIT | Ready | Yes | query_images | https://github.com/huggingface/transformers/pull/32841 |
| --- | --- | --- | --- | --- |
| Clap | Not Ready | No | ?? | https://github.com/huggingface/transformers/pull/32906 |
| CLVP | Not Ready | No | ?? | https://github.com/huggingface/transformers/pull/32906 |
| MusicGen Melody | Not Ready | No | ?? | https://github.com/huggingface/transformers/pull/32906 |
| Pop2Piano | Not Ready | No | ?? | https://github.com/huggingface/transformers/pull/32906 |
| Qwen2 Audio | Not Ready | No | ?? | https://github.com/huggingface/transformers/pull/32906 |
| Seamless M4T | Not Ready | No | ?? | https://github.com/huggingface/transformers/pull/32906 |
| SpeechT5 | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32906 |
| Wav2Vec2 Bert | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32906 |
~~https://github.com/huggingface/transformers/pull/32181 & #32845 are now ready for review. I could also decouple InstructBlipVideo, LLaVa-NeXT-Video, Siglip, and VideoLLaVa into a separate PR just to get them out; just lemme know if there's a need to rush.~~
~~The rest have special args that we still have to figure out how to handle. cc @yonigozlan~~
Update: all of the PRs are now ready for review