Uniform kwargs for processors
Feature request
We want to standardize the logic flow through Processor classes. Since processors can have different kwargs depending on the model and modality, we are adding a TypedDict for each modality to keep track of which kwargs are accepted.
The initial design is merged, and an example model has been modified to follow the new uniform processor kwargs in https://github.com/huggingface/transformers/pull/31198. Also, https://github.com/huggingface/transformers/pull/31197 has two more examples with the standardized API.
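Roughly, the standardized pattern looks like the minimal sketch below. The model name, kwargs class, and default values here are placeholders; the real per-modality TypedDicts and the `_merge_kwargs` helper live in `processing_utils.py`, so check the PRs above for the exact API:

```python
from transformers import BatchFeature
from transformers.processing_utils import ProcessingKwargs, ProcessorMixin, Unpack


class MyModelProcessorKwargs(ProcessingKwargs, total=False):
    # Per-modality TypedDicts (text_kwargs, images_kwargs, ...) are inherited from
    # ProcessingKwargs; each processor only declares its model-specific defaults.
    _defaults = {
        "text_kwargs": {"padding": False},
        "images_kwargs": {},
    }


class MyModelProcessor(ProcessorMixin):
    attributes = ["image_processor", "tokenizer"]
    image_processor_class = "AutoImageProcessor"
    tokenizer_class = "AutoTokenizer"

    def __call__(
        self,
        images=None,
        text=None,
        audio=None,
        videos=None,
        **kwargs: Unpack[MyModelProcessorKwargs],
    ) -> BatchFeature:
        # Merge call-time kwargs with the TypedDict defaults and the tokenizer's
        # init kwargs, then route each group to the matching sub-processor.
        output_kwargs = self._merge_kwargs(
            MyModelProcessorKwargs,
            tokenizer_init_kwargs=self.tokenizer.init_kwargs,
            **kwargs,
        )
        text_inputs = self.tokenizer(text, **output_kwargs["text_kwargs"])
        image_inputs = self.image_processor(images, **output_kwargs["images_kwargs"])
        return BatchFeature(data={**text_inputs, **image_inputs})
```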
This design needs to be rolled out to all processors in Transformers, and we'd appreciate contributions. Below is an (incomplete) list of models that need standardization; feel free to add a model if it's missing:
- [x] Align #31368
- [x] AltClip #31368
- [x] BLIP #31368
- [x] BLIP-2 #31368
- [x] Bridgetower #31368
- [x] Chameleon -> https://github.com/huggingface/transformers/pull/32181
- [x] Chinese CLIP #31368
- [x] CLIP -> in progress by @davidgxue
- [ ] ClipSeg
- [x] Donut #31368
- [x] Fuyu -> https://github.com/huggingface/transformers/pull/32544
- [x] Grounding DINO #31964
- [x] Idefics -> https://github.com/huggingface/transformers/pull/32568
- [x] Idefics-2 -> https://github.com/huggingface/transformers/pull/32568
- [x] InstructBlip -> https://github.com/huggingface/transformers/pull/32544
- [ ] InstructBlipVideo
- [x] Kosmos-2 -> https://github.com/huggingface/transformers/pull/32544
- [x] LayoutLM (1, 2, 3) -> https://github.com/huggingface/transformers/pull/32180
- [x] LLaVa -> https://github.com/huggingface/transformers/pull/32858
- [x] LLaVa-NeXT -> https://github.com/huggingface/transformers/pull/32544
- [ ] LLaVa-NeXT-Video
- [x] Nougat -> https://github.com/huggingface/transformers/pull/32841
- [ ] Owlv2
- [x] SigLip -> https://github.com/huggingface/transformers/pull/32842
- [x] Paligemma -> https://github.com/huggingface/transformers/pull/32377
- [x] Pix2Struct -> https://github.com/huggingface/transformers/pull/32544
- [x] Udop -> https://github.com/huggingface/transformers/pull/32544
- [ ] VideoLLaVa
Note: for now we'll start with image-only and image+text processors. https://github.com/huggingface/transformers/pull/31368 is an ongoing PR that also includes audio processor standardization.
cc @molbap @NielsRogge, I added only the models I see most commonly, plus all VLMs, to the list to unblock the pipeline.
I can take CLIP and LLaVa
@davidgxue okay, feel free to open a PR when it's ready.
I would like to work on BLIP-2. Just to clarify, we only need to change BLIP-2, right, not BLIP? Because there is a comment which mentions
`# Copied from transformers.models.blip.processing_blip.BlipProcessor.__call__`
@OmarManzoor my bad, I forgot to add BLIP to the list. You can work on BLIP, and all changes from BLIP will be ported to BLIP-2 automatically :)
I'll add BLIP to the list and assign it to you then.
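(For context, the porting happens through the `# Copied from ...` markers: BLIP-2's processor `__call__` is marked as copied from BLIP's, so running `make fix-copies` regenerates it from the BLIP version. A rough illustration, with the body elided:)

```python
from transformers.processing_utils import ProcessorMixin


class Blip2Processor(ProcessorMixin):
    # Copied from transformers.models.blip.processing_blip.BlipProcessor.__call__
    def __call__(self, images=None, text=None, **kwargs):
        # The body stays in sync with BlipProcessor.__call__: edit the BLIP version,
        # then run `make fix-copies` to propagate the change here automatically.
        ...
```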
@OmarManzoor @zucchini-nlp missed it, I already started work on a few models. Please check the original PRs, https://github.com/huggingface/transformers/pull/31198 and https://github.com/huggingface/transformers/pull/31368; BLIP, BLIP-2, Donut, and a couple more are already handled.
Thank you for clarifying.
@leloykun this is one of the trackers we have to start with. There's another PR for standardizing VLMs from the generation perspective. And unfortunately other tasks will be blocked by these.
If you want to work on this task, or maybe on making a wrapper for a VLMTokenizer, let me know!
thanks @zucchini-nlp!
I can take LayoutLM (1, 2, 3) & Chameleon
I can take Owlv2 and ViT. But for Owlv2 there are multiple helper functions and classes that are copied from OwlViT, so does that mean I need to work on OwlViT?
`# Copied from transformers.models.owlvit.processing_owlvit.OwlViTProcessor.__call__ with OWLViT->OWLv2`
Can I take dino and Paligemma if no ones working on it?
@bhuvanmdev yes, if the OWLv2 processing code is identical to OWL-ViT's, it will simply be copied from OWL-ViT.
@MnCSSJ4x sure
@zucchini-nlp I started working on Paligemma and tried to follow the PRs mentioned here. In Paligemma there is no test file for the processor. Do I need to add those tests (if so, please point me to how I can do that), and can they be used directly to check that the changes are non-breaking? I can raise a temporary PR so that we can discuss it there.
@MnCSSJ4x yes, feel free to open a PR so that we can discuss it there. And yes, in that case we need to add the test file and ProcessorTesterMixin to it, so that new changes are all tested
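For example, the test file could look roughly like this (assuming it lives at `tests/models/paligemma/test_processor_paligemma.py`; the checkpoint name is just an example, and the exact mixin hooks may differ):

```python
import tempfile
import unittest

from transformers import PaliGemmaProcessor
from transformers.testing_utils import require_vision

from ...test_processing_common import ProcessorTesterMixin


@require_vision
class PaliGemmaProcessorTest(ProcessorTesterMixin, unittest.TestCase):
    processor_class = PaliGemmaProcessor

    def setUp(self):
        # The mixin's shared tests (text kwargs, image kwargs, defaults, ...) load the
        # processor from this temp dir, so save a small pretrained checkpoint to it.
        self.tmpdirname = tempfile.mkdtemp()
        processor = PaliGemmaProcessor.from_pretrained("google/paligemma-3b-pt-224")
        processor.save_pretrained(self.tmpdirname)
```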
Thanks, I have created PR #32377 and tagged you there. Please let me know how I can get started on testing.
I can work on the remaining image-text-to-text models (Fuyu, Idefics/2, InstructBlip, Kosmos-2, LLaVa-NeXT) as I have already been working on their processors for https://github.com/huggingface/transformers/pull/32471 .
@yonigozlan Thanks, that would be great!
Also added Udop to the list and to #32544 .
Hey, I can take ClipSeg.
Hi @davidgxue and @MnCSSJ4x, I was wondering if you've made any recent progress on uniformizing LLaVa and Paligemma. They're some of the last image-text-to-text models left to uniformize, and it would be great to get them done to unblock the image-text-to-text pipeline. If not, I can help get them finished, as they should be similar to other image-text-to-text models that are being uniformized. Thanks!
Hey. I had written a draft PR for Paligemma. I was blocked by some coursework, so I wasn't able to resolve the comments or write the test suite. You can start LLaVa if no one else has taken it. Also, feel free to ping me in that PR, as I also need some help with writing tests and understanding the reference codebase.
@yonigozlan I am so sorry about the delay. I made some progress locally a while ago, but then completely forgot about this github issue. Will try to get it out today or tomorrow. Will keep you posted if anything changes.
Hey @yonigozlan actually, feel free to work on LLAVA. I realized I previously made progress on CLIP not LLAVA. And I am also getting a lot of work lately, so probably won't be able to help out on that one. Sorry about that! CLIP will come around tho!
Since nobody has claimed them yet, dibs on Nougat and SigLip
They should now be ready for review + they already have backwards compatibility support
Btw, DepthAnything, Dino, Maskformer, & VIT don't have processors
@zucchini-nlp here's for the rest of the processors: https://github.com/huggingface/transformers/pull/32845
@leloykun WOW, thanks a lot! I can review those on Monday, today will be a bit busy
Thanks too!
For now, the PR for Chameleon, https://github.com/huggingface/transformers/pull/32181, is the safest to merge as (1) the processor doesn't expect special args (e.g. text_pair and such) and (2) the PR also already has tests
The PRs for Nougat https://github.com/huggingface/transformers/pull/32841 and the LayoutLM models https://github.com/huggingface/transformers/pull/32180 need more thought, as their processors expect special args (cc @yonigozlan, I think @molbap is right that our current implementation is kinda wonky), while the other PRs don't have tests yet.
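For concreteness, one way such special args could be expressed in the uniform design (purely illustrative, not necessarily what these PRs settle on) is to extend the per-modality TypedDicts with model-specific keys so they still flow through the shared `**kwargs` interface:

```python
from typing import List, Optional

from transformers.processing_utils import ProcessingKwargs, TextKwargs


class LayoutLMv3TextKwargs(TextKwargs, total=False):
    # Hypothetical extension: model-specific args declared next to the shared
    # text kwargs so the processor can accept them via the uniform **kwargs.
    boxes: Optional[List[List[int]]]
    word_labels: Optional[List[int]]


class LayoutLMv3ProcessorKwargs(ProcessingKwargs, total=False):
    text_kwargs: LayoutLMv3TextKwargs
    _defaults = {
        "text_kwargs": {"padding": False},
    }
```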
No problem @davidgxue! I will get started on LLaVa then
Summary of my progress:
| Model | Status | Has Tests? | Special Args | PR |
|---|---|---|---|---|
| Chameleon | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32181 |
| --- | --- | --- | --- | --- |
| AltCLIP | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| Flava | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| Git | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| InstructBlipVideo | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| LLaVa-NeXT-Video | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| MGP | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| Siglip | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| TVP | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| VideoLLaVa | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| VILT | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| X-CLIP | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| --- | --- | --- | --- | --- |
| LayoutLMv2 | Ready | Yes | ~~text_pair~~, boxes, word_labels | https://github.com/huggingface/transformers/pull/32180 |
| LayoutLMv3 | Ready | Yes | ~~text_pair~~, boxes, word_labels | https://github.com/huggingface/transformers/pull/32180 |
| LayoutXLM | Ready | Yes | ~~text_pair~~, boxes, word_labels | https://github.com/huggingface/transformers/pull/32180 |
| --- | --- | --- | --- | --- |
| ClipSeg | Ready | Yes | visual_prompt | https://github.com/huggingface/transformers/pull/32841 |
| Nougat | Ready | Yes | ~~text_pair, text_target, text_pair_target~~ (apparently these are in the tokenizer base class) | https://github.com/huggingface/transformers/pull/32841 |
| OwlV2 | Ready | Yes | query_images | https://github.com/huggingface/transformers/pull/32841 |
| OwlVIT | Ready | Yes | query_images | https://github.com/huggingface/transformers/pull/32841 |
| --- | --- | --- | --- | --- |
| Clap | Not Ready | No | ?? | https://github.com/huggingface/transformers/pull/32906 |
| CLVP | Not Ready | No | ?? | https://github.com/huggingface/transformers/pull/32906 |
| MusicGen Melody | Not Ready | No | ?? | https://github.com/huggingface/transformers/pull/32906 |
| Pop2Piano | Not Ready | No | ?? | https://github.com/huggingface/transformers/pull/32906 |
| Qwen2 Audio | Not Ready | No | ?? | https://github.com/huggingface/transformers/pull/32906 |
| Seamless M4T | Not Ready | No | ?? | https://github.com/huggingface/transformers/pull/32906 |
| SpeechT5 | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32906 |
| Wav2Vec2 Bert | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32906 |
~~https://github.com/huggingface/transformers/pull/32181 & #32845 are now ready for review. I could also decouple InstructBlipVideo, LLaVa-NeXT-Video, Siglip, and VideoLLaVa into a separate PR just to get them out; just lemme know if there's a need to rush.~~
~~The rest have special args that we still have to figure out how to handle. cc @yonigozlan~~
Update: all of the PRs are now ready for review