
Uniform kwargs for processors

Open zucchini-nlp opened this issue 1 year ago • 43 comments

Feature request

We want to standardize the logic flow through Processor classes. Since processors can have different kwargs depending on the model and modality, we are adding a TypedDict for each modality to keep track of which kwargs are accepted.
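A rough, self-contained sketch of the idea (the names and fields below are illustrative only, not the actual definitions — see `transformers.processing_utils` in the linked PRs for the real ones):

```python
from typing import Optional, TypedDict

# Illustrative per-modality TypedDicts (simplified; the real ones in
# transformers.processing_utils carry many more fields).
class TextKwargs(TypedDict, total=False):
    padding: bool
    truncation: bool
    max_length: Optional[int]

class ImagesKwargs(TypedDict, total=False):
    do_resize: bool
    size: Optional[dict]

class ProcessingKwargs(TypedDict, total=False):
    text_kwargs: TextKwargs
    images_kwargs: ImagesKwargs

def merge_kwargs(defaults: ProcessingKwargs, **kwargs) -> ProcessingKwargs:
    """Merge caller kwargs over per-model defaults, routing each kwarg to
    the modality whose TypedDict declares it (hypothetical helper)."""
    merged: ProcessingKwargs = {"text_kwargs": {}, "images_kwargs": {}}
    annotations = {
        "text_kwargs": TextKwargs.__annotations__,
        "images_kwargs": ImagesKwargs.__annotations__,
    }
    for modality in merged:
        merged[modality].update(defaults.get(modality, {}))
        allowed = annotations[modality]
        merged[modality].update({k: v for k, v in kwargs.items() if k in allowed})
    return merged

out = merge_kwargs({"text_kwargs": {"padding": True}}, truncation=True, do_resize=False)
```

Because each modality has its own TypedDict, a processor's `__call__` can accept a flat set of kwargs and dispatch them unambiguously, while per-model defaults stay in one place.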

The initial design has been merged, and an example model has been updated to follow the new uniform processor kwargs in https://github.com/huggingface/transformers/pull/31198. Also, https://github.com/huggingface/transformers/pull/31197 has two more examples with the standardized API.

This design has to be rolled out to all processors in Transformers, and contributions are appreciated. Below is an incomplete list of models that need standardization; feel free to add a model if it's missing:

  • [x] Align #31368
  • [x] AltClip #31368
  • [x] BLIP #31368
  • [x] BLIP-2 #31368
  • [x] Bridgetower #31368
  • [x] Chameleon -> https://github.com/huggingface/transformers/pull/32181
  • [x] Chinese CLIP #31368
  • [x] CLIP -> in progress by @davidgxue
  • [ ] ClipSeg
  • [x] Donut #31368
  • [x] Fuyu -> https://github.com/huggingface/transformers/pull/32544
  • [x] Grounding DINO #31964
  • [x] Idefics -> https://github.com/huggingface/transformers/pull/32568
  • [x] Idefics-2 -> https://github.com/huggingface/transformers/pull/32568
  • [x] InstructBlip -> https://github.com/huggingface/transformers/pull/32544
  • [ ] InstructBlipVideo
  • [x] Kosmos-2 -> https://github.com/huggingface/transformers/pull/32544
  • [x] LayoutLM (1, 2, 3) -> https://github.com/huggingface/transformers/pull/32180
  • [x] LLaVa -> https://github.com/huggingface/transformers/pull/32858
  • [x] LLaVa-NeXT -> https://github.com/huggingface/transformers/pull/32544
  • [ ] LLaVa-NeXT-Video
  • [x] Nougat -> https://github.com/huggingface/transformers/pull/32841
  • [ ] Owlv2
  • [x] SigLip -> https://github.com/huggingface/transformers/pull/32842
  • [x] Paligemma -> https://github.com/huggingface/transformers/pull/32377
  • [x] Pix2Struct -> https://github.com/huggingface/transformers/pull/32544
  • [x] Udop -> https://github.com/huggingface/transformers/pull/32544
  • [ ] VideoLLaVa

Note: For now we'll start with image and image+text processors; https://github.com/huggingface/transformers/pull/31368 is an ongoing PR that also covers audio processor standardization.

Motivation

.

Your contribution

.

zucchini-nlp avatar Jul 11 '24 13:07 zucchini-nlp

cc @molbap @NielsRogge, I added only the models I see commonly, plus all VLMs, to unblock the pipeline

zucchini-nlp avatar Jul 11 '24 13:07 zucchini-nlp

I can take CLIP and LLaVa

davidgxue avatar Jul 11 '24 18:07 davidgxue

@davidgxue okay, feel free to open a PR when it's ready.

zucchini-nlp avatar Jul 12 '24 04:07 zucchini-nlp

I would like to work on BLIP-2. Just to clarify: we only need to change BLIP-2, right, and not BLIP? I ask because there is a comment which mentions

# Copied from transformers.models.blip.processing_blip.BlipProcessor.__call__

OmarManzoor avatar Jul 12 '24 15:07 OmarManzoor

@OmarManzoor my bad, forgot to add Blip to the list. You can work on Blip and all changes from BLIP will be ported to BLIP2 automatically :)

I'll add Blip to the list and assign to you then

zucchini-nlp avatar Jul 12 '24 15:07 zucchini-nlp

@OmarManzoor @zucchini-nlp missed it, I already started work on a few models here. Please check the original PRs here https://github.com/huggingface/transformers/pull/31198 and here https://github.com/huggingface/transformers/pull/31368 , BLIP, BLIP-2, Donut and a couple more are already handled

molbap avatar Jul 15 '24 08:07 molbap

> Please check the original PRs here https://github.com/huggingface/transformers/pull/31198 and here https://github.com/huggingface/transformers/pull/31368 , BLIP, BLIP-2, Donut and a couple more are already handled

Thank you for clarifying.

OmarManzoor avatar Jul 15 '24 08:07 OmarManzoor

@leloykun this is one of the trackers we have to start with. There's another PR for standardizing VLMs from the generation perspective. And unfortunately other tasks will be blocked by these.

If you want to work on this task or maybe in making a wrapper for VLMTokenizer, let me know!

zucchini-nlp avatar Jul 19 '24 15:07 zucchini-nlp

thanks @zucchini-nlp!

I can take LayoutLM (1, 2, 3) & Chameleon

leloykun avatar Jul 19 '24 16:07 leloykun

I can take Owlv2 and ViT. But for Owlv2 there are multiple helper functions and classes that are copied from OwlViT, so does that mean I need to work on OwlViT?

# Copied from transformers.models.owlvit.processing_owlvit.OwlViTProcessor.__call__ with OWLViT->OWLv2

bhuvanmdev avatar Jul 28 '24 02:07 bhuvanmdev

Can I take DINO and Paligemma if no one's working on them?

MnCSSJ4x avatar Jul 28 '24 21:07 MnCSSJ4x

@bhuvanmdev yes, if the OwlViT processing code is identical to Owlv2's, it will simply be copied from

@MnCSSJ4x sure

zucchini-nlp avatar Jul 29 '24 05:07 zucchini-nlp

@zucchini-nlp I started working on Paligemma and tried to follow the PRs mentioned here. In Paligemma there is no test file for the processor. Do I need to add those tests (if that's the case, please point me to how I can do that), and can that be used directly to check that the changes are non-breaking? I can raise a temporary PR so that we can discuss it there.

MnCSSJ4x avatar Jul 30 '24 16:07 MnCSSJ4x

@MnCSSJ4x yes, feel free to open a PR so that we can discuss it there. And yes, in that case we need to add the test file and ProcessorTesterMixin to it, so that new changes are all tested

zucchini-nlp avatar Jul 31 '24 05:07 zucchini-nlp
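For context, a new processor test file hooks into a shared mixin roughly like this (a hypothetical, self-contained sketch: the real ProcessorTesterMixin lives in tests/test_processing_common.py and runs many more checks; `DummyProcessor` here stands in for the model's actual processor):

```python
import unittest

# Simplified stand-in for transformers' ProcessorTesterMixin: shared tests
# that every model-specific test class inherits.
class ProcessorTesterMixin:
    processor_class = None  # set by the concrete test class

    def get_processor(self):
        raise NotImplementedError("concrete test classes build a small processor here")

    def test_accepts_text_kwargs(self):
        # A shared check: the processor tokenizes text and returns input_ids.
        processor = self.get_processor()
        out = processor(text="hello", padding=True)
        self.assertIn("input_ids", out)

# Hypothetical minimal processor used only for this sketch.
class DummyProcessor:
    def __call__(self, text=None, **kwargs):
        # Fake "tokenization": one id per character.
        return {"input_ids": [list(range(len(text)))]}

# The per-model test file subclasses the mixin plus unittest.TestCase.
class DummyProcessorTest(ProcessorTesterMixin, unittest.TestCase):
    processor_class = DummyProcessor

    def get_processor(self):
        return DummyProcessor()
```

With this pattern, adding the test file for a newly uniformized processor mostly means providing `processor_class` and a `get_processor`-style fixture; the inherited mixin tests then exercise the standardized kwargs automatically.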

> @MnCSSJ4x yes, feel free to open a PR so that we can discuss it there. And yes, in that case we need to add the test file and ProcessorTesterMixin to it, so that new changes are all tested

Thanks, I have created PR #32377 and tagged you there. Please let me know how I can get started with testing it.

MnCSSJ4x avatar Aug 01 '24 13:08 MnCSSJ4x

I can work on the remaining image-text-to-text models (Fuyu, Idefics/2, InstructBlip, Kosmos-2, LLaVa-NeXT) as I have already been working on their processors for https://github.com/huggingface/transformers/pull/32471 .

yonigozlan avatar Aug 07 '24 02:08 yonigozlan

@yonigozlan Thanks, that would be great!

zucchini-nlp avatar Aug 07 '24 04:08 zucchini-nlp

Also added Udop to the list and to #32544 .

yonigozlan avatar Aug 09 '24 23:08 yonigozlan

Hey, I can take ClipSeg.

Nech-C avatar Aug 13 '24 18:08 Nech-C

Hi @davidgxue and @MnCSSJ4x, I was wondering if you've made any recent progress on uniformizing LLaVa and Paligemma. They're some of the last image-text-to-text models left to uniformize, and it would be great to get them done to unblock the image-text-to-text pipeline. If not, I can help get them finished, as they should be similar to other image-text-to-text models that are being uniformized. Thanks!

yonigozlan avatar Aug 14 '24 13:08 yonigozlan

> Hi @davidgxue and @MnCSSJ4x, I was wondering if you've made any recent progress on uniformizing LLaVa and Paligemma. They're some of the last image-text-to-text models left to uniformize, and it would be great to get them done to unblock the image-text-to-text pipeline. If not, I can help get them finished, as they should be similar to other image-text-to-text models that are being uniformized. Thanks!

Hey. I had written a draft PR for Paligemma, but I was blocked by some coursework, so I wasn't able to resolve the comments or write the test suite. You can start LLaVa if no one else has taken it. Also, feel free to ping me in that PR, as I also need some help writing the tests and understanding the reference codebase.

MnCSSJ4x avatar Aug 14 '24 13:08 MnCSSJ4x

@yonigozlan I am so sorry about the delay. I made some progress locally a while ago, but then completely forgot about this github issue. Will try to get it out today or tomorrow. Will keep you posted if anything changes.

davidgxue avatar Aug 14 '24 14:08 davidgxue

Hey @yonigozlan, actually, feel free to work on LLaVA. I realized I previously made progress on CLIP, not LLaVA. And I'm also getting a lot of work lately, so I probably won't be able to help out on that one. Sorry about that! CLIP will come around though!

davidgxue avatar Aug 16 '24 04:08 davidgxue

Since nobody has claimed them yet, dibs on Nougat and SigLip

They should now be ready for review, and they already have backwards-compatibility support

leloykun avatar Aug 16 '24 08:08 leloykun

Btw, DepthAnything, DINO, MaskFormer, and ViT don't have processors

leloykun avatar Aug 16 '24 08:08 leloykun

@zucchini-nlp here's for the rest of the processors: https://github.com/huggingface/transformers/pull/32845

leloykun avatar Aug 16 '24 09:08 leloykun

@leloykun WOW, thanks a lot! I can review those on Monday, today will be a bit busy

zucchini-nlp avatar Aug 16 '24 09:08 zucchini-nlp

Thanks too!

For now, the PR for Chameleon, https://github.com/huggingface/transformers/pull/32181, is the safest to merge as (1) the processor doesn't expect special args (e.g. text_pair and such) and (2) the PR also already has tests

The PRs for Nougat https://github.com/huggingface/transformers/pull/32841 and the LayoutLM models https://github.com/huggingface/transformers/pull/32180 need more thought as their processors expect special args (cc @yonigozlan, I think @molbap is right that our current implementation is kinda wonky)

while the other PRs don't have tests yet

leloykun avatar Aug 16 '24 11:08 leloykun

No problem @davidgxue! I will get started on LLaVa then

yonigozlan avatar Aug 16 '24 14:08 yonigozlan

Summary of my progress:

| Model | Status | Has Tests? | Special Args | PR |
| --- | --- | --- | --- | --- |
| Chameleon | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32181 |
| AltCLIP | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| Flava | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| Git | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| InstructBlipVideo | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| LLaVa-NeXT-Video | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| MGP | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| Siglip | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| TVP | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| VideoLLaVa | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| VILT | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| X-CLIP | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32845 |
| LayoutLMv2 | Ready | Yes | ~~text_pair~~, boxes, word_labels | https://github.com/huggingface/transformers/pull/32180 |
| LayoutLMv3 | Ready | Yes | ~~text_pair~~, boxes, word_labels | https://github.com/huggingface/transformers/pull/32180 |
| LayoutXLM | Ready | Yes | ~~text_pair~~, boxes, word_labels | https://github.com/huggingface/transformers/pull/32180 |
| ClipSeg | Ready | Yes | visual_prompt | https://github.com/huggingface/transformers/pull/32841 |
| Nougat | Ready | Yes | ~~text_pair, text_target, text_pair_target~~ (apparently these are in the tokenizer base class) | https://github.com/huggingface/transformers/pull/32841 |
| OwlV2 | Ready | Yes | query_images | https://github.com/huggingface/transformers/pull/32841 |
| OwlVIT | Ready | Yes | query_images | https://github.com/huggingface/transformers/pull/32841 |
| Clap | Not Ready | No | ?? | https://github.com/huggingface/transformers/pull/32906 |
| CLVP | Not Ready | No | ?? | https://github.com/huggingface/transformers/pull/32906 |
| MusicGen Melody | Not Ready | No | ?? | https://github.com/huggingface/transformers/pull/32906 |
| PopPiano | Not Ready | No | ?? | https://github.com/huggingface/transformers/pull/32906 |
| Qwen2 Audio | Not Ready | No | ?? | https://github.com/huggingface/transformers/pull/32906 |
| Seamless M4T | Not Ready | No | ?? | https://github.com/huggingface/transformers/pull/32906 |
| SpeechT5 | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32906 |
| Wav2Vec2 Bert | Ready | Yes | - | https://github.com/huggingface/transformers/pull/32906 |

~~https://github.com/huggingface/transformers/pull/32181 & #32845 are now ready for review. I could also decouple InstructBlipVideo, LLaVa-NeXT-Video, Siglip, and VideoLLaVa to a separate PR just to get them out--just lemme know if there's a need to rush.~~

~~The rest have special args that we still have to figure out how to handle. cc @yonigozlan~~

Update: all of the PRs are now ready for review

leloykun avatar Aug 16 '24 16:08 leloykun