
Add UDOP

Open NielsRogge opened this issue 1 year ago • 12 comments

What does this PR do?

This PR adds UDOP as described in Unifying Vision, Text, and Layout for Universal Document Processing.

The model can be seen as an encoder-decoder Transformer, with LayoutLMv3 as the encoder and a T5 text decoder.

Fixes #20650

To do:

  • [ ] fix tests/models/udop/test_processor_udop.py::UdopProcessorTest::test_save_load_pretrained_default
  • [x] include pytesseract decodings in processor test
  • [ ] check forward signature of the model as we can't change this afterwards
  • [ ] update organization to microsoft
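A rough sketch of the intended usage (the repo id below is a placeholder until the organization is updated, and the exact processor/model signatures may still change per the to-do list above):

    from transformers import UdopProcessor, UdopForConditionalGeneration
    from PIL import Image

    # placeholder repo id -- the final weights will live under the microsoft organization
    repo_id = "nielsr/udop-large"
    processor = UdopProcessor.from_pretrained(repo_id)
    model = UdopForConditionalGeneration.from_pretrained(repo_id)

    image = Image.open("document.png").convert("RGB")

    # the task is given as a text prompt; OCR is applied by the processor by default
    prompt = "Question answering. What is the date on this document?"
    encoding = processor(images=image, text=prompt, return_tensors="pt")

    outputs = model.generate(**encoding, max_new_tokens=20)
    print(processor.batch_decode(outputs, skip_special_tokens=True)[0])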

NielsRogge avatar Apr 22 '23 15:04 NielsRogge

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Hi @NielsRogge, thank you for pushing this PR. I haven't had the chance to try it yet, but I'm curious if you have an example, or have tried to perform a torch.jit.trace or ONNX conversion on UDOP yet? I know that was where I got stuck with the previous PR.

plamb-viso avatar May 01 '23 18:05 plamb-viso

@plamb-viso My impression was always that tracing encoder-decoder models (e.g. BART) works fine, but that exporting to ONNX via jit.trace is challenging. There's a research example for BART on how to do that: Bart + Beam Search to ONNX
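For instance, tracing a plain seq2seq model is usually unproblematic as long as the decoder inputs are passed explicitly. A quick sketch with BART (nothing UDOP-specific here):

    import torch
    from transformers import BartForConditionalGeneration, BartTokenizer

    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
    # torchscript=True makes the model return tuples, which jit.trace needs
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-base", torchscript=True)
    model.eval()

    inputs = tokenizer("Hello world", return_tensors="pt")
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])

    traced = torch.jit.trace(
        model, (inputs["input_ids"], inputs["attention_mask"], decoder_input_ids)
    )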

I think this is part of the reason the ONNX export is now offloaded to optimum: https://github.com/huggingface/transformers/issues/14222#issuecomment-1432960827
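For an architecture that optimum already supports, the export is essentially a one-liner. A sketch with t5-small as a stand-in (UDOP itself would first need to be added there):

    from optimum.onnxruntime import ORTModelForSeq2SeqLM

    # exports the PyTorch checkpoint to ONNX on the fly
    ort_model = ORTModelForSeq2SeqLM.from_pretrained("t5-small", export=True)
    ort_model.save_pretrained("t5_onnx/")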

dtiarks avatar May 03 '23 09:05 dtiarks

Just want to make sure that with the UdopProcessor we need to manually add the task to each input string. For example, if I'm doing document classification, I need to add 'document classification.' and [0,0,0,0] to my words and bboxes before they go through the processor.

Like so:

        prompt_text = ['document', 'classification.']
        prompt_boxes = [[0,0,0,0],[0,0,0,0]]
        processor.tokenizer(text=prompt_text, boxes=prompt_boxes)

And then prepend these input_ids/boxes to the input_ids/boxes that come out of the processor (see the sketch below).

(Note that I am using apply_ocr=False.)
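In other words, the word-level version of what I mean looks roughly like this (just a sketch of my own approach, not something this PR prescribes; the checkpoint name is hypothetical, and I'm assuming the processor accepts pre-OCR'd words/boxes the way the LayoutLMv3 one does):

    from transformers import UdopProcessor
    from PIL import Image

    # hypothetical checkpoint name, just for illustration
    processor = UdopProcessor.from_pretrained("nielsr/udop-large", apply_ocr=False)

    image = Image.open("document.png").convert("RGB")
    words = ["Invoice", "Total", "$12.00"]                   # my own OCR output
    boxes = [[10, 10, 60, 20], [10, 30, 50, 40], [55, 30, 90, 40]]

    # manually prepend the task prompt and dummy boxes before encoding
    prompt_words = ["document", "classification."]
    prompt_boxes = [[0, 0, 0, 0], [0, 0, 0, 0]]

    encoding = processor(
        images=image,
        text=prompt_words + words,
        boxes=prompt_boxes + boxes,
        return_tensors="pt",
    )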

plamb-viso avatar May 08 '23 13:05 plamb-viso

Also curious how we should encode the label of a training example. Is it a part of the inputs to UdopProcessor?

The I-Code example appears to do it like this
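For what it's worth, my naive guess would be the standard T5-style target encoding, roughly like this (not taken from the I-Code example or this PR; the dummy boxes for the target words are purely my assumption, and processor/encoding are as in my previous comment):

    import torch

    # encode the target text (the document class as free text) as label ids
    target_words = ["scientific", "report"]
    target_boxes = [[0, 0, 0, 0]] * len(target_words)
    label_ids = processor.tokenizer(text=target_words, boxes=target_boxes).input_ids

    encoding["labels"] = torch.tensor([label_ids])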

plamb-viso avatar May 08 '23 15:05 plamb-viso

Thanks @dtiarks, it looks like a key component of that script is the BartBeamSearchGenerator, which allows you to convert it to TorchScript. Will UDOP have something like this?

I tried some of the naive steps from this comment for tracing this new UDOP PR. It looks like the same issues remain. Curious if we'll get a test/example of tracing/compiling/ONNX-exporting the model, either here or in optimum?

EDIT: just a naive try at ONNX export in optimum gives: KeyError: "udop is not supported yet."

And just for completeness, a torch.onnx.export gives:

RuntimeError: 0 INTERNAL ASSERT FAILED at "/Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/jit/ir/alias_analysis.cpp":621, please report a bug to PyTorch. We don't have an op for aten::full_like but it isn't a special case.  Argument types: Tensor, bool, int, int, Device, bool, NoneType,

Candidates:
	aten::full_like(Tensor self, Scalar fill_value, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None, MemoryFormat? memory_format=None) -> Tensor
	aten::full_like.out(Tensor self, Scalar fill_value, *, MemoryFormat? memory_format=None, Tensor(a!) out) -> Tensor(a!)

plamb-viso avatar May 12 '23 15:05 plamb-viso

@plamb-viso Here is the guide to add ONNX export support for a new architecture in Optimum: https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/contribute Feel free to open a PR there and we'll help you if you encounter any issue :slightly_smiling_face:
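The core of it is an OnnxConfig subclass describing the model inputs and their dynamic axes, roughly like this (the base class choice and the UDOP input names/axes below are placeholders to illustrate the pattern, not a working config):

    from optimum.exporters.onnx.config import TextSeq2SeqOnnxConfig
    from optimum.utils import NormalizedSeq2SeqConfig


    class UdopOnnxConfig(TextSeq2SeqOnnxConfig):
        NORMALIZED_CONFIG_CLASS = NormalizedSeq2SeqConfig
        DEFAULT_ONNX_OPSET = 13

        @property
        def inputs(self):
            # placeholder dynamic axes -- the real set of UDOP inputs
            # (bbox, pixel_values, ...) would need to be worked out
            return {
                "input_ids": {0: "batch_size", 1: "sequence_length"},
                "attention_mask": {0: "batch_size", 1: "sequence_length"},
                "bbox": {0: "batch_size", 1: "sequence_length"},
                "pixel_values": {0: "batch_size"},
            }

The guide also explains how to register the new config with the TasksManager so the exporter can pick it up.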

regisss avatar May 12 '23 16:05 regisss

Highly anticipating this release! :) Keep up the great work

Jordy-VL avatar Jun 06 '23 13:06 Jordy-VL

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jun 30 '23 15:06 github-actions[bot]


Definitely still highly interested in this work

plamb-viso avatar Jun 30 '23 15:06 plamb-viso

@ArthurZucker does https://github.com/huggingface/transformers/pull/24565 fix the remaining issues of this PR?

dtiarks avatar Jun 30 '23 17:06 dtiarks

Not sure it does, no! The added tokens were the issue, if I remember correctly.

ArthurZucker avatar Jul 01 '23 07:07 ArthurZucker

Ok. The question is how we can move this PR forward. @plamb-viso, @Jordy-VL, I (and probably others) are definitely still interested in this.

@NielsRogge are you aware of other issues blocking this PR or do you have other priorities at the moment?

dtiarks avatar Jul 02 '23 11:07 dtiarks

My current priority is #24629, then it will be the tokenizer PR, which seems to be the last blocking factor. In the meantime, I think it would be good to get all the tests green and ask for a review to make it ready for a final one! The tokenizer can be updated afterwards 🤗 Sorry for the wait 😓

ArthurZucker avatar Jul 03 '23 06:07 ArthurZucker

No worries @ArthurZucker ☺️. My comment was not meant to push anyone. I was just interested if I could contribute to speed up the process.

dtiarks avatar Jul 03 '23 08:07 dtiarks

@ArthurZucker the tokenizer is the only thing left to make all tests green. The PR is ready other than that. The only remaining issue is the sentinel tokens that the UDOP authors defined (T5 has 100 of them, UDOP a lot more). Those are actually only relevant during pre-training, not during fine-tuning, hence the model is already perfectly usable.

I can only assign core maintainers for review when the CI is more or less green, so will do that once the tokenizer issue is fixed.

NielsRogge avatar Jul 03 '23 13:07 NielsRogge

Hi @NielsRogge, are you planning to do one of your wonderful notebook tutorials once this PR is closed? I'm rather curious about how we can approach a token-classification task with an encoder-decoder architecture such as UDOP :)

AleRosae avatar Jul 04 '23 10:07 AleRosae


You can already check pix2struct ;)

Jordy-VL avatar Jul 04 '23 10:07 Jordy-VL

Ok! Let me have a second look at the tokenizer then! There are quite a few issues with spm and AddedToken that are currently being taken care of!

ArthurZucker avatar Jul 05 '23 02:07 ArthurZucker

You have to manually add the tokens, and that can't be done in the init with the current API, but this allows us to remove the crazy regex in encoding.

ArthurZucker avatar Jul 05 '23 06:07 ArthurZucker

Eagerly anticipating this PR being merged. Is there any information on priority of this work and rough timelines? Thank you @ArthurZucker and @NielsRogge for your great work.

sromoam avatar Jul 17 '23 16:07 sromoam

Regarding the priority, I'm not really sure. I won't really have time to dive deep into this for a few weeks. If a contributor wants to work on this, feel free to take over!

ArthurZucker avatar Jul 25 '23 13:07 ArthurZucker

Update: we're down to 2 failing tests:

FAILED tests/models/udop/test_processor_udop.py::UdopProcessorTest::test_save_load_pretrained_default - AssertionError: {'▁backing': 16057, '▁Brunswick': 29980, 'S[629176 chars]7501} != {'<pad>': 0, '</s>': 1, '<unk>': 2, '▁': 3,[624686 chars]4401}
FAILED tests/models/udop/test_tokenization_udop.py::UdopTokenizationTest::test_save_slow_from_fast_and_reload_fast - ValueError: Non-consecutive added token '('<extra_id_99>', 0.0)' found. Should have index 34602 but has index 33201 in saved vocabulary.

@ArthurZucker can you clarify how you pushed https://huggingface.co/ArthurZ/udop?

NielsRogge avatar Jul 31 '23 12:07 NielsRogge

Eagerly anticipating this PR being merged. Hope there will be a pre-training demo too.

CheungZeeCn avatar Aug 08 '23 08:08 CheungZeeCn

Will have a look and try to re-upload a working tokenizer!

ArthurZucker avatar Aug 08 '23 09:08 ArthurZucker

Eagerly anticipating this PR being merged. Thanks very much for the great work!

qianmingjie avatar Aug 08 '23 18:08 qianmingjie

How I added the tokenizer (after removing the regex-based convert-token-to-id logic):

>>> from transformers import UdopTokenizer
>>> tokenizer = UdopTokenizer("ArthurZ/udop/spiece.model")
>>> tokenizer.add_tokens(tokenizer.additional_special_tokens)

This currently gives wrong-index issues, so I'm trying to fix that now!

Nothing really works as expected; if we can just wait for #23909 (end of week max ETA), this will be easy!

ArthurZucker avatar Aug 09 '23 09:08 ArthurZucker

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Sep 03 '23 08:09 github-actions[bot]

Definitely still interested!

plamb-viso avatar Sep 03 '23 13:09 plamb-viso

+1

CheungZeeCn avatar Sep 03 '23 16:09 CheungZeeCn