Add UDOP
What does this PR do?
This PR adds UDOP as described in Unifying Vision, Text, and Layout for Universal Document Processing.
The model can be seen as an encoder-decoder Transformer with LayoutLMv3 as encoder and a T5 text decoder.
Fixes #20650
To do:
- [ ] fix
tests/models/udop/test_processor_udop.py::UdopProcessorTest::test_save_load_pretrained_default
- [x] include pytesseract decodings in processor test
- [ ] check forward signature of the model as we can't change this afterwards
- [ ] update organization to
microsoft
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Hi @NielsRogge, thank you for pushing this PR. I haven't had the chance to try yet, but I'm curious if you have an example or have tried to perform a `torch.jit.trace` or ONNX conversion on UDOP yet? I know with the previous PR that was where I got stuck.
@plamb-viso My impression was always that tracing Encoder-Decoder models (e.g. BART) works fine but exporting to ONNX is challenging using jit.trace. There's a research example for BART on how to do that: Bart + Beam Search to ONNX
I think this is part of the reason the ONNX export is now offloaded into optimum: https://github.com/huggingface/transformers/issues/14222#issuecomment-1432960827
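To illustrate the "tracing generally works, ONNX export is the hard part" point, here is a toy sketch (not UDOP, just a stand-in encoder-decoder module; assumes `torch` is installed):

```python
import torch
import torch.nn as nn


class ToyEncoderDecoder(nn.Module):
    """A minimal encoder-decoder stand-in with no data-dependent
    control flow, so torch.jit.trace can capture it faithfully."""

    def __init__(self, d=8):
        super().__init__()
        self.encoder = nn.Linear(d, d)
        self.decoder = nn.Linear(d, d)

    def forward(self, src, tgt):
        memory = torch.relu(self.encoder(src))
        return self.decoder(tgt + memory)


model = ToyEncoderDecoder().eval()
src = torch.randn(1, 8)
tgt = torch.randn(1, 8)

# Tracing records one concrete forward pass; beam search and other
# data-dependent generation loops are what make real ONNX export hard.
traced = torch.jit.trace(model, (src, tgt))
```

Real seq2seq models hit trouble at the generation loop, which is exactly what the BART + Beam Search to ONNX research example works around.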
Just want to make sure that with the `UdopProcessor` we need to manually add the task to each input string. For example, if I'm doing document classification, I need to add `document classification.` and `[0,0,0,0]` to my words and bboxes before they go through the processor.
For example:
prompt_text = ['document', 'classification.']
prompt_boxes = [[0,0,0,0],[0,0,0,0]]
processor.tokenizer(text=prompt_text, boxes=prompt_boxes)
And prepend these input_ids/boxes to the input_ids/boxes that come out of the processor
(Note that I am using `apply_ocr=False`.)
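A minimal sketch of that prepending step (the helper name and the concrete id values below are made up for illustration; the real ids would come out of `processor.tokenizer`):

```python
# Hypothetical sketch: prepend task-prompt token ids and dummy boxes to
# the document encoding, as described above. The id values are invented.

def prepend_prompt(prompt_ids, prompt_boxes, doc_ids, doc_boxes):
    """Return the document encoding with the prompt encoding prepended."""
    return prompt_ids + doc_ids, prompt_boxes + doc_boxes


# e.g. ids for ['document', 'classification.'] with [0,0,0,0] boxes
prompt_ids = [101, 102]
prompt_boxes = [[0, 0, 0, 0], [0, 0, 0, 0]]

# e.g. ids/boxes for the actual OCR'd words of the page
doc_ids = [2001, 2002, 2003]
doc_boxes = [[10, 10, 50, 20], [60, 10, 90, 20], [10, 30, 40, 40]]

input_ids, bbox = prepend_prompt(prompt_ids, prompt_boxes, doc_ids, doc_boxes)
```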
Also curious how we should encode the label of a training example. Is it part of the inputs to `UdopProcessor`?
The I-Code example appears to do it like this
Thanks @dtiarks, looks like a key component of that script is the `BartBeamSearchGenerator`, which allows you to convert the model to TorchScript. Will UDOP have something like this?
I tried some of the naive steps I tried in this comment for tracing this new UDOP PR. Looks like the same issues remain. Curious if we'll get a test/example of tracing/compiling/onnx exporting the model either here or in optimum?
EDIT: just a naive try at ONNX export in optimum gives:
KeyError: "udop is not supported yet."
And just for completeness, a `torch.onnx.export` gives:
RuntimeError: 0 INTERNAL ASSERT FAILED at "/Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/jit/ir/alias_analysis.cpp":621, please report a bug to PyTorch. We don't have an op for aten::full_like but it isn't a special case. Argument types: Tensor, bool, int, int, Device, bool, NoneType,
Candidates:
aten::full_like(Tensor self, Scalar fill_value, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None, MemoryFormat? memory_format=None) -> Tensor
aten::full_like.out(Tensor self, Scalar fill_value, *, MemoryFormat? memory_format=None, Tensor(a!) out) -> Tensor(a!)
@plamb-viso Here is the guide to add ONNX export support for a new architecture in Optimum: https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/contribute Feel free to open a PR there and we'll help you if you encounter any issue :slightly_smiling_face:
Highly anticipating this release! :) Keep up the great work
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Definitely still highly interested in this work
@ArthurZucker does https://github.com/huggingface/transformers/pull/24565 fix the remaining issues of this PR?
Not sure it does, no! The added tokens were the issue if I remember correctly.
Ok. The question is: how can we move this PR forward? @plamb-viso, @Jordy-VL, I (and probably others) are still definitely interested in this.
@NielsRogge are you aware of other issues blocking this PR or do you have other priorities at the moment?
My current priority is #24629, then it will be the tokenizer PR, which seems to be the last blocking factor. In the meantime I think it should be good to get all the tests green and ask for a review to make it ready for a final one! The tokenizer can be updated afterwards 🤗 sorry for the wait 😓
No worries @ArthurZucker ☺️. My comment was not meant to push anyone. I was just interested if I could contribute to speed up the process.
@ArthurZucker the tokenizer is the only thing left to make all tests green; the PR is ready other than that. The only remaining issue is the sentinel tokens that the UDOP author defined (T5 has 100 of them, UDOP a lot more). Those are actually only relevant during pre-training, not during fine-tuning, hence the model is already perfectly usable.
I can only assign core maintainers for review when the CI is more or less green, so will do that once the tokenizer issue is fixed.
Hi @NielsRogge, are you planning to do one of your wonderful notebook tutorials once this PR is closed? I'm rather curious about how we can approach a token-classification task with an encoder-decoder architecture such as UDOP :)
You can already check pix2struct ;)
Ok! Let me have a second look at the tokenizer then! There are quite a few issues currently with `spm` and `AddedToken` being taken care of!
You have to manually add the tokens, and that can't be done in the init with the current API, but this allows us to remove the crazy regex in encoding.
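For context on what "manually add the tokens" could look like: a plain-Python sketch of building the T5-style sentinel list (the `<extra_id_N>` naming follows T5's convention; the reverse ordering mirrors how T5 assigns vocab ids to its extras, and applying the same scheme to UDOP is my assumption here):

```python
# Sketch: build the T5-style sentinel tokens that would then be passed
# to tokenizer.add_tokens(...). T5 defines 100 of them, named
# <extra_id_0> ... <extra_id_99>; UDOP defines many more.

def sentinel_tokens(n=100):
    # Reverse order so that <extra_id_0> ends up with the highest vocab
    # id, matching T5's convention for its extra ids (assumption for UDOP).
    return [f"<extra_id_{i}>" for i in range(n - 1, -1, -1)]


tokens = sentinel_tokens()
```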
Eagerly anticipating this PR being merged. Is there any information on priority of this work and rough timelines? Thank you @ArthurZucker and @NielsRogge for your great work.
Regarding the priority, not really sure. I won't really have time to dive deep in this before a few weeks. If a contributor wants to work on this feel free to take over!
Update: we're down to 2 failing tests:
FAILED tests/models/udop/test_processor_udop.py::UdopProcessorTest::test_save_load_pretrained_default - AssertionError: {'▁backing': 16057, '▁Brunswick': 29980, 'S[629176 chars]7501} != {'<pad>': 0, '</s>': 1, '<unk>': 2, '▁': 3,[624686 chars]4401}
FAILED tests/models/udop/test_tokenization_udop.py::UdopTokenizationTest::test_save_slow_from_fast_and_reload_fast - ValueError: Non-consecutive added token '('<extra_id_99>', 0.0)' found. Should have index 34602 but has index 33201 in saved vocabulary.
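That second failure is easier to see with a toy version of the consistency check (my own simplified reconstruction, not the actual `transformers` code): added tokens are expected to sit at consecutive indices immediately after the base vocabulary, so a gap or overlap trips a `ValueError` like the one above.

```python
# Simplified reconstruction of the "non-consecutive added token" check.

def check_added_tokens(base_vocab_size, added):
    """added: dict of token -> index, as stored in a saved vocabulary."""
    ordered = sorted(added.items(), key=lambda kv: kv[1])
    for offset, (token, index) in enumerate(ordered):
        expected = base_vocab_size + offset
        if index != expected:
            raise ValueError(
                f"Non-consecutive added token {token!r} found. "
                f"Should have index {expected} but has index {index}."
            )


# Consecutive indices right after the base vocab: passes silently.
check_added_tokens(10, {"<extra_id_1>": 10, "<extra_id_0>": 11})
```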
@ArthurZucker can you clarify how you pushed https://huggingface.co/ArthurZ/udop?
Eagerly anticipating this PR being merged. Hope there will be a pretrain demo too.
Will have a look and try to re-upload a working tokenizer!
Eagerly anticipating this PR being merged. Thanks very much for the great work!
How I added the tokenizer (after removing the regex-based convert-token-to-id logic):
>>> from transformers import UdopTokenizer
>>> tokenizer = UdopTokenizer("ArthurZ/udop/spiece.model")
>>> tokenizer.add_tokens(tokenizer.additional_special_tokens)
This currently gives wrong-index issues, so trying to fix that now!
Nothing really works as expected, if we can just wait for #23909 (end of week max ETA) this will be easy!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Definitely still interested!
+1