[WIP] Extend VisualQuestionAnsweringPipeline to support QuestionAnswering models (e.g. LayoutLM)
What does this PR do?
This PR extends VisualQuestionAnsweringPipeline to accept words and boxes as input, pass them into the tokenizer/model (along with the question), and post-process the resulting QuestionAnsweringModelOutput.
Fixes #18380
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
@Narsil
@Narsil this is basically a skeleton implementation that I thought I'd send out sooner rather than later to start getting your input.
I've left a few questions throughout tagged with "TODO" in the comments. The big question is how much/whether to reuse the code in QuestionAnsweringPipeline, which has a lot of overlap (notably preparing the spans and post-processing the output). For example, I could refactor out methods like QuestionAnsweringPipeline.decode to share the implementation, inherit from QuestionAnsweringPipeline, etc.
The documentation is not available anymore as the PR was closed or merged.
@Narsil thank you for the review! Before I go in and apply the comments, I thought it might be worth discussing the required (or not) image argument at a high level.
The reason (I think) it's important to allow users to pass in words/boxes instead of an image is that users often either want to run their own OCR (e.g. using a proprietary system like Google/Microsoft) OR are extracting data from documents that have embedded text (e.g. many PDF documents, Word, Excel, etc.). Furthermore, there are a lot of pre-processing tricks that are relevant to certain OCR implementations (e.g. some try to order words by line, others by block, etc.) that have a very significant impact on BERT-inspired models like LayoutLM (because of the attention mechanism, position ids, etc.). tl;dr, having some control over words/boxes is very important if you're trying to use the pipeline in a production scenario.
Now, you could argue that if they want to do this, they could use the question answering pipeline. In fact, when I started exploring HuggingFace/transformers, I did just that! The problem is that if you join everything together (into context), you actually lose some valuable information about how the words are separated in the document (including the distance between them). In other words -- it's very important that you retain information about which words correspond to which coordinates.
I could also see an argument that if a user wants this level of control, they shouldn't use a pipeline in the first place, but the implementation of QA preprocessing and postprocessing are really compelling -- which kind of drew us to really wanting to take advantage of them vs. try to reinvent them elsewhere.
Hopefully that makes sense and adds some context for why I proposed making image optional. I'm very open to alternate solutions too, but just wanted to clarify the use case a bit. For example, another option could be to add a new pipeline called DocumentQuestionAnswering (or similar) that handles inputs of this shape. Let me know your thoughts.
@Narsil thank you for the review! Before I go in and apply the comments, I thought it might be worth discussing the required (or not) image argument at a high level.
Yes very much so ! I think it's great to have a clear conversation about this.
I will try and convey this part of the library's perspective, but having your view is great too since I probably know less about OCRs and overall document processing than you.
The reason (I think) it's important to allow users to pass in words/boxes instead of an image is that users often either want to run their own OCR (e.g. using a proprietary system like Google/Microsoft) OR are extracting data from documents that have embedded text (e.g. many PDF documents, Word, Excel, etc.). Furthermore, there are a lot of pre-processing tricks that are relevant to certain OCR implementations (e.g. some try to order words by line, others by block, etc.) that have a very significant impact on BERT-inspired models like LayoutLM (because of the attention mechanism, position ids, etc.). tl;dr, having some control over words/boxes is very important if you're trying to use the pipeline in a production scenario.
This is very interesting to know ! I was under the impression that using an OCR could be streamlined much more.
The fact that the OCR has a big impact on the quality of the results doesn't surprise me (and is_split_into_words might play a non-negligible role here).
Now, you could argue that if they want to do this, they could use the question answering pipeline. In fact, when I started exploring HuggingFace/transformers, I did just that! The problem is that if you join everything together (into
context), you actually lose some valuable information about how the words are separated in the document (including the distance between them). In other words -- it's very important that you retain information about which words correspond to which coordinates.
Yes I felt the same thing when reading your code and pondering whether it should actually belong in QA instead of VQA. I think you are right, the image information is too valuable to be thrown away.
I could also see an argument that if a user wants this level of control, they shouldn't use a pipeline in the first place, but the implementation of QA preprocessing and postprocessing are really compelling -- which kind of drew us to really wanting to take advantage of them vs. try to reinvent them elsewhere.
Very reasonable ! :)
Hopefully that makes sense and adds some context for why I proposed making image optional. I'm very open to alternate solutions too, but just wanted to clarify the use case a bit. Let me know your thoughts.
Ok, I am going to recap the main goal of the pipeline:
A pipeline is a tool to make ML accessible to non ML practitioners.
That's the first goal, and in doing that, we don't want to hide any ML details that might hurt users unknowingly (like chunking things that can hurt output quality without being opt-in). So hide as many things as possible when the defaults are correct, but don't hide any magic that could be use-case dependent. For instance, truncating without asking for explicit user consent (via a parameter) means the user will try to send large chunks of text and get an output that corresponds only to a tiny chunk of it, without realizing it.
Another secondary goal is to make them as reusable/extendable as possible, but only when it doesn't contradict any of the previous goals.
With that in mind, you see why having inputs/outputs that depend on the actual model type forces non-ML practitioners to know about model types, when the goal is to lift that burden. If we can ensure that sending the same input will produce the same output, it means users can jump very easily between models. So when AwesomeModelA comes out, you can just swap its name and make it work. Same goes for iterations/fine-tuning of the same model or different models and so on.
Here I think I can see two solutions:
1/ We create a new pipeline (DocumentQuestionAnsweringPipeline ?). The set of I/O is different so we should have different pipelines for these. For this pipeline it seems the input is boxes + words (which I would personally call texts, as OCRs probably extract full strings and don't necessarily reason about words). It's easy, but puts all the burden of the OCR on the user upfront. (If OCR choice is super tricky and we cannot realistically make that choice in a general fashion for users, it's probably the way to go.)
2/ We keep using VisualQuestionAnswering but we enable a very easy way to use a custom OCR:
- Most users will trigger an initial error that pytesseract (or something else) is not present and get suggested to install it to get an easy impression about results (mention all the caveats/link to some docs on how to choose the OCR for advanced users).
- When those sane defaults are present, the pipelines will use those.
- For experienced users that know how deeply OCR can impact the results, we can enable easy overriding like:
class MyOCR:
    def forward(self, image):
        ...
        return texts, boxes

pipe = pipeline("mymodel-id", ocr=MyOCR())
What do you think ? Which solution makes most sense from your perspective ?
Also regardless of choice here, we can extract whatever makes sense as an individual function within qa so you can reuse it, in a pipeline or anywhere else.
For Option 1, to clarify, would you be open to allowing the user to pass in (optional) words and boxes? I think this is conceptually similar to your point about audio pipelines using ffmpeg but I may be misunderstanding something. Essentially, we'd run OCR on the image if the words/boxes are not passed in. And either way, depending on the model, pass the image into the model as well. If we made the words/boxes an optional input, then users could basically assert control where they'd like to, but the simple use cases will just work out of the box.
Personally, I think option 1 is the way to go. I can at least sketch out the code as a next step and we can take a look and reevaluate.
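To make the "run your own OCR" case concrete, here's a rough sketch (assuming pytesseract is installed; variable names are just illustrative) of how a user might pre-compute words and boxes before ever calling the pipeline:

```python
# Rough sketch, not part of this PR: pre-run OCR once and reuse the words/boxes across
# pipeline calls. pytesseract returns absolute pixel coordinates for each recognized word.
import pytesseract
from PIL import Image

image = Image.open("id_card.jpg").convert("RGB")
ocr = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

words, boxes = [], []
for text, left, top, width, height in zip(
    ocr["text"], ocr["left"], ocr["top"], ocr["width"], ocr["height"]
):
    if text.strip():  # pytesseract emits empty strings for non-text cells
        words.append(text)
        boxes.append((left, top, left + width, top + height))
```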
would you be open to allowing the user to pass in (optional) words and boxes
I have the same uneasiness with any optional inputs. Either the pipeline needs the data or it doesn't. IMO the incoming data should be as strongly typed as possible, and definitely the computation should not depend on what the user actually sent (because then it becomes really hard to reason about what actually happened on a piece of data, which OCR was used ? Were the boxes correct ? etc...).
I feel like I am missing a piece of the puzzle here, so maybe we can do the other way around, let's try to devise what we would actually like to write as a user for this document processing.
IMO the simplest is something like:
pipe = pipeline(task="visual-question-answering", model="layoutlmv3-xxx")
out = pipe(image=Image.load("id_card.jpg"), question="What is this person's address ?")
# out == [{answer: "24 nerdy street", score:0.8}, {"answer": "John travolta", "score": 0.1}]
Or maybe be a little more strict:
pipe = pipeline(task="visual-question-answering", model="layoutlmv3-xxx")
# ValueError : This model is a document processing model, and requires an OCR to be able to use this pipeline,
# please pass an OCR. For demos, you can use `from transformers.pipelines import DefaultOCR`
pipe = pipeline(task="visual-question-answering", model="layoutlmv3-xxx", ocr=DefaultOCR())
out = pipe(image=Image.load("id_card.jpg"), question="What is this person's address ?")
# out == [{answer: "24 nerdy street", score:0.8}, {"answer": "John travolta", "score": 0.1}, ...]
Ahh, okay, yes I agree that working from these examples is really helpful. Let me first be precise about what is required vs. not:
- In all LayoutLM models, words and bounding boxes are technically required. The model itself requires them to be formatted a certain way (e.g. box coordinates are axis-aligned and normalized between 0->1000 -- see the sketch after this list), but it does not impose where they came from. The inspiration is something like "BERT + bounding boxes".
- In LayoutLMv2 and v3, the models additionally accept an image (normalized to 224x224) as input. Theoretically, the model is able to use information from the image alongside the encoded words and boxes. Notably, in LayoutLMv1, you do not need to provide the image. And furthermore, you can fine-tune v2 and v3 for many use cases without the additional image and achieve similar or in some cases better results.
- The LayoutLMv2 and LayoutLMv3 processor classes in transformers optionally accept an apply_ocr argument. If set to True, while doing feature extraction from the image, they'll also use the tesseract library to run OCR and return the words/boxes back to the caller, so you can pass them into the model. There is some tricky control flow throughout these classes that branches based on whether the user provides their own OCR or not.
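To illustrate the normalization mentioned in the first bullet, here's a minimal sketch (mine, not the library's implementation) of how boxes end up in the 0->1000 range:

```python
# Hedged sketch: LayoutLM-style models expect axis-aligned boxes whose coordinates are
# scaled into the 0-1000 range relative to the page dimensions, regardless of which OCR
# (or PDF text extraction) produced them.
def normalize_box(box, page_width, page_height):
    x0, y0, x1, y1 = box  # absolute pixel coordinates: top-left and bottom-right corners
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]

print(normalize_box((50, 100, 150, 120), page_width=1000, page_height=800))
# -> [50, 125, 150, 150]
```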
I think part of why it's structured this way, or at least one of the advantages, is that in practice, since OCR can be costly (time and $$), many document processing practitioners will run OCR as a pre-processing step, so you can reuse its results across many invocations of extractions/questions/etc. E.g. imagine you were building an app that lets you point at a folder on your computer and then ask the files questions interactively. You'd probably implement this app by first running OCR on each file, and then re-using the OCR each time a user provides a new question as input.
I think with this in mind, there are probably a few different use cases that would be ideal to capture in the pipeline. I fully recognize that some of these may qualify as "more advanced" than the scope of a pipeline, so I'm open to and appreciate your push back on where that may be the case.
Scenario 1: Single file, single question
(your example above)
pipe = pipeline(task="visual-question-answering", model="layoutlmv3-xxx")
out = pipe(image=Image.load("id_card.jpg"), question="What is this person's address ?")
# out == [{answer: "24 nerdy street", score:0.8}, {"answer": "John travolta", "score": 0.1}]
Scenario 2: Interactive REPL
(this is an approximation of a real-world use case)
pipe = pipeline(task="visual-question-answering", model="layoutlmv3-xxx")
img = Image.load("id_card.jpg")
words, boxes = my_favorite_ocr(img)
while True:
    question = input("Ask a question of the image: ")
    print(pipe(image=img, question=question, words=words, boxes=boxes))
Scenario 3: Mixed Media Types
img = rasterize("my_tax_form.pdf")
words, boxes = text_extract("my_tax_form.pdf")
# NOTE: in certain models, e.g. LayoutLMv1, you do not even need to rasterize/pass in the image as input in this case
out = pipe(image=img, question="What is the person's income?", words=words, boxes=boxes)
# out == [{answer: "$10", score:0.8}, {"answer": "$1000", "score": 0.1}]
I can certainly imagine some alternatives:
- Words/boxes could be required inputs, and we could simply enforce that the user run OCR (or alternative) before using the pipeline. I think in this case, the image should be considered optional input, simply because certain document processing models take it as input, and others don't.
- Another would be to allow the user to provide a more advanced "OCR" input that could accept things like PDFs, spreadsheets, etc. and let it call out to OCR or use something else depending on the media type. I would say from experience, handling various document types is a can of worms and it prevents you from reusing pre-processed results across calls to the pipeline. (I believe this is your second suggestion).
- My original suggestion: words/boxes could be optional, and when not provided, we use a default OCR implementation. One more advantage of this approach is that it's consistent with the LayoutLMv2 processor classes. So if a user starts with this pipeline, and then wants to dig one level deeper to the processor, they'll have a familiar pattern.
Let me know your thoughts. In certain options (namely the first), I think it'd be a requirement for it to be a DocumentQuestionAnsweringPipeline since the required inputs are different than the VisualQuestionAnsweringPipeline. In options 2 or 3, that might not be the case. I don't have a strong opinion about this but just wanted to clarify my understanding/thinking.
Ok, thanks for all the explanation !
Now I think I am starting to understand it and all the use cases you displayed really make sense !
I think we can ignore layoutlmv1 not requiring the image so we can keep the number of cases rather small. (If you really know what you're doing you could always send an empty image, or we could just make the code in such a way that sending None doesn't break anything without actively trying to sanitize it)
Since the OCR is indeed quite costly (or can come from a non image !) I can really understand why we would need those optional boxes and texts. So let's support them. (We can make the docs extremely clear on that front)
I think example 1 should really be the focus for newcomers, and we need to be able to support examples 2 and 3 to be usable in prod.
And if a user sends boxes + texts then we can simply skip the OCR part.
Actually, wdyt about having a list of tuples instead of two lists? Two lists allow different-sized lists, which would silently break things. I usually tend to prefer arguments that cannot, by design, be inconsistent; a list of tuples cannot have mismatched lengths and will necessarily raise an error when a tuple is unpacked, so there's less room for error.
I think all 3 examples could become tests so that we make sure that those cases are maintained through time.
I will ping @NielsRogge, who is also really involved in vision and might have other insights.
Awesome, I really appreciate you taking the time to dig into this with me. I'll sketch this all out as a next step. And I agree that we can leave the empty image (or None image) as a performance optimization for advanced users. The one thing we'll need to be careful of is that the LayoutLMv1 model gets upset if you do pass in the image (i.e. it's optional for v2/v3 but not for v1 -- v1 does not accept images at all). So if the workaround is to pass in an empty image, we'll just need to figure out a way to cleverly avoid handing it to the model (e.g. implement a no-op feature extractor that takes the image as input and returns an empty dict).
With all of this context in mind, do you have a preference for whether we extend the existing VisualQuestionAnsweringPipeline or isolate this logic into a DocumentQuestionAnsweringPipeline? I'm okay with either, although I am leaning a bit towards the latter so that we can be very clear with the examples/documentation about the use cases (and not muddy the waters with the VisualQuestionAnsweringPipeline which operates directly on the image each time). But of course, I'm open either way.
Actually, wdyt about having a list of tuples instead of two lists? Two lists allow different-sized lists, which would silently break things. I usually tend to prefer arguments that cannot, by design, be inconsistent; a list of tuples cannot have mismatched lengths and will necessarily raise an error when a tuple is unpacked, so there's less room for error.
I have no concerns with this. The runtime "perf hit" of converting one format to the other is trivial compared to the other operations involved. I think it's a smart way to prevent an accidental length mismatch.
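Just to make the shape concrete, a tiny sketch of the conversion we're talking about (the variable names here are hypothetical):

```python
# Two parallel lists vs. a single list of (word, box) tuples: the conversion is trivial,
# and the tuple form cannot end up with mismatched lengths by construction.
words = ["Total:", "$10.00"]
boxes = [[72, 100, 130, 118], [140, 100, 210, 118]]

word_boxes = list(zip(words, boxes))            # [("Total:", [72, 100, 130, 118]), ...]
unzipped_words, unzipped_boxes = map(list, zip(*word_boxes))
assert unzipped_words == words and unzipped_boxes == boxes
```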
I think all 3 examples could become tests so that we make sure that those cases are maintained through time.
Great point. I'm happy to contribute these.
With all of this context in mind, do you have a preference for whether we extend the existing VisualQuestionAnsweringPipeline or isolate this logic into a DocumentQuestionAnsweringPipeline? I'm okay with either,
Go with DocumentQuestionAnsweringPipeline for now then. In general we try to avoid adding pipelines when we can and when the set of input/output is the same as it makes discoverability and usability on hf.co easier/more consistent.
But you made great points explaining core differences (especially the pdf example), IMO. If other members have different opinions, we might revisit this later (we would do the lifting, and since we're committed to zero breaking changes, you would still be able to use your code regardless of internal decisions).
Okay great, as a next step I'll rework this PR to sketch out DocumentQuestionAnsweringPipeline and address some of your comments on the original change (but may not do all in the first pass, just to optimize for getting feedback sooner). Thanks again for the back and forth and look forward to the next iteration!
I just pushed an update that moves the logic into a new DocumentQuestionAnsweringPipeline. I still need to do a few major things:
- Integrate OCR
- Figure out padding (specifically -- using "return_tensors" basically requires padding, so I could either enforce it or do the unsqueeze trick used in the QA pipeline)
- Integrate the post-processing from the QA pipeline.
I did some sanity testing with a model we've trained and can confirm that it is starting to work! I think we're headed in the right direction.
Figure out padding (specifically -- using "return_tensors" basically requires padding, so I could either enforce it or do the unsqueeze trick used in the qa pipeline)
Not sure I understand, in the pipelines the padding should be done by the pipeline itself, not by preprocess (it just allows for more flexible control over how things are executed). preprocess only processes one input at a time, so padding shouldn't be necessary (it might be something you can activate, like truncating, but I don't think it should be the default)
Not sure I understand, in the pipelines the padding should be done by the pipeline itself, not by preprocess (it just allows for more flexible control over how things are executed). preprocess only processes one input at a time, so padding shouldn't be necessary (it might be something you can activate, like truncating, but I don't think it should be the default)
If I'm understanding the QA pipeline code correctly, the reason padding is relevant is that if you stride a document (e.g. one with more than 512 words), then one item that you preprocess might result in multiple inputs to the model that get concatenated together in one big tensor. The question answering pipeline has to solve for this too, and it seems to do that by (a) not returning tensors from tokenize(), and then (b) while constructing the final output, using tensor.unsqueeze(0) (here) to effectively pad each element to the same size.
I'm happy to do it that way if you prefer -- my working assumption was that the "padding" argument to the tokenizer accomplishes the same thing (but I certainly may be missing some interesting implementation detail).
If I'm understanding the QA pipeline code correctly, the reason padding is relevant is that if you stride a document (e.g. one with more than 512 words), then one item that you preprocess might result in multiple inputs to the model that get concatenated together in one big tensor. The question answering pipeline has to solve for this too, and it seems to do that by (a) not returning tensors from tokenize(), and then (b) while constructing the final output, using tensor.unsqueeze(0) (here) to effectively pad each element to the same size.
Ok, this is what I alluded to: QA solves this by using return_overflowing_tokens (and the padding is set to do_not_pad by default).
https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/question_answering.py#L283
QA solves this by using ChunkPipeline.
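For reference, a minimal sketch of what that looks like at the tokenizer level (the checkpoint here is just the default extractive-QA model, used purely for illustration):

```python
# Hedged sketch: a fast tokenizer can split a long question/context pair into overlapping
# spans via return_overflowing_tokens, without padding by default.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")
encoding = tokenizer(
    "What is the invoice total?",
    "a very long extracted document text " * 200,  # context longer than max_length
    truncation="only_second",
    max_length=128,
    stride=32,
    return_overflowing_tokens=True,
    padding=False,
)
print(len(encoding["input_ids"]))  # number of overlapping spans fed to the model
```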
If you want to tackle this, you're more than welcome to it, but IMO it's going to be easier to do it in two steps, and two separate PRs.
As a first step I would recommend simply not treating padding, and letting sequences that are too long go to the model, which will then crash. It's aligned with the "don't hide anything" policy. Some models can handle long range, most cannot, so not trying to hide that fact is IMO a good thing. We can add auto chunking in a follow-up PR.
As a first step I would recommend simply not treating padding, and letting sequences that are too long go to the model, which will then crash. It's aligned with the "don't hide anything" policy. Some models can handle long range, most cannot, so not trying to hide that fact is IMO a good thing. We can add auto chunking in a follow-up PR.
That plan works for me! I'll provide a new update shortly.
@Narsil I just updated the PR with a few changes that remove the padding/striding stuff (for now) and add some docs. The next steps are to integrate OCR and then refactor/borrow the post-processing code from the QA pipeline. I'll keep working on that but wanted to post an intermediate update in case you had a chance to take a quick look.
@Narsil another thought / question I had while working on the OCR stuff... Currently, both LayoutLMv2 and v3 have a feature extractor which by default applies OCR. By incorporating OCR into the pipeline itself (which I'm doing by just borrowing their code), we essentially take over that functionality. So, a user may have to do something like this:
pipe = pipeline(task="visual-question-answering", model="layoutlmv3-xxx", tokenizer="layoutlmv3-xxx", feature_extractor=AutoFeatureExtractor.from_pretrained("layoutlmv3-xxx", apply_ocr=False))
out = pipe(image=Image.load("id_card.jpg"), question="What is this person's address ?")
# out == [{answer: "24 nerdy street", score:0.8}, {"answer": "John travolta", "score": 0.1}]
Essentially, we'll want to rely on the pipeline's OCR, not the feature extractor's. However as a result, we make the user experience a bit awkward (since they have to provide "apply_ocr" False in one place). I can think of a few solutions to this:
- We could rely on the user providing a feature extractor as input, and then invoke the feature extractor in preprocess(), essentially following the conventions that LayoutLMv2Processor/LayoutLMv3Processor do (call the feature extractor and then the tokenizer). If they provide neither a feature extractor nor words, we can provide a helpful error message that they must provide a feature extractor that returns words. One major downside to this approach is that users of models like LayoutLMv1 will not ever get OCR run for them by the pipeline, but I'm open to implementing a feature extractor for LayoutLMv1 to solve this.
- If they provide a feature extractor, we could try to check whether it'll run OCR (e.g. by checking whether its "apply_ocr" attribute is True). If it will, then we can rely on the feature extractor to provide words/boxes. If not, and they haven't passed in words to the pipeline, then we can run OCR. I think the major downside is depending on a non-standard flag (apply_ocr) in the generic pipeline code. I'm not sure how you all think about this tradeoff -- it may be fine to do. A slight variant of this is to test whether, after running the feature extractor, we have words and boxes available in its output (see the sketch after this list).
- We could just ignore this altogether and let the user be the expert. I.e. if they pass in a feature extractor and have not specified apply_ocr=False, it will run OCR twice (once in the pipeline and once in the feature extractor), which is an unnecessary perf hit, but makes no assumptions about the feature extractor itself.
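Here's a rough sketch of what option 2 could look like inside preprocess() (all names here are hypothetical; this is only to make the control flow concrete, not the code in this PR):

```python
# Hedged sketch of option 2: prefer words/boxes passed by the caller, then anything the
# feature extractor already produced (apply_ocr=True), and only then fall back to an
# explicit OCR callable.
def resolve_words_and_boxes(image, feature_extractor=None, ocr=None, words=None, boxes=None):
    features = {}
    if feature_extractor is not None:
        features = feature_extractor(image, return_tensors="pt")
    if words is None and "words" in features and "boxes" in features:
        # The feature extractor already ran OCR (e.g. apply_ocr=True), reuse its output.
        words, boxes = features["words"], features["boxes"]
    if words is None:
        if ocr is None:
            raise ValueError("No words/boxes were provided and no OCR is available.")
        words, boxes = ocr(image)
    return features, words, boxes
```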
Let me know your thoughts.
@Narsil I think I've implemented all of what we talked about (and apologies in advance if I missed anything). To summarize:
- Padding/truncation are gone. I've left them commented out, because we plan to address them as a follow up (in this or a fast-follow PR), but I'm happy to remove those comments too if you prefer.
- OCR is integrated. Regarding my question just above, I went down the route of option 2, and check whether the feature extractor returned words/boxes before trying to run OCR, which the pipeline natively supports.
- I refactored the tricky postprocessing parts of the QA pipeline into helper functions which I call from the document question answering pipeline.
- I've copied the relevant subsets of the code (including PR #18407) and published it here with some examples. Feel free to play around with it!
As a next step, I'd appreciate a quick review from you on these major points to verify whether we're on the right track. I'd like to add the tests and more documentation next (pending your feedback on if we are in a good place with the interface/overall design). I also have a few questions regarding the tests/docs:
- The existing tests for both question-answering and visual-question-answering use models published on HF. There aren't (currently, that I could find) any reliable doc QA models. I have published one here, but there's a bit of a soft dependency on PR #18407 because the model we're publishing uses LayoutLMv1. You can access the model w/ remote code enabled, but I'm not sure that's advisable for a test in the repo. It'd also be good to have tests that span multiple models (e.g. v1-v3) because there are some differences in their tokenizers.
- Is there any way to use a processor in a pipeline? The reason I ask is that LayoutLMv2 and v3 have some interesting logic encapsulated in their processors (e.g. LayoutLMv2 renames the input to the model from image to pixel_values and v3 to image_features). It'd be great to reuse the logic in those classes within the pipeline. Alternatively, I could just support LayoutLMv1 to start with and we can work on adding support for the other versions in a follow-up PR.
- Should I add docs anywhere other than the code itself (which I assume would show up here)? For example, a place like here, as a tutorial for how document question answering works?
@Narsil gentle nudge in case this slipped from your queue :)
So rather than extending the VQA pipeline, it seems that the design has been updated to create a separate DocumentQuestionAnswering pipeline?
Yes that's correct.
Also, I'd like to note that there's a new model I'm working on called Donut which solved DocVQA in a generative manner. Donut is a generative T5-like model, which simply generates the answer given a question. Would this pipeline be able to support that model as well?
The interface should support it. As input, you provide an image+question (and optional word/box pairs if you've pre-run OCR) and as output you receive an answer + start/end words. For a generative model, I could imagine either omitting the start/end or the pipeline doing its best to find it in the document if it exists.
Code-wise, there may be some refactoring within the pipeline implementation to best support a model like Donut. Very happy to collaborate with you on that.
@NielsRogge congrats on pushing Donut -- I just saw it come through. I've integrated it into the pipeline, and it works! The code gets a bit splintered inside the pipeline which now handles the VisionEncoderDecoderModel case a bit differently. I would definitely appreciate feedback on how the control flow handles both cases (CC @Narsil too). I think one thing that would help is if pipelines could accept processors as input. We could potentially capture some of the LayoutLM-specific tokenization logic into a LayoutLMProcessor (similar to LayoutLMv2Processor), and then simply invoke processor-specific commands for each type of model within the pipeline.
Let me know your thoughts. And feel free to take it for a spin! For example, the following commands work:
from transformers import AutoTokenizer, pipeline
nlp = pipeline('document-question-answering', model='naver-clova-ix/donut-base-finetuned-docvqa', tokenizer=AutoTokenizer.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa"), feature_extractor='naver-clova-ix/donut-base-finetuned-docvqa')
nlp("https://templates.invoicehome.com/invoice-template-us-neat-750px.png", "What is the invoice total?")
Hi @Narsil, thanks for the feedback. I will address your comments. I appreciate your willingness to pull down the code and get your hands dirty. Please let me know if I can help at all with that. I need to quickly rebase and fix one or two bugs which I will do ASAP (I broke a couple things while adding support for Donut).
Let me roll up a couple of high level questions that are open. I would greatly appreciate your feedback on these:
1 - Is it possible to use Processors in pipelines? I think some (but not a whole lot) of the logic for Donut, and a whole lot of the logic for LayoutLMv2-3 is present in their processor class and would need to be duplicated here otherwise. Likewise, we could probably create a processor for LayoutLMv1 and place some of the logic there.
2 - While working some more on this in real-world scenarios, I realized that for models like Donut, which operate only on the images, grouping things together by page is actually really important (so you can pass in one input per page). I think it might be useful to change the input format to either be a list of images, or something like [(image, [(word, box)])], where each tuple has an image and a list of word/boxes. WDYT?
@ankrgyl Here are some tests we can integrate if you're ok with them (feel free to modify; the important part is to have exact values in the asserts everywhere except run_pipeline_test).
https://github.com/huggingface/transformers/pull/18732/commits/2e8b01cd5e65aa64956e0f5e56e29ea8391c3955
1 - Is it possible to use Processors in pipelines? I think some (but not a whole lot) of the logic for Donut, and a whole lot of the logic for LayoutLMv2-3 is present in their processor class and would need to be duplicated here otherwise. Likewise, we could probably create a processor for LayoutLMv1 and place some of the logic there.
In general, processor should be extremely shallow, and the real logic should actually be in feature_extractor. Leveraging it is not only encouraged but extremely welcome, as they can contain model-specific details that the pipeline then doesn't have to care about.
2 - While working some more on this in real-world scenarios, I realized that for models like Donut, which operate only on the images, grouping things together by page is actually really important (so you can pass in one input per page). I think it might be useful to change the input format to either be a list of images, or something like [(image, [(word, box)])], where each tuple has an image and a list of word/boxes. WDYT?
You mean sharing the question throughout the pipeline ? I don't know, I think we should focus on the simple solution first and see later for different use cases. Sending it at every step is not too hard, and the optimizations should be dwarfed compared to other issues that might occur (like feeding the GPU fast enough and image processing). Happy to be proven wrong, I haven't checked (but in my experience the tokenizer is rarely a bottleneck).
Anything list, Dataset or generator should probably be handled by the base class, not by the pipeline directly.
@ankrgyl could you rename the PR to better describe what it does (as it doesn't seem to extend the existing VQA pipeline)?
I'll do a second round of review soon.
In general, processor should be extremely shallow, and the real logic should actually be in feature_extractor. Leveraging it is not only encouraged but extremely welcome, as they can contain model-specific details that the pipeline then doesn't have to care about.
Okay, got it. That makes sense.
2 - While working some more on this in real-world scenarios, I realized that for models like Donut, which operate only on the images, grouping things together by page is actually really important (so you can pass in one input per page). I think it might be useful to change the input format to either be a list of images, or something like [(image, [(word, box)])], where each tuple has an image and a list of word/boxes. WDYT?
You mean sharing the question throughout the pipeline ? I don't know, I think we should focus on the simple solution first and see later for different use cases. Sending it at every step is not too hard, and the optimizations should be dwarfed compared to other issues that might occur (like feeding the GPU fast enough and image processing). Happy to be proven wrong, I haven't checked (but in my experience the tokenizer is rarely a bottleneck).
Anything list, Dataset or generator should probably be handled by the base class, not by the pipeline directly.
No, I'm talking about the case where you're working with a document that has multiple pages. Each page consists of an image and potentially word/box pairs (if OCR is pre-run). In document processing, it's a common request to try to find an answer from more than one page (e.g. find the total from a 2 page invoice). Right now, as constructed, you can only pass one page at a time, since you can pass in at most one image. That means as a user, you'd have to run the pipeline on each page, and then pick the highest confidence answer. Ideally, this logic should live in the pipeline, because the pipeline can have some logic that picks the best answer across pages.
The main reason I'm wondering about it now is that it affects the input shape. For example, if you have a 3 page document, the code could look like:
pages = []
for page in my_pdf.pages():
    pages.append({"image": Image.load(page.image()), "word_boxes": tesseract(page.image())})
pipe(image=pages, question="What is this person's address ?")
I'm ok with addressing this in a follow up too, where we can extend images to also be an array (and expect it to be this shape). I just wanted to flag the scenario sooner than later.
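To make the multi-page scenario concrete, here's a rough sketch of the workaround a user has to write today (the keyword names are hypothetical, matching the shape above):

```python
# Hedged sketch: run the pipeline page by page and keep the highest-scoring answers,
# which is the aggregation logic that could eventually live inside the pipeline itself.
def answer_across_pages(pipe, pages, question, top_k=3):
    candidates = []
    for page in pages:
        candidates.extend(
            pipe(image=page["image"], question=question, word_boxes=page.get("word_boxes"))
        )
    return sorted(candidates, key=lambda answer: answer["score"], reverse=True)[:top_k]
```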
@ankrgyl Here are some tests we can integrate if you're ok with them (feel free to modify; the important part is to have exact values in the asserts everywhere except run_pipeline_test).
Thanks @Narsil. I've incorporated these tests into a new test suite in the change and am working through them. I will work on expanding the tests next. It would be really helpful to land PR #18407 so I can include tests for LayoutLMv1 too.
A couple things came up while I was integrating the tests:
- I think there's something wrong with hf-internal-testing/tiny-random-layoutlmv2. Specifically, if you run the following (w/out any of these changes), you should see an error:
from transformers import AutoModel, AutoProcessor
from PIL import Image
processor = AutoProcessor.from_pretrained("hf-internal-testing/tiny-random-layoutlmv2")
model = AutoModel.from_pretrained("hf-internal-testing/tiny-random-layoutlmv2")
encoding = processor(Image.open('tests/fixtures/tests_samples/COCO/000000039769.png').convert("RGB"), "What is the meaning of life?", return_tensors="pt")
o = model(**encoding)
# ValueError: 'p5' is not in list
However, if you run with microsoft/layoutlmv2-base-uncased instead of hf-internal-testing/tiny-random-layoutlmv2, the above code works. Could there be something incorrectly configured with this model?
- I was able to get the test_large_model_pt_layoutlmv2 model to structurally work; however, the model's outputs are so low confidence that the results are inconsistent from run to run (there are several answers with the minimum possible score). I think it might be worth using a fine-tuned one like tiennvcs/layoutlmv2-base-uncased-finetuned-docvqa with a pinned revision. Is that ok?
ValueError: 'p5' is not in list
The small model is just configured with many fewer layers, and p5 is not expected to be there. (The goal is to have a tiny random model.)
I don't know what the Processor does, but the feature_extractor was working properly if I recall. There still was an error in the test, but further down in the forward pass, because some keys were missing.
Feel free to modify the tiny model as you see fit locally and propose a PR on it (or use a tiny random model you own and we'll port it back into hf-internal-testing).
I did remove some vision layers (including p5); if something is failing I would consider it a bug, but I am not super familiar with this model's internals.
I did remove some vision layers (including p5); if something is failing I would consider it a bug, but I am not super familiar with this model's internals.
Yes it was failing inside of the forward pass, not in the processor. I only used the processor to demonstrate that the repro did not have to do with the new code in the PR (it's not running in the test, either).
I can explore the mini model but am unfortunately not very familiar with that stuff myself, either. I will take a look but may need some help if I get stuck.
@Narsil I think I was able to update the mini model (see PR). I verified locally that with this update + some expected test changes, the test passes.
BTW, I'm working on a space, which illustrates the pipeline. It's currently using a frozen version of the pipeline I have saved in another repo, but once we land this PR and PR #18407 I'll update it. You can play with it here: https://huggingface.co/spaces/impira/docquery.
One cool thing to notice is that with the LayoutLM model, the app reuses the OCR'd tokens across questions, so it is very fast to ask multiple questions. Here is a quick video that demonstrates: https://www.loom.com/share/61a09fb6364142c3b85dd880d8a36890.
Awesome, let me clean everything up, and update the tests as well. I agree, there's a lot more to do, and I'll keep the momentum going in follow up PRs.
@NielsRogge @Narsil @sgugger @patrickvonplaten I think we are pretty much there -- I just updated to address the outstanding comments, and the failing tests look like they are flaky (unrelated to this change). Let me know if there's anything else needed on this PR.
Thanks @sgugger. Should I open one issue with all of them or a separate issue for each TODO? Happy to do that.
I think one issue with the list of TODOs is fine.
Here is the task with follow ups: https://github.com/huggingface/transformers/issues/18926.
Thanks again for your contribution!