
How to get the bounding boxes of the extracted entities?

WeiquanWa opened this issue 3 years ago

It would be great if donut had the ability to extract the bounding box of each extracted entity. Bounding box information is important and useful for visualization and downstream tasks.

WeiquanWa avatar Aug 09 '22 10:08 WeiquanWa

As far as I know, there aren't actually any bounding boxes. The image is encoded into features, but not into actual boxes.

If you need boxes, you are better off using traditional OCR + modelling (layoutlmv2/3 are great options for this approach)
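A minimal sketch of that OCR + layout-model route, assuming the Hugging Face transformers implementation of LayoutLMv3 with its default built-in Tesseract OCR; the checkpoint, image path, and label count below are placeholders, not something prescribed in this thread:

```python
# Sketch of the OCR + layout-model alternative: the LayoutLMv3 processor runs
# Tesseract OCR by default (apply_ocr=True), so word-level bounding boxes come
# back in the encoding alongside the token ids.
from PIL import Image
from transformers import AutoProcessor, LayoutLMv3ForTokenClassification

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base")  # requires Tesseract installed
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=5  # num_labels depends on your entity schema
)

image = Image.open("document.png").convert("RGB")   # placeholder path
encoding = processor(image, return_tensors="pt")    # input_ids, attention_mask, bbox, pixel_values

outputs = model(**encoding)
labels = outputs.logits.argmax(-1).squeeze(0)       # one predicted label per token
boxes = encoding["bbox"].squeeze(0)                 # per-token boxes on a 0-1000 normalized scale
for label_id, box in zip(labels.tolist(), boxes.tolist()):
    print(label_id, box)                            # box is [x0, y0, x1, y1]
```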

logan-markewich avatar Aug 14 '22 18:08 logan-markewich

Hi, thanks to @logan-markewich for the helpful comment :)

donut does not require any bounding box annotation/supervision during model training, but, as a result, there are no actual boxes in the model output. Instead, you can get an attention heatmap that could be used for your purpose; see also Figure 8 of https://arxiv.org/abs/2111.15664. The related code line is at:

  • https://github.com/clovaai/donut/blob/1.0.5/donut/model.py#L492
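A rough sketch of one way to obtain those cross-attentions, using the Hugging Face port of Donut rather than the code linked above (the CORD-v2 checkpoint and task prompt are only examples, and the exact generate arguments should be checked against your transformers version):

```python
# Sketch: ask generate() to return the decoder-to-encoder cross-attentions
# alongside the generated token ids.
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

ckpt = "naver-clova-ix/donut-base-finetuned-cord-v2"    # example checkpoint
processor = DonutProcessor.from_pretrained(ckpt)
model = VisionEncoderDecoderModel.from_pretrained(ckpt)
model.eval()

image = Image.open("receipt.png").convert("RGB")        # placeholder path
pixel_values = processor(image, return_tensors="pt").pixel_values

task_prompt = "<s_cord-v2>"                             # prompt for this checkpoint
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

with torch.no_grad():
    decoder_output = model.generate(
        pixel_values,
        decoder_input_ids=decoder_input_ids,
        max_length=model.decoder.config.max_position_embeddings,
        return_dict_in_generate=True,
        output_attentions=True,   # populates decoder_output.cross_attentions
    )

# cross_attentions: one tuple per generated token, each holding one tensor per
# decoder layer of shape (batch, num_heads, query_len, encoder_seq_len)
print(len(decoder_output.cross_attentions), decoder_output.sequences.shape)
```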

You may convert the heatmap to bounding boxes. The following link, and the sketch below it, might be useful to you:

  • https://stackoverflow.com/a/58421765
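Along the lines of that answer, a minimal sketch, assuming a heatmap already resized to the image resolution: threshold it, find connected regions with OpenCV, and take their bounding rectangles. The threshold value is something you would need to tune.

```python
# Sketch: threshold a per-token heatmap and extract bounding rectangles of the
# connected high-attention regions.
import cv2
import numpy as np

def heatmap_to_boxes(heatmap: np.ndarray, threshold: float = 0.5):
    """heatmap: 2D float array already resized to the original image size."""
    norm = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)
    mask = (norm >= threshold).astype(np.uint8)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours]   # each box is (x, y, w, h)
```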

Hope this helps. Please let me know if you are still confused.

gwkrsrch avatar Aug 17 '22 05:08 gwkrsrch

@WeiquanWa , did you manage to get some semblance of bounding boxes or the cross-attention heatmap from the outputs?

I cannot interpret the structure of the output attention maps from "cross_attentions": decoder_output.cross_attentions.

I see it is a tuple of tuples, with the outer length equal to the number of tokens (len(decoder_output.sequences)), but there are 4 sub-tuples inside, each of shape torch.Size([1, 16, 1, 1200]). I'm not sure how to get representative heatmaps from these tensors.
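For what it's worth, a hedged sketch of one way to read that structure: treat the 4 sub-tuples as decoder layers and the 1200 encoder positions as the flattened Swin patch grid (e.g. 40 x 30 for a 1280 x 960 input; the exact grid depends on the configured input_size), then average over layers and heads to get one weight per patch for each generated token.

```python
# Sketch: collapse layers and heads into one attention weight per encoder patch
# for every generated token, then reshape to the (assumed) encoder patch grid.
import torch

def token_heatmaps(cross_attentions, grid_hw=(40, 30)):
    """cross_attentions: tuple (per generated token) of tuples (per decoder layer)
    of tensors shaped (batch, num_heads, query_len, encoder_seq_len)."""
    heatmaps = []
    for step in cross_attentions:                    # one entry per generated token
        layers = torch.stack(step, dim=0)            # (layers, batch, heads, q, enc_seq)
        weights = layers.mean(dim=(0, 2))            # avg over layers and heads -> (batch, q, enc_seq)
        weights = weights[0, -1]                     # batch 0, last query position -> (enc_seq,)
        heatmaps.append(weights.reshape(grid_hw))    # e.g. a (40, 30) patch grid
    return heatmaps
```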

SamSamhuns avatar Sep 01 '22 13:09 SamSamhuns

I second this question. I assume these attention maps have to be translated back to the input space (before encoding) to match the actual image shape, but I can't seem to figure out how to do this.
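In case it helps, one way to do that translation, sketched under the assumption that the heatmap has already been reshaped to the encoder's patch grid and that the grid/input sizes below (40 x 30 and 1280 x 960) match the configured input_size: bilinearly upsample the grid back to the resized model input. Any resizing or padding applied during preprocessing would still have to be undone to reach the original document resolution.

```python
# Sketch: upsample a patch-grid heatmap back to the (resized) model input so it
# can be overlaid on the document image.
import torch
import torch.nn.functional as F

def upsample_heatmap(heatmap: torch.Tensor, image_hw=(1280, 960)):
    """heatmap: (grid_h, grid_w) attention weights for one generated token."""
    grid = heatmap[None, None].float()                    # (1, 1, grid_h, grid_w)
    full = F.interpolate(grid, size=image_hw, mode="bilinear", align_corners=False)
    return full[0, 0]                                     # (image_h, image_w)
```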

As suggested in #31, I think Donut would benefit greatly from returning bounding boxes, to allow further post-processing and output validation using fuzzy matching against OCR results.

leitouran avatar Sep 05 '22 18:09 leitouran

@gwkrsrch, could you give us the code that was used to generate the heatmap visualization in Figure 8 of the DONUT paper?

SamSamhuns avatar Sep 06 '22 06:09 SamSamhuns

I've found updates at #45

SamSamhuns avatar Sep 09 '22 07:09 SamSamhuns